digitalmethodsinitiative / dmi-tcat

Digital Methods Initiative - Twitter Capture and Analysis Toolset
Apache License 2.0
367 stars 114 forks source link

slow download when media is flagged #299

Open lrossi79 opened 6 years ago

lrossi79 commented 6 years ago

Hello,

I've looked around and I can't really find anything about this. The issue is rather strange and I'm wondering if it's configuration issue. I have a reasonably large archive (almost 2M tweets) that I need to export. I need the urls to be resolved but when i flag "media" to have the resolved url in the exported file the download speed is incredibly slow (and that makes impossible to download a 1Gb file). The download speed is normal if "media" is not selected.
I really don't understand this but maybe is some kind of known issue.

Thank you!

dentoir commented 6 years ago

Hi @lrossi79

With download speed, do you mean the time you spend waiting on the analysis panel before the download actually starts, or do you mean the download speed in terms of kb/s ?

lrossi79 commented 6 years ago

no, I mean the actual download speed in terms of kb/s. We looked a little bit into the issue and it seems that the actual download starts while the db is still joining the tables (so before the file is actually fully generated). Does this make any sense?

dentoir commented 6 years ago

No that doesn't make sense, those MySQL queries you see may be unrelated or part of a 'orphaned' query, which happens when you start some analysis, but close the browser tab (the query keeps running in the background). Just to be sure, you could restart MySQL to get rid of these queries and try again. But... the CSV file will only be made available to the user once the script has finished, therefore TCAT should never be of influence on download speed. The CSV is just a static file sitting in the analysis/cache directory which will be served directly by Apache. Are you sure this isn't a situation where you have something (an ISP / hosting provider) throttling your bandwidth after a certain number of megabytes has been downloaded?

lrossi79 commented 6 years ago

Hi, I agree this is an unsual behaviour but the MySQL quesry is the one that is generating the file while the file is being downloaded. If I stop the download the query immediatly stops. Moreover the csv is not made avaialable through a link (as in the other csv fies) but it starts immediatly (and there is no csv file stored in the cache folder). (This is not a ISP issue since I'm downloading the file from the same network of the tcat server and if I do not flag the media option the download speed goes back to normal).

dentoir commented 6 years ago

Hi @lrossi79

You're correct about the export function immediately returning, I forgot, so it makes sense the query is running. As you are monitoring the MySQL server, can you send us the SQL statement with the JOIN which is taking a very long time?

Best,

Emile

xmacex commented 6 years ago

Hi. I'm looking at the same TCAT instance as @lrossi79. Exporting from a bin with 1.2M selected tweets, this I believe is the SQL query without any of the additional mentions, links or media columns selected:

SELECT * FROM the_bin_name_tweets t  WHERE  t.created_at >= '2017-10-05 00:00:00' AND t.created_at <= '2018-02-13 23:59:59'  ORDER BY id

That causes no issues. It takes some seconds to get collected, and the 667MB file downloads at good speed.

A problematic export, with the media columns included, has the same query:

SELECT * FROM the_bin_name_tweets t WHERE  t.created_at >= '2017-10-05 00:00:00' AND t.created_at <= '2018-02-13 23:59:59'  ORDER BY id

But it is grinding a very large number of these, at rate of approximately 10 per second.

SELECT * FROM the_bin_name_media WHERE tweet_id = 916039261803118593
SELECT * FROM the_bin_name_media WHERE tweet_id = 916039373946261504
SELECT * FROM the_bin_name_media WHERE tweet_id = 916039385988124673
⋮

These are generated by analysis/mode.export_tweets.php line 90. TCAT does not ask the RDBMS to join the _tweets and _media tables, but instead queries an ordered lists of tweets to export, and if so requested by the user, queries the database for the additional fields one tweet at a time.

I cancelled the query after some thousands. The _media table in question contains ~465000 items, so I estimate it would have completed in 465000 / 10 / 60 / 60 = 13 hours, assuming that steady rate. I didn't do any regression modelling etc. to investigate randomness, or reason about how _tweet ordered by id and unordered _media might correlate given monotonically increasing primary keys etc.

For comparison, the following, conceptually similar SQL query outside of TCAT did not complete in 2.5 hours on our setup.

SELECT *
FROM the_bin_name_tweets t
  LEFT JOIN the_bin_name_media m ON m.tweet_id = t.id;

For the one bin I was investigating, we didn't have an index for the column tweet_id in the _media table. Interestingly, and for reasons I don't fully understand, an inner join, instead of LEFT JOIN completed in about 15 seconds.

So I investigated schema and indices, and added an index with CREATE INDEX tweet_id ON the_bin_name_media (tweet_id);. Now the LEFT JOIN runs in 30 seconds (prev. > or ≫ 2.5 hours), with the following profile

Status Duration
starting 0.000043
checking permissions 0.000004
checking permissions 0.000003
Opening tables 0.000019
System lock 0.000009
init 0.000032
optimizing 0.000004
statistics 0.000016
preparing 0.000010
executing 0.000003
Sending data 7.202549
end 0.000025
query end 0.000004
closing tables 0.000010
freeing items 0.000016
logging slow query 0.000003
logging slow query 0.000035
cleaning up 0.000038

This I believe solved, or at least considerably improved the issue with TCAT exports that @lrossi79 reports. Please observe that each of the bins need to be re-indexed (let's talk about it). I am currently on a slow network but download speed from TCAT export with media improved at least three orders of magnitude after adding the index for _media.tweet_id. For selecting the media fields, at least. I will investigate further, and it should be checked that relevant indexes are put in place for mentions, hashtags and media when bins are created.

Please note all my reports of times are just from naïve runs, not using Monte Carlo methods for estimates, and without controlling other processes competing for the computing resources.

dentoir commented 6 years ago

Hi @xmacex

Thank you for the excellent analysis! I'll study this shortly. As for adding indexes, the procedure should go via common/upgrade.php which allows for pretty fine-tuned upgrade mechanics. Reindexing the media table will at least take less time than reindexing the tweets tables.

One question, did you benchmark the original mod_export script (without the JOIN syntax) after you've added the index?

xmacex commented 6 years ago

Hi @dentoir. Without the index, the queries of this particular TCAT instance, for this particular bin with ~1.2M items and ~500000 media items the mod.export_tweets.php with the media fields selected ran at ~10 queries per second. Now, with the the_bin_name_media.tweet_id index in place, it runs at approximately 5500 per queries second.

Again the same caveat, I'm doing this without a whole lot of scientific measuring or fancy PHP profiling – I am basically just setting general_log on for MySQL, doing the export, and then parsing the logfile with grep and awk, after I have identified the query ID for the export.