Open lrossi79 opened 6 years ago
Hi @lrossi79
With download speed, do you mean the time you spend waiting on the analysis panel before the download actually starts, or do you mean the download speed in terms of kb/s ?
no, I mean the actual download speed in terms of kb/s. We looked a little bit into the issue and it seems that the actual download starts while the db is still joining the tables (so before the file is actually fully generated). Does this make any sense?
No that doesn't make sense, those MySQL queries you see may be unrelated or part of a 'orphaned' query, which happens when you start some analysis, but close the browser tab (the query keeps running in the background). Just to be sure, you could restart MySQL to get rid of these queries and try again. But... the CSV file will only be made available to the user once the script has finished, therefore TCAT should never be of influence on download speed. The CSV is just a static file sitting in the analysis/cache directory which will be served directly by Apache. Are you sure this isn't a situation where you have something (an ISP / hosting provider) throttling your bandwidth after a certain number of megabytes has been downloaded?
Hi, I agree this is an unsual behaviour but the MySQL quesry is the one that is generating the file while the file is being downloaded. If I stop the download the query immediatly stops. Moreover the csv is not made avaialable through a link (as in the other csv fies) but it starts immediatly (and there is no csv file stored in the cache folder). (This is not a ISP issue since I'm downloading the file from the same network of the tcat server and if I do not flag the media option the download speed goes back to normal).
Hi @lrossi79
You're correct about the export function immediately returning, I forgot, so it makes sense the query is running. As you are monitoring the MySQL server, can you send us the SQL statement with the JOIN which is taking a very long time?
Best,
Emile
Hi. I'm looking at the same TCAT instance as @lrossi79. Exporting from a bin with 1.2M selected tweets, this I believe is the SQL query without any of the additional mentions, links or media columns selected:
SELECT * FROM the_bin_name_tweets t WHERE t.created_at >= '2017-10-05 00:00:00' AND t.created_at <= '2018-02-13 23:59:59' ORDER BY id
That causes no issues. It takes some seconds to get collected, and the 667MB file downloads at good speed.
A problematic export, with the media columns included, has the same query:
SELECT * FROM the_bin_name_tweets t WHERE t.created_at >= '2017-10-05 00:00:00' AND t.created_at <= '2018-02-13 23:59:59' ORDER BY id
But it is grinding a very large number of these, at rate of approximately 10 per second.
SELECT * FROM the_bin_name_media WHERE tweet_id = 916039261803118593
SELECT * FROM the_bin_name_media WHERE tweet_id = 916039373946261504
SELECT * FROM the_bin_name_media WHERE tweet_id = 916039385988124673
⋮
These are generated by analysis/mode.export_tweets.php
line 90. TCAT does not ask the RDBMS to join the _tweets
and _media
tables, but instead queries an ordered lists of tweets to export, and if so requested by the user, queries the database for the additional fields one tweet at a time.
I cancelled the query after some thousands. The _media
table in question contains ~465000 items, so I estimate it would have completed in 465000 / 10 / 60 / 60 = 13 hours, assuming that steady rate. I didn't do any regression modelling etc. to investigate randomness, or reason about how _tweet
ordered by id
and unordered _media
might correlate given monotonically increasing primary keys etc.
For comparison, the following, conceptually similar SQL query outside of TCAT did not complete in 2.5 hours on our setup.
SELECT *
FROM the_bin_name_tweets t
LEFT JOIN the_bin_name_media m ON m.tweet_id = t.id;
For the one bin I was investigating, we didn't have an index for the column tweet_id
in the _media
table. Interestingly, and for reasons I don't fully understand, an inner join, instead of LEFT JOIN completed in about 15 seconds.
So I investigated schema and indices, and added an index with CREATE INDEX tweet_id ON the_bin_name_media (tweet_id);
. Now the LEFT JOIN runs in 30 seconds (prev. > or ≫ 2.5 hours), with the following profile
Status | Duration |
---|---|
starting | 0.000043 |
checking permissions | 0.000004 |
checking permissions | 0.000003 |
Opening tables | 0.000019 |
System lock | 0.000009 |
init | 0.000032 |
optimizing | 0.000004 |
statistics | 0.000016 |
preparing | 0.000010 |
executing | 0.000003 |
Sending data | 7.202549 |
end | 0.000025 |
query end | 0.000004 |
closing tables | 0.000010 |
freeing items | 0.000016 |
logging slow query | 0.000003 |
logging slow query | 0.000035 |
cleaning up | 0.000038 |
This I believe solved, or at least considerably improved the issue with TCAT exports that @lrossi79 reports. Please observe that each of the bins need to be re-indexed (let's talk about it). I am currently on a slow network but download speed from TCAT export with media improved at least three orders of magnitude after adding the index for _media.tweet_id
. For selecting the media fields, at least. I will investigate further, and it should be checked that relevant indexes are put in place for mentions, hashtags and media when bins are created.
Please note all my reports of times are just from naïve runs, not using Monte Carlo methods for estimates, and without controlling other processes competing for the computing resources.
Hi @xmacex
Thank you for the excellent analysis! I'll study this shortly. As for adding indexes, the procedure should go via common/upgrade.php
which allows for pretty fine-tuned upgrade mechanics. Reindexing the media table will at least take less time than reindexing the tweets tables.
One question, did you benchmark the original mod_export script (without the JOIN syntax) after you've added the index?
Hi @dentoir. Without the index, the queries of this particular TCAT instance, for this particular bin with ~1.2M items and ~500000 media items the mod.export_tweets.php
with the media fields selected ran at ~10 queries per second. Now, with the the_bin_name_media.tweet_id
index in place, it runs at approximately 5500 per queries second.
Again the same caveat, I'm doing this without a whole lot of scientific measuring or fancy PHP profiling – I am basically just setting general_log
on for MySQL, doing the export, and then parsing the logfile with grep and awk, after I have identified the query ID for the export.
Hello,
I've looked around and I can't really find anything about this. The issue is rather strange and I'm wondering if it's configuration issue. I have a reasonably large archive (almost 2M tweets) that I need to export. I need the urls to be resolved but when i flag "media" to have the resolved url in the exported file the download speed is incredibly slow (and that makes impossible to download a 1Gb file). The download speed is normal if "media" is not selected.
I really don't understand this but maybe is some kind of known issue.
Thank you!