Open chunwahchung opened 3 years ago
Everyone will download the dataset, with each person taking different days. Hydrating takes a very long time because Twitter caps the request rate.
Cagri will need to hold a Discord call to show us the process.
@Youssgui see if you can come up with a MapReduce solution for this.
Cagri's expected hydration time is based on 1.5 million English tweets from Dec 21st: 500k of these tweets took 9 hours.
Feel free to note the execution time when you run the script.
I downloaded the compressed file "2020-12-21-dataset.tsv.gz". It contains only the tweet IDs of 1.05 million tweets. It can be decompressed with the script. Downloading and decompressing are both very fast.
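For anyone who wants to inspect the IDs without the script, here is a minimal sketch of reading the compressed file directly in Python. It assumes the tweet ID sits in the first tab-separated column and that any non-numeric first field (such as a header) can be skipped. The function name `read_tweet_ids` is made up for illustration and is not part of the script.

```python
import csv
import gzip
from typing import Iterator


def read_tweet_ids(path: str) -> Iterator[str]:
    """Yield tweet IDs from a gzip-compressed TSV file.

    Assumption: the tweet ID is in the first column. Rows whose first
    field is not purely numeric (for example a header row) are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if row and row[0].strip().isdigit():
                yield row[0].strip()
```

Calling `sum(1 for _ in read_tweet_ids("2020-12-21-dataset.tsv.gz"))` would give a quick count to compare against the 1.05 million figure.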
Before they can be used, those tweet IDs need to be hydrated, which takes a very long time. This is the work we need to share. Tweets are hydrated at about 100 tweets every 5 seconds because of Twitter's request rate limit, which is why it took 9 hours to hydrate 500k tweets last time.
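To plan how to split the work, here is a rough sketch of the arithmetic behind that rate. It only uses the two figures quoted above (100 tweets per request, one request every 5 seconds). The helper names `batches` and `estimated_hours` are hypothetical and do not come from the hydration script.

```python
import math
from typing import Iterator, List, Sequence

BATCH_SIZE = 100        # tweets per request, from the figure quoted above
SECONDS_PER_BATCH = 5   # one request every 5 seconds, from the figure quoted above


def batches(ids: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Split a sequence of tweet IDs into consecutive batches of at most `size`."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def estimated_hours(n_tweets: int,
                    size: int = BATCH_SIZE,
                    seconds_per_batch: float = SECONDS_PER_BATCH) -> float:
    """Lower-bound hydration time in hours at the quoted rate."""
    n_batches = math.ceil(n_tweets / size)
    return n_batches * seconds_per_batch / 3600
```

At this rate, `estimated_hours(500_000)` comes out just under 7 hours, so the 9 hours observed presumably includes overhead beyond the rate limit itself. Treat the estimate as a floor, not a prediction.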
As of today, the dates of interest to us are
These days were selected based on the rate of increase compared to the previous days. More appropriate dates will be found as discussed in Issue #2. Everyone can download a day from the candidate date list above.
Start downloading the dataset. Must be in CSV at least.
Blocked by #1
Acceptance Criteria: