download dataset based on important date - Githubissues

cagri32 / Analyzing-the-Extent-of-Polarization-around-COVID-19-Policies-using-Social-Media

In this project, we wish to explore the extent of polarization on Twitter. Our goal is to identify the ideological communities corresponding to the pro- and anti-vaccine movements, characterize the nature of their interactions, derive an approximate score for the extent of polarization, and determine whether each of these communities shares a common set of interests.


download dataset based on important date #3

Open chunwahchung opened 3 years ago

chunwahchung commented 3 years ago

Start downloading the dataset. It must be available in CSV at least.

Blocked by #1

Acceptance Criteria:

[ ] CSV of the dataset for each important date
[ ] documentation (markdown file) for the download process
[ ] documentation for changing the fields to include
[ ] when downloading & hydrating the dataset, state which date and which fields are included
[ ] upload the hydrated tweets to GitHub under the /data directory
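
One way the CSV and field-selection criteria could be met is sketched below. It assumes the hydrated tweets are stored as one JSON object per line, which this issue does not confirm. The function names, `DEFAULT_FIELDS`, and the field names in it are hypothetical placeholders, not the project's actual script.

```python
import csv
import json

# Hypothetical field list; edit it to change which columns are exported.
DEFAULT_FIELDS = ["id", "created_at", "full_text", "lang"]


def hydrated_to_csv(jsonl_path, csv_path, fields=DEFAULT_FIELDS):
    """Write one CSV row per hydrated tweet, keeping only `fields`.

    Assumes one JSON object per line (an assumption, not confirmed here).
    Fields missing from a tweet become empty cells.
    Returns the number of rows written.
    """
    rows = 0
    with open(jsonl_path, encoding="utf-8") as src, \
            open(csv_path, "w", newline="", encoding="utf-8") as dst:
        writer = csv.DictWriter(dst, fieldnames=list(fields))
        writer.writeheader()
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            writer.writerow({name: tweet.get(name, "") for name in fields})
            rows += 1
    return rows


def describe_export(date, fields):
    """Build a one-line note recording the date and the exported fields."""
    return f"date={date} fields={','.join(fields)}"
```

Calling `describe_export("2020-12-21", DEFAULT_FIELDS)` and pasting the result into the markdown documentation would record which date and which fields a given CSV contains.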

chunwahchung commented 3 years ago

Everyone will download part of the dataset, each person taking different days. Hydrating takes a very long time because Twitter has a cap.

chunwahchung commented 3 years ago

Cagri will need to hold a Discord call to show us the process

chunwahchung commented 3 years ago

@Youssgui see if you can come up with a mapreduce solution for this

chunwahchung commented 3 years ago

Cagri's expected time for hydration is based on: 1.5 million English tweets for Dec 21st; 500k of these tweets took 9 hours

chunwahchung commented 3 years ago

Feel free to note the execution time when you run the script

cagri32 commented 3 years ago

I downloaded the compressed file "2020-12-21-dataset.tsv.gz". This archive contains only the tweet IDs of 1.05 million tweets. It can be decompressed with the script. Downloading and decompressing are both very fast.
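
For reference, a `.tsv.gz` file can also be read directly in Python without decompressing it first. This is a minimal sketch, not the script mentioned above. It assumes the file has a header row and an ID column named `tweet_id`; that column name is a guess and should be checked against the real file.

```python
import csv
import gzip


def read_tweet_ids(gz_path, id_column="tweet_id"):
    """Yield tweet IDs (as strings) from a gzipped TSV file with a header row.

    `id_column` is an assumed column name; adjust it to match the file.
    """
    with gzip.open(gz_path, "rt", encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            yield row[id_column]
```

Usage would look like `ids = list(read_tweet_ids("2020-12-21-dataset.tsv.gz"))`. Keeping the IDs as strings avoids any precision loss if they are later passed through tools that treat large integers as floats.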

Before they can be used, those tweet IDs need to be hydrated, which takes a very long time. The hydrated output is what we need to share. Tweets are hydrated at a rate of 100 per 5 seconds because of Twitter's request cap, which is why it took 9 hours to hydrate 500k tweets last time.
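
The rate quoted above can be turned into a rough lower bound on hydration time. The sketch below assumes the cap is exactly one batch of 100 tweets every 5 seconds with no other overhead; `estimate_hydration_hours` is an illustrative helper, not part of the project.

```python
def estimate_hydration_hours(n_tweets, batch_size=100, seconds_per_batch=5.0):
    """Estimate hours needed to hydrate `n_tweets` at a fixed batch rate.

    Assumes one batch of `batch_size` tweets every `seconds_per_batch`
    seconds and ignores network latency, retries, and pauses.
    """
    batches = -(-n_tweets // batch_size)  # ceiling division
    return batches * seconds_per_batch / 3600.0


hours = estimate_hydration_hours(500_000)
print(f"500k tweets: at least {hours:.1f} hours")
```

At the nominal rate, 500k tweets come out at a little under 7 hours, so the 9 hours observed presumably includes overhead beyond the cap itself.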

As of today, the dates of interest to us are:

July 13th, 2020 (3.6M tweets)
Sept 9th, 2020 (3.1M tweets)
Oct 2nd-3rd, 2020 (3.8M tweets)
Dec 21st, 2020 (2.4M tweets; I downloaded the data for this date)

These days were selected based on their rate of increase compared to the previous days. More appropriate dates will be found as discussed in issue #2. Everyone can pick a day to download from the candidate list above.
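
To help decide who takes which day, the observed throughput (500k tweets in 9 hours) can be scaled to each candidate date. This is a rough sketch that assumes the tweet counts listed above are the numbers that actually need hydrating and that throughput stays constant; the date keys and helper name are illustrative.

```python
# Tweet counts per candidate date, taken from the list in this comment.
CANDIDATE_DATES = {
    "2020-07-13": 3_600_000,
    "2020-09-09": 3_100_000,
    "2020-10-02/03": 3_800_000,
    "2020-12-21": 2_400_000,
}

# Observed throughput reported in this thread: 500k tweets in 9 hours.
OBSERVED_TWEETS = 500_000
OBSERVED_HOURS = 9.0


def projected_hours(n_tweets):
    """Scale the observed 9 h per 500k tweets linearly to `n_tweets`."""
    return n_tweets / OBSERVED_TWEETS * OBSERVED_HOURS


for date, count in CANDIDATE_DATES.items():
    print(f"{date}: ~{projected_hours(count):.1f} h")
```

Under these assumptions each date needs roughly 43 to 68 hours of hydration, which supports splitting the days across team members rather than having one person hydrate everything.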