bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset 100_days_of_covid_19_in_the_australian_twittersphere #103

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
cakiki commented 2 years ago

self-assign

cakiki commented 2 years ago

This dataset consists of three .xlsx files of Tweet IDs. Use of this dataset to rehydrate tweets is solely for non-commercial research purposes and subject to Twitter's terms, including: Twitter Terms of Service, Privacy Policy, Developer Agreement and Policy. It is also a condition of use of the dataset that you provide attribution of the dataset to the Digital Observatory.

source: https://researchdatafinder.qut.edu.au/display/n10613

albertvillanova commented 2 years ago

I think it is OK given these requirements.

Also, note this field in the catalogue entry:

primary_license: Yes - the dataset curators have obtained consent from the source material owners

cakiki commented 2 years ago

Consolidated all three Excel files into a single .csv using:

import pandas as pd

# Read the three Tweet-ID spreadsheets
t_0 = pd.read_excel('./coronatweetids0.xlsx', sheet_name="Sheet1")
t_1 = pd.read_excel('./coronatweetids1.xlsx', sheet_name="Sheet1")
t_2 = pd.read_excel('./coronatweetids2.xlsx', sheet_name="Sheet1")
# Keep track of which source file each row came from
t_0["file"] = "file_0"
t_1["file"] = "file_1"
t_2["file"] = "file_2"
# Concatenate and keep the original per-file row index as 'file_index'
df = pd.concat([t_0, t_1, t_2]).reset_index(drop=False).rename(columns={'index': 'file_index'})
df.to_csv('./dataset.csv')

https://huggingface.co/datasets/bigscience-catalogue-data/100_days_of_covid_19_in_the_australian_twittersphere

Will rehydrate Tweets next.

albertvillanova commented 2 years ago

I guess this dataset still needs the text content for each tweet; at the moment each record only contains the tweet ID.

I have compressed the data file and checked that it loads OK:

{'tweet_id': 1219627299085012992}
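
For reference, a minimal sketch of such a loading check, assuming the consolidated CSV was gzip-compressed as ./coronatweetids.csv.gz (the path, compression and column layout are assumptions, not the exact command used):

import pandas as pd

# Load the gzip-compressed Tweet-ID file and inspect the first record
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')
print(tweets_df.iloc[0].to_dict())  # e.g. {'tweet_id': 1219627299085012992, ...}
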
cakiki commented 2 years ago

Yes, I still haven't managed to find time to rehydrate the dataset. I will get to it this weekend.

cakiki commented 2 years ago

@albertvillanova I've rehydrated the dataset but there are two problems:

  1. Half of the tweets (50.74% of the 2,841,125 tweet IDs) can no longer be retrieved, as they were either deleted or their authors went private. (A common problem with rehydration, I've heard.)
  2. A lot of the retrieved tweets are actually retweets (32.71% of the 1,399,387 retrieved tweets) and are therefore truncated, like so:
    'RT @COFMadrid: Con motivo del 30 aniversario, @FarmaSinFronter acerca el arte solidario a favor del área materno-infantil en la ciudad de T…'

    (Not sure about the language distribution of the data either. A rough way to recompute these two figures is sketched right after this list.)
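
A rough sketch of how these two figures can be recomputed from the merged output file; the file name and the 'text' column are assumptions based on the snippet further below, not the exact code I ran:

import pandas as pd

# Rehydrated output: rows whose 'text' is missing could not be retrieved
tweets_df = pd.read_csv('./coronatweets.csv')
missing = tweets_df['text'].isna()
# Truncated retweets all start with the 'RT @' prefix
retweets = tweets_df['text'].str.startswith('RT @', na=False)

print(f"unretrievable: {missing.mean():.2%} of {len(tweets_df):,} tweet IDs")
print(f"retweets: {retweets.sum() / (~missing).sum():.2%} of the retrieved tweets")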

We can't really do anything about Problem 1.

Problem 2 could be solved with a second pass over the data. Let me know if I should retrieve the original tweets. (On second thought, I suspect that a lot of them are already part of the corpus and would therefore be duplicates of no interest for a language modeling task.)

For reference, in case someone wants to rehydrate a tweet dataset later in the project, this is how I used Twitter API v2 to do it. Keep in mind that this ran for almost a full day (most of it spent sleeping, as it hit the rate limit 94 times and waited around 780 seconds each time), so it might not be the best code:

import pickle

import tweepy
import pandas as pd
from itertools import zip_longest, chain

# https://docs.python.org/3/library/itertools.html#itertools-recipes
# Group the IDs into batches of n, padding the last batch with a dummy ID
def batcher(iterable, n, fillvalue=19*'1'):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')

# The v2 tweet lookup endpoint accepts at most 100 IDs per request
tweet_list = []
for batch in batcher(tweets_df['tweet_id'].tolist(), 100):
    ids = list(batch)
    tweet_list.append(client.get_tweets(ids)[0])  # [0] is the data field of the tweepy Response
tweets = list(chain.from_iterable(tweet_list))
pickle.dump(tweets, open("tweets.pkl", "wb"))

# Merge the retrieved fields back onto the original Tweet-ID frame
df = pd.DataFrame([dict(t) for t in tweets]).rename(columns={'id': 'tweet_id'})
tweets_df = tweets_df.merge(df, on='tweet_id', how='left')
tweets_df.to_csv('./coronatweets.csv')

CPU times: user 8min 4s, sys: 8.06 s, total: 8min 12s
Wall time: 23h 30min 47s
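
If we do end up tackling Problem 2, one possible approach would be to ask the lookup endpoint to expand the referenced tweets, so the untruncated original text comes back under includes. This is untested on my side; the batch variable below is a placeholder, and the field/expansion names come from the Twitter API v2 docs:

import tweepy

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)

some_batch_of_ids = [1219627299085012992]  # placeholder; up to 100 IDs per request, as above

# Expanding referenced tweets returns the full original tweets under includes
response = client.get_tweets(
    some_batch_of_ids,
    tweet_fields=["referenced_tweets"],
    expansions=["referenced_tweets.id"],
)
originals = {t.id: t.text for t in response.includes.get("tweets", [])}

The originals dict maps referenced tweet IDs to their full text; each retweet's referenced_tweets field then tells us which original it points at, so the text could be joined back onto the truncated rows.
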
albertvillanova commented 2 years ago

Thanks @cakiki.

Let's keep this dataset out of the final LM scripts for the moment...