bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create dataset 100_days_of_covid_19_in_the_australian_twittersphere #103

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago
cakiki commented 2 years ago

self-assign

cakiki commented 2 years ago

This dataset consists of three .xlsx files of Tweet IDs. Use of this dataset to rehydrate tweets is solely for non-commercial research purposes and subject to Twitter's terms, including: Twitter Terms of Service, Privacy Policy, Developer Agreement and Policy. It is also a condition of use of the dataset that you provide attribution of the dataset to the Digital Observatory.

source: https://researchdatafinder.qut.edu.au/display/n10613

albertvillanova commented 2 years ago

I think it is OK given these requirements.

Also, note this field in the catalogue entry:

primary_license: Yes - the dataset curators have obtained consent from the source material owners

cakiki commented 2 years ago

Consolidated all three Excel files into a single .csv using:

import pandas as pd

# Read the three Tweet-ID spreadsheets
t_0 = pd.read_excel('./coronatweetids0.xlsx', sheet_name="Sheet1")
t_1 = pd.read_excel('./coronatweetids1.xlsx', sheet_name="Sheet1")
t_2 = pd.read_excel('./coronatweetids2.xlsx', sheet_name="Sheet1")
# Keep track of which source file each row came from
t_0["file"] = "file_0"
t_1["file"] = "file_1"
t_2["file"] = "file_2"
# Concatenate and keep the original per-file row index as 'file_index'
df = pd.concat([t_0, t_1, t_2]).reset_index(drop=False).rename(columns={'index': 'file_index'})
df.to_csv('./dataset.csv')

https://huggingface.co/datasets/bigscience-catalogue-data/100_days_of_covid_19_in_the_australian_twittersphere

Will rehydrate Tweets next.

albertvillanova commented 2 years ago

I guess this dataset still needs the text content for each tweet; at the moment each record only contains the tweet ID.

I have compressed the data file and checked that it loads OK:

{'tweet_id': 1219627299085012992}
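
For reference, a minimal sketch of such a loading check, assuming the consolidated CSV was gzip-compressed as ./coronatweetids.csv.gz (the path, compression and column layout are assumptions, not the exact command used):

import pandas as pd

# Load the gzip-compressed Tweet-ID file and inspect the first record
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')
print(tweets_df.iloc[0].to_dict())  # e.g. {'tweet_id': 1219627299085012992, ...}
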
cakiki commented 2 years ago

Yes, I still haven't managed to find time to rehydrate the dataset. I will get to it this weekend.

cakiki commented 2 years ago

@albertvillanova I've rehydrated the dataset but there are two problems:

  1. Half of the tweets (50.74% of the 2,841,125 tweet IDs) can no longer be retrieved, as they were either deleted or their authors went private. (A common problem with rehydration, I've heard.)
  2. A lot of the retrieved tweets are actually retweets (32.71% of the 1,399,387 retrieved tweets) and are therefore truncated, like so:
    'RT @COFMadrid: Con motivo del 30 aniversario, @FarmaSinFronter acerca el arte solidario a favor del área materno-infantil en la ciudad de T…'

    (Not sure about the language distribution of the data either. A rough way to recompute these two figures is sketched right after this list.)
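
A rough sketch of how these two figures can be recomputed from the merged output file; the file name and the 'text' column are assumptions based on the snippet further below, not the exact code I ran:

import pandas as pd

# Rehydrated output: rows whose 'text' is missing could not be retrieved
tweets_df = pd.read_csv('./coronatweets.csv')
missing = tweets_df['text'].isna()
# Truncated retweets all start with the 'RT @' prefix
retweets = tweets_df['text'].str.startswith('RT @', na=False)

print(f"unretrievable: {missing.mean():.2%} of {len(tweets_df):,} tweet IDs")
print(f"retweets: {retweets.sum() / (~missing).sum():.2%} of the retrieved tweets")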

We can't really do anything about Problem 1.

Problem 2 could be solved with a second pass over the data. Let me know if I should retrieve the original tweets. (On second thought, I suspect that a lot of them are already part of the corpus and would therefore be duplicates of no interest for a language modeling task.)

For reference, in case someone wants to rehydrate a tweet dataset later in the project, this is how I used Twitter API v2 to do it. Keep in mind that this ran for almost a full day (most of it spent sleeping, as it hit the rate limit 94 times and waited around 780 seconds each time), so it might not be the best code:

import pickle

import tweepy
import pandas as pd
from itertools import zip_longest, chain

# https://docs.python.org/3/library/itertools.html#itertools-recipes
# Group the IDs into batches of n, padding the last batch with a dummy ID
def batcher(iterable, n, fillvalue=19*'1'):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')

# The v2 tweet lookup endpoint accepts at most 100 IDs per request
tweet_list = []
for batch in batcher(tweets_df['tweet_id'].tolist(), 100):
    ids = list(batch)
    tweet_list.append(client.get_tweets(ids)[0])  # [0] is the data field of the tweepy Response
tweets = list(chain.from_iterable(tweet_list))
pickle.dump(tweets, open("tweets.pkl", "wb"))

# Merge the retrieved fields back onto the original Tweet-ID frame
df = pd.DataFrame([dict(t) for t in tweets]).rename(columns={'id': 'tweet_id'})
tweets_df = tweets_df.merge(df, on='tweet_id', how='left')
tweets_df.to_csv('./coronatweets.csv')

CPU times: user 8min 4s, sys: 8.06 s, total: 8min 12s
Wall time: 23h 30min 47s
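
If we do end up tackling Problem 2, one possible approach would be to ask the lookup endpoint to expand the referenced tweets, so the untruncated original text comes back under includes. This is untested on my side; the batch variable below is a placeholder, and the field/expansion names come from the Twitter API v2 docs:

import tweepy

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)

some_batch_of_ids = [1219627299085012992]  # placeholder; up to 100 IDs per request, as above

# Expanding referenced tweets returns the full original tweets under includes
response = client.get_tweets(
    some_batch_of_ids,
    tweet_fields=["referenced_tweets"],
    expansions=["referenced_tweets.id"],
)
originals = {t.id: t.text for t in response.includes.get("tweets", [])}

The originals dict maps referenced tweet IDs to their full text; each retweet's referenced_tweets field then tells us which original it points at, so the text could be joined back onto the truncated rows.
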
albertvillanova commented 2 years ago

Thanks @cakiki.

Let's keep this dataset out of the final LM scripts for the moment...