albertvillanova opened this issue 3 years ago
This dataset consists of three .xlsx files of Tweet IDs. Use of this dataset to rehydrate tweets is solely for non-commercial research purposes and subject to Twitter's terms, including: Twitter Terms of Service, Privacy Policy, Developer Agreement and Policy. It is also a condition of use of the dataset that you provide attribution of the dataset to the Digital Observatory.
source: https://researchdatafinder.qut.edu.au/display/n10613
I think this is OK given these requirements.
Also see in the catalogue entry:
primary_license: Yes - the dataset curators have obtained consent from the source material owners
Consolidated all three Excel files into one .csv using:
import pandas as pd

# Read the three Excel files (one sheet each)
t_0 = pd.read_excel('./coronatweetids0.xlsx', sheet_name="Sheet1")
t_1 = pd.read_excel('./coronatweetids1.xlsx', sheet_name="Sheet1")
t_2 = pd.read_excel('./coronatweetids2.xlsx', sheet_name="Sheet1")

# Record which source file each row came from
t_0["file"] = "file_0"
t_1["file"] = "file_1"
t_2["file"] = "file_2"

# Concatenate, keeping each file's original row index as `file_index`
df = pd.concat([t_0, t_1, t_2]).reset_index(drop=False).rename(columns={'index': 'file_index'})
df.to_csv('./dataset.csv')
Will rehydrate Tweets next.
I guess this dataset needs the text content for each tweet ID in the dataset:
I have compressed the data file and checked it loads OK:
{'tweet_id': 1219627299085012992}
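For reference, a minimal loading check along these lines could look like the sketch below. It assumes the compressed file is the ./coronatweetids.csv.gz used in the rehydration snippet further down and that it only keeps the tweet_id column; the actual loading script for the catalogue entry may differ.

from datasets import load_dataset

# Quick sanity check: load the gzipped CSV with the generic `csv` builder
# and inspect the first record
dset = load_dataset("csv", data_files="./coronatweetids.csv.gz", split="train")
print(dset[0])
# e.g. {'tweet_id': 1219627299085012992}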
Yes, I still haven't managed to find time to rehydrate the dataset. I will get to it this weekend.
@albertvillanova I've rehydrated the dataset, but there are two problems. One of them (Problem 2 below) is that retweets come back truncated, for example:
'RT @COFMadrid: Con motivo del 30 aniversario, @FarmaSinFronter acerca el arte solidario a favor del área materno-infantil en la ciudad de T…'
(I'm also not sure about the language distribution of the data.)
We can't really do anything about Problem 1. Problem 2 can be solved with a second pass through the data; let me know if I should retrieve the original tweets. (On second thought, I suspect a lot of them are already part of the corpus and would therefore be duplicates of no interest for a language modeling task.)
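If it helps to decide, here is a rough, untested sketch of what that second pass could look like with tweepy. It assumes the truncated retweets can be spotted by the leading 'RT @' and the trailing '…', and that the merged ./coronatweets.csv has tweet_id and text columns. It also requests the lang tweet field, which is how the language distribution could be checked if the same field were added to the full rehydration run.

import tweepy
import pandas as pd
from collections import Counter

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
df = pd.read_csv('./coronatweets.csv')

# Rows whose text is a truncated retweet ("RT @user: ..." cut off with an ellipsis)
mask = df['text'].str.startswith('RT @', na=False) & df['text'].str.endswith('…', na=False)
truncated_ids = df.loc[mask, 'tweet_id'].tolist()

full_texts = {}           # tweet_id -> full text of the retweeted original
lang_counter = Counter()  # `lang` of the looked-up tweets (same field could be requested in the full run)

for start in range(0, len(truncated_ids), 100):  # the v2 tweets lookup takes at most 100 IDs
    resp = client.get_tweets(
        truncated_ids[start:start + 100],
        tweet_fields=['referenced_tweets', 'lang'],
        expansions=['referenced_tweets.id'],
    )
    # Full tweet objects for the retweeted originals are returned in the includes
    included = {t.id: t.text for t in (resp.includes or {}).get('tweets', [])}
    for tweet in resp.data or []:
        lang_counter[tweet.lang] += 1
        for ref in tweet.referenced_tweets or []:
            if ref.type == 'retweeted' and ref.id in included:
                full_texts[tweet.id] = included[ref.id]

df['full_text'] = df['tweet_id'].map(full_texts).fillna(df['text'])
print(lang_counter.most_common(10))

The retrieved originals would of course still be duplicate candidates with respect to the rest of the corpus, as noted above.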
For reference, in case someone wants to rehydrate a tweets dataset later in the project, this is how I used the Twitter API v2 to do it. Keep in mind that this ran for almost a full day (most of it spent sleeping, as it hit the rate limit 94 times, waiting around 780 seconds each time), so it might not be the best code.
import pickle

import tweepy
import pandas as pd
from itertools import zip_longest, chain

# https://docs.python.org/3/library/itertools.html#itertools-recipes
# Group an iterable into batches of n, padding the last batch with a dummy 19-digit ID
def batcher(iterable, n, fillvalue=19*'1'):
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)

client = tweepy.Client(bearer_token='XXXXXXXXXXXX', wait_on_rate_limit=True)
tweets_df = pd.read_csv('./coronatweetids.csv.gz', compression='gzip')

# The v2 tweets lookup endpoint accepts at most 100 IDs per request
tweet_list = []
for batch in batcher(tweets_df['tweet_id'].tolist(), 100):
    ids = list(batch)
    tweet_list.append(client.get_tweets(ids).data)  # Response.data: the list of Tweet objects

tweets = list(chain.from_iterable(tweet_list))
pickle.dump(tweets, open("tweets.pkl", "wb"))

# Build a DataFrame from the returned tweets and merge the text back onto the IDs
df = pd.DataFrame([dict(t) for t in tweets]).rename(columns={'id': 'tweet_id'})
tweets_df = tweets_df.merge(df, on='tweet_id', how='left')
tweets_df.to_csv('./coronatweets.csv')
CPU times: user 8min 4s, sys: 8.06 s, total: 8min 12s
Wall time: 23h 30min 47s
Thanks @cakiki.
Let's keep this dataset out of the final LM scripts for the moment...