Open albertvillanova opened 2 years ago
I think there is no repo for this dataset, am I right @cakiki?
@albertvillanova I'd forgotten to push this one; I still haven't rehydrated the tweets. (The dataset is only tweet IDs for now; I'll run the script later today.)
https://huggingface.co/datasets/bigscience-catalogue-data/qadi
Note: the test set consists of actual tweets, one per line, each followed by an ISO 3166-1 alpha-2 country code.
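For reference, a minimal sketch of reading one test-set line in that format. It assumes the country code is the last whitespace-separated token on the line (the actual delimiter isn't stated here), and `parse_test_line` is a hypothetical helper name, not part of any existing tooling.

```python
import re

# ISO 3166-1 alpha-2 codes are exactly two letters (e.g. "EG", "SA").
_CODE_RE = re.compile(r"^[A-Za-z]{2}$")


def parse_test_line(line):
    """Split one test-set line into (tweet_text, country_code).

    Assumption: the code is the last whitespace-separated token on the line.
    """
    parts = line.strip().rsplit(None, 1)
    if len(parts) != 2 or not _CODE_RE.match(parts[1]):
        raise ValueError(f"no trailing country code in line: {line!r}")
    text, code = parts
    return text, code.upper()
```

Lines without a trailing two-letter token raise `ValueError`, so malformed rows surface early instead of being silently mislabeled.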
I think I will flatten the train folder structure and include the country code as part of each record.
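A rough sketch of what that flattening could look like. It assumes the train folder has one subdirectory per country code containing `.txt` files with one entry per line; that layout, the `flatten_train` name, and the `text`/`country` field names are assumptions for illustration, not the actual structure or schema.

```python
import json
from pathlib import Path


def flatten_train(train_dir, out_path):
    """Write one JSON record per line, each carrying its country code.

    Assumed layout (hypothetical): train_dir/<COUNTRY_CODE>/*.txt,
    with one entry (tweet ID or tweet text) per line.
    Returns the number of records written.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for country_dir in sorted(Path(train_dir).iterdir()):
            if not country_dir.is_dir():
                continue
            for txt in sorted(country_dir.glob("*.txt")):
                with open(txt, encoding="utf-8") as f:
                    for line in f:
                        line = line.strip()
                        if not line:
                            continue  # skip blank lines
                        record = {"text": line, "country": country_dir.name}
                        out.write(json.dumps(record, ensure_ascii=False) + "\n")
                        count += 1
    return count
```

Writing JSON Lines keeps every record self-describing, so the country no longer has to be inferred from the file path.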
@albertvillanova Done; only 85% of the tweet IDs were still useful.
Rehydrated using the code from https://github.com/bigscience-workshop/data_tooling/issues/103#issuecomment-1019348527
Source: Masader Project