bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0
77 stars 48 forks

Create dataset QADI Arabic #284

Open albertvillanova opened 2 years ago

albertvillanova commented 2 years ago

Source: Masader Project

cakiki commented 2 years ago

self-assign

albertvillanova commented 2 years ago

I think there is no repo for this dataset yet, am I right @cakiki?

cakiki commented 2 years ago

@albertvillanova I'd forgotten to push this one, and I still haven't rehydrated the tweets. (The dataset is only tweet IDs for now; I'll run the script later today.)

https://huggingface.co/datasets/bigscience-catalogue-data/qadi

cakiki commented 2 years ago

Note: the test set consists of actual tweets, one per line, each followed by an ISO 3166-1 alpha-2 country code.

I think I will flatten the train folder structure and include the country code as part of each record.
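
A minimal sketch of that flattening, under assumptions that are not confirmed in this thread: the train folder is assumed to hold one `.txt` file per country, named after its country code, with one tweet ID (or tweet) per line. The function names `flatten_train_dir` and `write_jsonl`, the record keys `text` and `country`, and the JSON Lines output are all hypothetical choices for illustration.

```python
import json
from pathlib import Path
from typing import Iterable, Iterator


def flatten_train_dir(train_dir: Path) -> Iterator[dict]:
    """Yield one record per non-empty line under train_dir.

    Assumption: each file is named after its country code (e.g. EG.txt),
    so the code is taken from the file stem and upper-cased.
    """
    for path in sorted(train_dir.rglob("*.txt")):
        country = path.stem.upper()
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                text = line.strip()
                if text:  # skip blank lines
                    yield {"text": text, "country": country}


def write_jsonl(records: Iterable[dict], out_path: Path) -> int:
    """Write records as JSON Lines and return how many were written."""
    count = 0
    with out_path.open("w", encoding="utf-8") as out:
        for record in records:
            # ensure_ascii=False keeps Arabic text readable in the output file
            out.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Calling `write_jsonl(flatten_train_dir(Path("train")), Path("train.jsonl"))` would then produce a single flat file in which every record carries its own country code; the `train` path here is a placeholder.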

cakiki commented 2 years ago

@albertvillanova Done. Only 85% of the tweet IDs were still usable.

Rehydrated using the code from https://github.com/bigscience-workshop/data_tooling/issues/103#issuecomment-1019348527