SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.

Apache License 2.0

Create dataset loader for OSCAR-2201 #60

Closed: SamuelCahyawijaya closed this issue 11 months ago

SamuelCahyawijaya commented 12 months ago

Dataloader name: oscar_2201/oscar_2201.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?oscar_2201

Dataset	oscar_2201
Description	OSCAR or Open Super-large Crawled Aggregated coRpus is a huge multilingual corpus obtained by language classification and filtering of the Common Crawl corpus using the ungoliant architecture. Data is distributed by language in both original and deduplicated form.
Subsets	-
Languages	ind, zlm, jav, vie
Tasks	Language Modeling
License	BigScience OpenRAIL-M (bigscience-openrail-m)
Homepage	https://huggingface.co/datasets/oscar-corpus/OSCAR-2201
HF URL	https://huggingface.co/datasets/oscar-corpus/OSCAR-2201
Paper URL	https://arxiv.org/abs/2201.06642

akhdanfadh commented 12 months ago

self-assign

akhdanfadh commented 12 months ago

There is an updated version of OSCAR from the same group, namely oscar-2301. Should I submit it as a new public dataset, or just implement both in this issue, since they have similar structures and metadata? @SamuelCahyawijaya @holylovenia

akhdanfadh commented 12 months ago

Edits to the dataset description:

The license doesn't match the one on Hugging Face; it is a custom license.
The matching SEA languages in the dataset are:
- Waray ('war')
- Cebuano ('ceb')
- Minangkabau ('min')
- Vietnamese ('vi')
- Tamil ('ta')
- Iloko ('ilo')
- Filipino ('tl')
- Lao ('lo')
- Khmer ('km')
- Burmese ('my')
- Javanese ('jv')
- Indonesian ('id')
- Thai ('th')
- Sundanese ('su')
- Malay ('ms')
There are multiple citations for this dataset. Should I include all of them?
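
For reference, the subset list above could be kept as a small lookup table in the dataloader. This is only a sketch: the names `SEA_SUBSETS` and `is_sea_subset` are hypothetical and not part of the SEACrowd codebase, and the codes are copied as written in the list above.

```python
# Subset codes and language names copied from the list above.
# SEA_SUBSETS and is_sea_subset are hypothetical names used for illustration.
SEA_SUBSETS = {
    "war": "Waray",
    "ceb": "Cebuano",
    "min": "Minangkabau",
    "vi": "Vietnamese",
    "ta": "Tamil",
    "ilo": "Iloko",
    "tl": "Filipino",
    "lo": "Lao",
    "km": "Khmer",
    "my": "Burmese",
    "jv": "Javanese",
    "id": "Indonesian",
    "th": "Thai",
    "su": "Sundanese",
    "ms": "Malay",
}


def is_sea_subset(code: str) -> bool:
    """Return True if the OSCAR subset code is one of the SEA languages listed above."""
    return code in SEA_SUBSETS


if __name__ == "__main__":
    for code, name in sorted(SEA_SUBSETS.items()):
        print(f"{code}\t{name}")
```

Each code would presumably be used to select the matching language subset on Hugging Face; the exact loading call should be checked against the dataset card.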

holylovenia commented 11 months ago

Hi @akhdanfadh, sorry for the late reply.

Regarding oscar-2301, from your observation, is the data from oscar-2201 also included in oscar-2301? If yes, I'm tempted to modify the datasheet to oscar-2301 and update the info accordingly. (cc: @SamuelCahyawijaya what do you think?)

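On the license mismatch: I'll re-check and edit the datasheet to match the correct one tonight.
On the list of matching SEA languages: that's great!
On whether to include all of the citations: yes.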

akhdanfadh commented 11 months ago

oscar-2201 is based on the Common Crawl Nov/Dec 2021 snapshot, while oscar-2301 is based on the following year's snapshot, so I'm guessing they are quite similar. The dataset card itself says:

"While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories."

One thing to note is that the language subsets differ, though none of the differing languages are in our approved SEA languages:

Languages in 2201 but not in 2301: {'eml', 'als', 'diq', 'scn'}
Languages in 2301 but not in 2201: {'x-eml', 'gsw', 'mwl', 'ht', 'ie'}
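
A comparison like this can be reproduced with plain set operations. The sketch below is illustrative only: the two differing sets are the ones reported above, `SHARED` is a hypothetical stand-in for the languages both releases have in common (the full subset lists are not reproduced here), and `SEA_CODES` holds the SEA subset codes listed earlier in this thread.

```python
# Differences as reported in the comment above.
ONLY_IN_2201 = {"eml", "als", "diq", "scn"}
ONLY_IN_2301 = {"x-eml", "gsw", "mwl", "ht", "ie"}

# Hypothetical stand-in for the subsets common to both releases;
# the real lists are much longer.
SHARED = {"id", "vi", "th"}

langs_2201 = SHARED | ONLY_IN_2201
langs_2301 = SHARED | ONLY_IN_2301

# SEA subset codes listed earlier in this thread.
SEA_CODES = {
    "war", "ceb", "min", "vi", "ta", "ilo", "tl", "lo",
    "km", "my", "jv", "id", "th", "su", "ms",
}


def compare_subsets(old: set, new: set) -> tuple:
    """Return (codes only in old, codes only in new)."""
    return old - new, new - old


removed, added = compare_subsets(langs_2201, langs_2301)

# Any SEA language touched by the change between releases.
affected_sea = (removed | added) & SEA_CODES

print("In 2201 but not in 2301:", sorted(removed))
print("In 2301 but not in 2201:", sorted(added))
print("Affected SEA languages:", sorted(affected_sea))
```

With the sets above, `affected_sea` comes out empty, which matches the observation that none of the differing subsets are SEA languages.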