SamuelCahyawijaya closed this issue 11 months ago
There is an updated version of the OSCAR dataset from the same group, namely oscar-2301. Should I submit it as a new public dataset, or just implement both in this issue, since they have similar structures and metadata? @SamuelCahyawijaya @holylovenia
Edit to dataset description:
Hi @akhdanfadh, sorry for the late reply.
Regarding oscar-2301, from your observation, is the data from oscar-2201 also included in oscar-2301? If yes, I'm tempted to modify the datasheet to oscar-2301 and update the info accordingly. (cc: @SamuelCahyawijaya what do you think?)
oscar-2201 is based on the Common Crawl Nov/Dec 2021 snapshot, while oscar-2301 is based on the following year's, so I'm guessing the two are quite similar. From the dataset card itself:
While being quite similar to OSCAR 22.01, it contains several new features, including KenLM-based adult content detection, precomputed Locality-Sensitive Hashes for near deduplication, and blocklist-based categories.
Things to note are the language subsets, though these are not included in our approved SEA languages:
Dataloader name:
oscar_2201/oscar_2201.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?oscar_2201