IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Closes #245 | Create dataset loader for KoPI-NLLB #261

Closed. acul3 closed this 1 year ago.

acul3 commented 2 years ago

Closes #245


Accepted config name format:

datasetname_{lang}-{dedup format}_{schema}

Example: if you want to load the Acehnese language with the neardup format and the nusantara_ssp schema, the code will look like:

from datasets import load_dataset
dataset = load_dataset("/data/nusa-crowd/nusantara/nusa_datasets/kopi_nllb/kopi_nllb.py", name="kopi_nllb_ace_Latn-neardup_nusantara_ssp")
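
For a quick sanity check after loading, you can iterate over the returned dataset. The sketch below continues from the snippet above and is purely illustrative: the "train" split name and the "id"/"text" fields are assumptions based on the sample discussed later in this thread.

```python
# Illustrative sketch: print a few records from the loaded config.
# The "train" split and the "id"/"text" fields are assumptions.
for record in dataset["train"].select(range(3)):
    print(record["id"], record["text"][:80])
```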

Unit test:

python -m tests.test_nusantara nusantara/nusa_datasets/kopi_nllb/kopi_nllb.py --subset_id kopi_nllb_ace_Latn-neardup


christianwbsn commented 1 year ago

/test dataset=kopi_nllb subset_id=kopi_nllb_ace_Latn-neardup

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3051388439

christianwbsn commented 1 year ago

I'm not quite familiar with the dataset, but I find this sample from the test data weird; it doesn't look like the Acehnese language: {'id': '0', 'text': 'diff geurende lelietje-van-dalen ilan 150ml, ivoor'}

acul3 commented 1 year ago

Yeah, some of the samples probably got an incorrect language, either due to the language identification using the LASER model or from the original source dataset.

See Section 5.2 of the NLLB paper for details on the data gathering.
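
For context, one way to spot-check samples like this is to run an off-the-shelf language identifier over them. The sketch below is purely illustrative and is not what KoPI-NLLB or NLLB actually use: it assumes fastText's lid.176.bin model has been downloaded locally, and since that model does not cover Acehnese it can only flag samples that look like some other language (the sample above, for instance, looks Dutch).

```python
import fasttext

# Illustrative sketch only: fastText's 176-language LID model is an assumption
# here (NLLB uses its own identifier), and it does not include Acehnese, so
# this is just a rough way to flag samples that look like another language.
model = fasttext.load_model("lid.176.bin")  # downloaded separately

sample = "diff geurende lelietje-van-dalen ilan 150ml, ivoor"
labels, scores = model.predict(sample, k=1)
print(labels[0], scores[0])  # a confident __label__nl here would flag the sample as Dutch
```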

By the way, for both dedup and neardup I only take samples with a LASER score higher than 0.9.
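
To make that last point concrete, the filtering step described above could look roughly like the sketch below; the "laser_score" field name and the record layout are assumptions for illustration, not the actual loader code.

```python
# Illustrative sketch of the quality filter described above: keep only
# records whose LASER score exceeds 0.9. The "laser_score" field name
# and the record layout are assumptions, not the real KoPI-NLLB schema.
LASER_THRESHOLD = 0.9

def keep_record(record: dict) -> bool:
    return record.get("laser_score", 0.0) > LASER_THRESHOLD

raw_records = [
    {"text": "contoh kalimat", "laser_score": 0.95},
    {"text": "noisy line", "laser_score": 0.72},
]
filtered = [r for r in raw_records if keep_record(r)]
print(len(filtered))  # -> 1
```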

SamuelCahyawijaya commented 1 year ago

Yeah, I kind of agree with @acul3. I think it is relatively common for data from LASER to be rather noisy, especially for such low-resource languages. I think we can approve this dataset.