IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

Closes #245 | Create dataset loader for KoPI-NLLB #261

Closed. acul3 closed this 1 year ago.

acul3 commented 2 years ago

Closes #245


Accepted config name format:

datasetname_{lang}-{dedup format}_{schema}

Example: if you want to load the Acehnese language with the neardup format and the nusantara_ssp schema, the code will look like:

from datasets import load_dataset
dataset = load_dataset("/data/nusa-crowd/nusantara/nusa_datasets/kopi_nllb/kopi_nllb.py", name="kopi_nllb_ace_Latn-neardup_nusantara_ssp")
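
For a quick sanity check after loading, you can iterate over the returned dataset. The sketch below continues from the snippet above and is purely illustrative: the "train" split name and the "id"/"text" fields are assumptions based on the sample discussed later in this thread.

```python
# Illustrative sketch: print a few records from the loaded config.
# The "train" split and the "id"/"text" fields are assumptions.
for record in dataset["train"].select(range(3)):
    print(record["id"], record["text"][:80])
```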

Unit test:

python -m tests.test_nusantara nusantara/nusa_datasets/kopi_nllb/kopi_nllb.py --subset_id kopi_nllb_ace_Latn-neardup


christianwbsn commented 1 year ago

/test dataset=kopi_nllb subset_id=kopi_nllb_ace_Latn-neardup

github-actions[bot] commented 1 year ago

Run result

Check test log here: https://github.com/IndoNLP/nusa-crowd/actions/runs/3051388439

christianwbsn commented 1 year ago

I'm not quite familiar with the dataset, but I find this sample from the test data weird; it doesn't look like the Acehnese language: {'id': '0', 'text': 'diff geurende lelietje-van-dalen ilan 150ml, ivoor'}

acul3 commented 1 year ago

Yeah, some of the samples probably got an incorrect language, either due to the language identification using the LASER model or from the original source dataset.

See Section 5.2 of the NLLB paper for details on the data gathering.
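
For context, one way to spot-check samples like this is to run an off-the-shelf language identifier over them. The sketch below is purely illustrative and is not what KoPI-NLLB or NLLB actually use: it assumes fastText's lid.176.bin model has been downloaded locally, and since that model does not cover Acehnese it can only flag samples that look like some other language (the sample above, for instance, looks Dutch).

```python
import fasttext

# Illustrative sketch only: fastText's 176-language LID model is an assumption
# here (NLLB uses its own identifier), and it does not include Acehnese, so
# this is just a rough way to flag samples that look like another language.
model = fasttext.load_model("lid.176.bin")  # downloaded separately

sample = "diff geurende lelietje-van-dalen ilan 150ml, ivoor"
labels, scores = model.predict(sample, k=1)
print(labels[0], scores[0])  # a confident __label__nl here would flag the sample as Dutch
```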

By the way, for both dedup and neardup I only take samples with a LASER score higher than 0.9.
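
To make that last point concrete, the filtering step described above could look roughly like the sketch below; the "laser_score" field name and the record layout are assumptions for illustration, not the actual loader code.

```python
# Illustrative sketch of the quality filter described above: keep only
# records whose LASER score exceeds 0.9. The "laser_score" field name
# and the record layout are assumptions, not the real KoPI-NLLB schema.
LASER_THRESHOLD = 0.9

def keep_record(record: dict) -> bool:
    return record.get("laser_score", 0.0) > LASER_THRESHOLD

raw_records = [
    {"text": "contoh kalimat", "laser_score": 0.95},
    {"text": "noisy line", "laser_score": 0.72},
]
filtered = [r for r in raw_records if keep_record(r)]
print(len(filtered))  # -> 1
```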

SamuelCahyawijaya commented 1 year ago

Yeah, I kind of agree with @acul3. I think it is relatively common for data from LASER to be rather noisy, especially for such low-resource languages. I think we can approve this dataset.