Closes #512 | Add/Update Dataloader QED

patrickamadeus commented 3 months ago

Closes #512

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[.] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.
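
The BUILDER_CONFIGS item in the checklist can be illustrated with a small, library-free sketch. `StandInConfig` is a hypothetical stand-in for SEACrowdConfig (the real class is not imported here), and the dataset name `my_dataset`, the schema name `t2t`, and the `_source` / `_seacrowd_<schema>` naming pattern are assumptions for illustration, not taken from this PR's script.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StandInConfig:
    """Hypothetical stand-in for SEACrowdConfig; fields are illustrative."""
    name: str
    schema: str
    subset_id: str


def build_configs(dataset_name: str, seacrowd_schema: str) -> List[StandInConfig]:
    """Return one source-schema config and one seacrowd-schema config."""
    return [
        StandInConfig(
            name=f"{dataset_name}_source",
            schema="source",
            subset_id=dataset_name,
        ),
        StandInConfig(
            name=f"{dataset_name}_seacrowd_{seacrowd_schema}",
            schema=f"seacrowd_{seacrowd_schema}",
            subset_id=dataset_name,
        ),
    ]


def has_required_schemas(configs: List[StandInConfig]) -> bool:
    """Check the checklist rule: at least one source and one seacrowd config."""
    has_source = any(c.schema == "source" for c in configs)
    has_seacrowd = any(c.schema.startswith("seacrowd_") for c in configs)
    return has_source and has_seacrowd


# Placeholder names only; a real dataloader would use its own _DATASETNAME.
BUILDER_CONFIGS = build_configs("my_dataset", "t2t")
```

A check like `has_required_schemas(BUILDER_CONFIGS)` could be run before opening the PR to catch a missing schema early.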

T2T

malay-thai

malay-burmese

english-javanese

indonesian-javanese

SSP

malay-burmese

indonesian-javanese

NB:

Tested 4 of the 28 language pairs for T2T, and 2 of the 56 cases for SSP.
For pairs that include en or id, the first line is an outlier because it contains the document title (this can be mistaken for an error, as in the last 2 examples).
~Currently covers only Tasks.MACHINE_TRANSLATION; will add a task if needed for language modeling.~
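
The totals in the note are consistent with eight languages: 28 unordered pairs and 56 ordered (language, partner) cases. The sketch below only checks that arithmetic. That the counts were derived this way is an assumption, and the eight codes are inferred as the complete list given later in the thread minus the six languages added afterwards.

```python
from itertools import combinations, permutations

# Assumption: the eight codes covered before the language list was extended.
LANGS = ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm"]

# Unordered pairs: one T2T subset per language pair.
t2t_pairs = list(combinations(LANGS, 2))

# Ordered pairs: one case per (language, partner) combination.
ssp_cases = list(permutations(LANGS, 2))

print(len(t2t_pairs))  # 28
print(len(ssp_cases))  # 56
```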

patrickamadeus commented 3 months ago

Will add the SSP schema this week after discussing with @SamuelCahyawijaya.

patrickamadeus commented 3 months ago

Updated SSP approach with (some) retests!

cc @ljvmiranda921 @holylovenia @SamuelCahyawijaya

MT `jv-ms`

SSP `jv`

SSP `vi`

SSP `tl`

holylovenia commented 3 months ago

Checking the data source, we can add the following SEA languages to the dataloader:

Cebuano

Filipino (fil): Normally we consider tl and fil to be the same, but the monolingual documents of fil and tl here are different.

Khmer

Lao

Madurese

Pampanga

@holylovenia Please also update the datasheet accordingly, thanks!

I'd like to confirm before I rectify the language list.

So the complete list of the languages is ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm", "ceb", "fil", "khm", "lao", "mad", "pam"], right? I'm assuming Pampanga == Kapampangan (pam).

cc: @elyanah-aco @ljvmiranda921
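
As a quick sanity check on the proposed list, the sketch below merges the six added languages into the earlier codes and verifies that the result has no duplicates and that every code has the three-lowercase-letter shape of an ISO 639-3 code. The helper names are hypothetical, and the shape check does not confirm that a code is actually registered.

```python
import re
from typing import Dict, List

# Codes that remain after removing the six additions from the complete list.
BASE_LANGS = ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm"]

# Language names from the review comment, with the codes used in the list.
ADDED_LANGS: Dict[str, str] = {
    "Cebuano": "ceb",
    "Filipino": "fil",
    "Khmer": "khm",
    "Lao": "lao",
    "Madurese": "mad",
    "Pampanga": "pam",  # treated as Kapampangan, as assumed in the thread
}


def merge_languages(base: List[str], extra: List[str]) -> List[str]:
    """Append new codes to the base list, keeping order and skipping duplicates."""
    merged = list(base)
    for code in extra:
        if code not in merged:
            merged.append(code)
    return merged


def looks_like_iso639_3(code: str) -> bool:
    """Shape check only: exactly three lowercase ASCII letters."""
    return re.fullmatch(r"[a-z]{3}", code) is not None


languages = merge_languages(BASE_LANGS, list(ADDED_LANGS.values()))
```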

elyanah-aco commented 3 months ago

> So the complete list of the languages is ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm", "ceb", "fil", "khm", "lao", "mad", "pam"], right? I'm assuming Pampanga == Kapampangan (pam).

Yes this is correct, thanks!

holylovenia commented 3 months ago

> So the complete list of the languages is ["eng", "vie", "tha", "mya", "jav", "ind", "tgl", "zlm", "ceb", "fil", "khm", "lao", "mad", "pam"], right? I'm assuming Pampanga == Kapampangan (pam).
>
> Yes this is correct, thanks!

Done!

holylovenia commented 2 months ago

Hi @patrickamadeus, a friendly reminder to address @elyanah-aco's suggestions.

patrickamadeus commented 2 months ago

Thank you for the review! Updated the lang list. @elyanah-aco @holylovenia

patrickamadeus commented 2 months ago

Thank you for the nitpick! It's done in the latest commit! @elyanah-aco

holylovenia commented 2 months ago

> Thank you for the nitpick! It's done in the latest commit! @elyanah-aco

Hi @elyanah-aco, may I know if you're ready to approve this PR? If all your concerns have been addressed by @patrickamadeus, I'm inclined to get this PR merged.

SEACrowd / seacrowd-datahub