SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #338 | Created DataLoader for IndonesianNMT #367

Closed luckysusanto closed 7 months ago

luckysusanto commented 8 months ago

Closes #338


luckysusanto commented 8 months ago

These data were used to train XLM for machine translation. Since there are 4 parallel corpora and 2 monolingual corpora, a total of 14 Builder Configs are set: 2 default configs (one per schema, pointing to parallel_ind_jav) plus 12 custom configs.

The 12 custom Builder Configs follow the pattern indonesiannmt-{modifier}_{schema}, where:

modifier = {'mono_ind', 'mono_jav', 'parallel_ind_jav', 'parallel_ind_min', 'parallel_ind_sun', 'parallel_ind_ban'}
schema = {'source', 'seacrowd_t2t'}
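For readers unfamiliar with the layout, the 12 custom config names can be enumerated mechanically. This is a hedged sketch, not the PR's actual code; the naming pattern indonesiannmt-{modifier}_{schema} is taken from the comment above, while the function and variable names are mine:

```python
# Hypothetical sketch: enumerate the 12 custom BuilderConfig names.
# The naming pattern comes from the PR description; this is not the
# actual dataloader code.
_DATASETNAME = "indonesiannmt"
MODIFIERS = [
    "mono_ind", "mono_jav",
    "parallel_ind_jav", "parallel_ind_min",
    "parallel_ind_sun", "parallel_ind_ban",
]
SCHEMAS = ["source", "seacrowd_t2t"]

def build_config_names():
    """One config name per (modifier, schema) pair: 6 * 2 = 12 names."""
    return [f"{_DATASETNAME}-{m}_{s}" for m in MODIFIERS for s in SCHEMAS]
```

A quick sanity check confirms the count matches the 12 custom configs described above.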

MJonibek commented 8 months ago

@luckysusanto Hello, thank you for working on this dataloader. I think it would be better to redo the configurations for this dataloader. I don't think it is a good idea to add monolingual files under a machine translation task. It would be better to just remove them (or maybe add them under another task; @holylovenia, what do you think?).

Also, it would be better to rename the modifiers from parallel_{lang1}_{lang2} to {lang1}_{lang2}, i.e. drop the parallel prefix.

luckysusanto commented 8 months ago

Noted, I will change the schemas of the dataloader @MJonibek

@holylovenia the monolingual data was used for unsupervised machine translation with the XLM architecture. That is why I left it under the machine translation task, even though it is unintuitive.

I'll check what the ssp schema is and will implement the dataloader in the near future if it works! Thanks for the inputs! ^^

holylovenia commented 8 months ago

> Noted, I will change the schemas of the dataloader @MJonibek
>
> @holylovenia the monolingual data was used for unsupervised machine translation with the XLM architecture. That is why I left it under the machine translation task, even though it is unintuitive.
>
> I'll check what the ssp schema is and will implement the dataloader in the near future if it works! Thanks for the inputs! ^^

You can take a look at the ssp schema, which is used for Tasks.LANGUAGE_MODELING, i.e. unlabelled data.
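For context, an ssp example is essentially just an id plus raw text, which is why it suits unlabelled monolingual corpora. A minimal sketch, assuming the ssp schema carries "id" and "text" fields (the exact field names are my assumption, not confirmed in this thread):

```python
# Hedged sketch: convert raw monolingual lines into ssp-style examples.
# The {"id", "text"} field names are an assumption about the ssp schema.
def to_ssp_example(idx, line):
    return {"id": str(idx), "text": line.strip()}

# Toy Indonesian/Javanese monolingual lines for illustration.
corpus = ["Saya makan nasi.\n", "Kula nedha sekul.\n"]
examples = [to_ssp_example(i, line) for i, line in enumerate(corpus)]
```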

Sure sure, take your time and let us know if you would like to discuss. Nice to e-meet you after AACL, @luckysusanto! 😄

github-actions[bot] commented 8 months ago

Hi @ & @, may I know if you are still working on this PR?

luckysusanto commented 8 months ago

Yes, I am still working on this PR. Sorry, I have been busy since early February, but I'll fix all the issues mentioned by 19 Feb.

luckysusanto commented 8 months ago

Requesting review: @MJonibek @holylovenia

Update: only 8 dataloader configs remain, 4 for source and 4 for seacrowd_t2t.

MJonibek commented 7 months ago
> Is there any reason why you decided not to include the monolingual data in the dataloader? Previously, we discussed that it might be included using the ssp schema. (cc: What do you think, @MJonibek?)

I also think it would be great to add the ssp schema for the monolingual datasets.

luckysusanto commented 7 months ago

> Is there any reason why you decided not to include the monolingual data in the dataloader? Previously, we discussed that it might be included using the ssp schema. (cc: What do you think, @MJonibek?)

My bad! It slipped my mind entirely, my fault. I'll implement it while also fixing the styling issues you mentioned, @holylovenia.

Will ping you guys later when I fix it, ETA: Wednesday

luckysusanto commented 7 months ago

Requesting re-review: @holylovenia @MJonibek

Update log: the SSP task has been implemented for the monolingual data, and the changes requested by @holylovenia have been implemented!

Tests ran successfully:

python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py --schema=T2T --subset_id=indonesiannmt_ind_jav
python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py --schema=SSP --subset_id=indonesiannmt_ind
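For readers following along, the T2T subset tested above yields parallel text pairs. A hedged sketch of what a single row might look like; the field names (id, text_1, text_2, text_1_name, text_2_name) reflect my understanding of the seacrowd t2t convention and are assumptions, not taken from this PR:

```python
# Hedged sketch: shape of one seacrowd_t2t row for the ind-jav parallel
# subset. Field names are assumptions about the t2t schema convention.
def to_t2t_example(idx, src, tgt, src_lang="ind", tgt_lang="jav"):
    return {
        "id": str(idx),
        "text_1": src,          # source-side sentence
        "text_2": tgt,          # target-side sentence
        "text_1_name": src_lang,
        "text_2_name": tgt_lang,
    }

row = to_t2t_example(0, "Saya makan nasi.", "Kula nedha sekul.")
```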

MJonibek commented 7 months ago

@holylovenia, I've done the formatting myself; this can be merged.