SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #338 | Created DataLoader for IndonesianNMT #367

Closed luckysusanto closed 7 months ago

luckysusanto commented 8 months ago

Closes #338


luckysusanto commented 8 months ago

These data were used to train XLM for machine translation. Since there are 4 parallel corpora and 2 monolingual corpora, a total of 14 Builder Configs are set: 2 default configs (one per schema, pointing to parallel_ind_jav) plus 12 custom configs.

The 12 custom Builder Configs follow the pattern indonesiannmt-{modifier}_{schema}, where:

modifier = {'mono_ind', 'mono_jav', 'parallel_ind_jav', 'parallel_ind_min', 'parallel_ind_sun', 'parallel_ind_ban'}
schema = {'source', 'seacrowd_t2t'}
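For readers unfamiliar with the layout, the 12 custom config names can be enumerated mechanically. This is a hedged sketch, not the PR's actual code; the naming pattern indonesiannmt-{modifier}_{schema} is taken from the comment above, while the function and variable names are mine:

```python
# Hypothetical sketch: enumerate the 12 custom BuilderConfig names.
# The naming pattern comes from the PR description; this is not the
# actual dataloader code.
_DATASETNAME = "indonesiannmt"
MODIFIERS = [
    "mono_ind", "mono_jav",
    "parallel_ind_jav", "parallel_ind_min",
    "parallel_ind_sun", "parallel_ind_ban",
]
SCHEMAS = ["source", "seacrowd_t2t"]

def build_config_names():
    """One config name per (modifier, schema) pair: 6 * 2 = 12 names."""
    return [f"{_DATASETNAME}-{m}_{s}" for m in MODIFIERS for s in SCHEMAS]
```

A quick sanity check confirms the count matches the 12 custom configs described above.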

MJonibek commented 8 months ago

@luckysusanto Hello, thank you for working on this dataloader. I think it would be better to redo the configurations for this dataloader. I don't think it is a good idea to add monolingual files under a machine translation task. It would be better to just remove them (or maybe add them under another task; @holylovenia, what do you think?).

Also, it would be better to rename the modifiers from parallel_{lang1}_{lang2} to {lang1}_{lang2}, i.e. drop the parallel prefix.

luckysusanto commented 8 months ago

Noted, I will change the schemas of the dataloader @MJonibek

@holylovenia the monolingual data was used for unsupervised machine translation with the XLM architecture. That is why I left it under the machine translation task, even though it is unintuitive.

I'll check what the ssp schema is and will implement the dataloader in the near future if it works! Thanks for the inputs! ^^

holylovenia commented 8 months ago

> Noted, I will change the schemas of the dataloader @MJonibek
>
> @holylovenia the monolingual data was used for unsupervised machine translation with the XLM architecture. That is why I left it under the machine translation task, even though it is unintuitive.
>
> I'll check what the ssp schema is and will implement the dataloader in the near future if it works! Thanks for the inputs! ^^

You can take a look at the ssp schema, which is used for Tasks.LANGUAGE_MODELING, i.e. unlabelled data.
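For context, an ssp example is essentially just an id plus raw text, which is why it suits unlabelled monolingual corpora. A minimal sketch, assuming the ssp schema carries "id" and "text" fields (the exact field names are my assumption, not confirmed in this thread):

```python
# Hedged sketch: convert raw monolingual lines into ssp-style examples.
# The {"id", "text"} field names are an assumption about the ssp schema.
def to_ssp_example(idx, line):
    return {"id": str(idx), "text": line.strip()}

# Toy Indonesian/Javanese monolingual lines for illustration.
corpus = ["Saya makan nasi.\n", "Kula nedha sekul.\n"]
examples = [to_ssp_example(i, line) for i, line in enumerate(corpus)]
```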

Sure sure, take your time and let us know if you would like to discuss. Nice to e-meet you after AACL, @luckysusanto! 😄

github-actions[bot] commented 8 months ago

Hi @ & @, may I know if you are still working on this PR?

luckysusanto commented 8 months ago

Yes, I am still working on this PR. Sorry, I have been busy since early February, but I'll fix all the issues mentioned by 19 Feb.

luckysusanto commented 8 months ago

Requesting review: @MJonibek @holylovenia

Update: only 8 dataloader configs remain, 4 for source and 4 for seacrowd_t2t.

MJonibek commented 7 months ago
> Is there any reason why you decided not to include the monolingual data in the dataloader? Previously, we discussed that it might be included using the ssp schema. (cc: What do you think, @MJonibek?)

I also think it would be great to add the ssp schema for the monolingual datasets.

luckysusanto commented 7 months ago

> Is there any reason why you decided not to include the monolingual data in the dataloader? Previously, we discussed that it might be included using the ssp schema. (cc: What do you think, @MJonibek?)

My bad! It slipped my mind entirely, my fault. I'll implement it while also fixing the styling issues you mentioned, @holylovenia.

Will ping you guys later when I fix it, ETA: Wednesday

luckysusanto commented 7 months ago

Requesting re-review: @holylovenia @MJonibek

Update log: the SSP task has been implemented for the monolingual data, and the changes requested by @holylovenia have been implemented!

Tests ran successfully:

python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py --schema=T2T --subset_id=indonesiannmt_ind_jav
python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py --schema=SSP --subset_id=indonesiannmt_ind
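For readers following along, the T2T subset tested above yields parallel text pairs. A hedged sketch of what a single row might look like; the field names (id, text_1, text_2, text_1_name, text_2_name) reflect my understanding of the seacrowd t2t convention and are assumptions, not taken from this PR:

```python
# Hedged sketch: shape of one seacrowd_t2t row for the ind-jav parallel
# subset. Field names are assumptions about the t2t schema convention.
def to_t2t_example(idx, src, tgt, src_lang="ind", tgt_lang="jav"):
    return {
        "id": str(idx),
        "text_1": src,          # source-side sentence
        "text_2": tgt,          # target-side sentence
        "text_1_name": src_lang,
        "text_2_name": tgt_lang,
    }

row = to_t2t_example(0, "Saya makan nasi.", "Kula nedha sekul.")
```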

MJonibek commented 7 months ago

@holylovenia, I've done the formatting myself; this can be merged.