Closed luckysusanto closed 7 months ago
These data were used to train XLM for machine translation. As there are 4 parallel corpora and 2 monolingual corpora, a total of 2 default configs (default schema, mapping to parallel_ind_jav) + 12 custom builder configs are set.
The 12 custom builder configs follow the pattern:
indonesiannmt-{modifier}_{schema}
where
modifier = {'mono_ind', 'mono_jav', 'parallel_ind_jav', 'parallel_ind_min', 'parallel_ind_sun', 'parallel_ind_ban'}
schema = {'source', 'seacrowd_t2t'}
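The 6 modifiers crossed with the 2 schemas give the 12 custom config names. A minimal sketch of how they could be enumerated (the names mirror the pattern above; the actual dataloader would wrap each name in a SEACrowdConfig, which is omitted here):

```python
# Enumerate the 12 custom builder config names: 6 modifiers x 2 schemas.
modifiers = [
    "mono_ind", "mono_jav",
    "parallel_ind_jav", "parallel_ind_min",
    "parallel_ind_sun", "parallel_ind_ban",
]
schemas = ["source", "seacrowd_t2t"]

config_names = [f"indonesiannmt-{m}_{s}" for m in modifiers for s in schemas]

print(len(config_names))   # 12
print(config_names[0])     # indonesiannmt-mono_ind_source
```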
@luckysusanto Hello, thank you for working on this dataloader. I think it is better to redo the configurations for the dataloader. I don't think it is a good idea to add monolingual files for machine translation tasks. It is better to just remove them (or maybe add them under another task; @holylovenia, what do you think?).
Also, it is better to rename modifiers from parallel_{lang1}_{lang2} to {lang1}_{lang2} by deleting the parallel prefixes.
Noted, I will change the schemes of the dataloader @MJonibek
@holylovenia the monolingual data was used for unsupervised machine translation with the XLM architecture. That was why I still left it there under the machine translation task, even though it is unintuitive.
I'll check on what the ssp schema is and will implement the dataloader in the near future if it works! Thanks for the inputs! ^^
You can take a look at the ssp schema, used for Tasks.LANGUAGE_MODELING, i.e., unlabelled data.
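For unlabelled data, a ssp-style example would carry little more than an id and the raw text. A hedged sketch, assuming the ssp schema reduces to {"id", "text"} fields (the exact feature names come from the SEACrowd schema definitions and should be checked there):

```python
# Sketch of yielding ssp-style examples for a monolingual corpus.
# The {"id", "text"} shape is an assumption about the ssp schema.
def generate_ssp_examples(lines):
    """Yield (key, example) pairs for unlabelled language-modeling data."""
    for idx, line in enumerate(lines):
        yield idx, {"id": str(idx), "text": line.strip()}

examples = dict(generate_ssp_examples(["Halo dunia.", "Apa kabar?"]))
```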
Sure sure, take your time and let us know if you would like to discuss. Nice to e-meet you after AACL, @luckysusanto! 😄
Hi @ & @, may I know if you are still working on this PR?
Yes, I am still working on this PR. Sorry, I have been busy since early February, but I'll fix all the issues mentioned by 19 Feb.
Requesting review: @MJonibek @holylovenia
Updated: Only 8 dataloader schemes remaining, 4 for source, 4 for seacrowd_t2t
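Each of the 4 seacrowd_t2t subsets maps a parallel sentence pair into the text-to-text schema. A minimal sketch, assuming the t2t example layout pairs two texts with name fields (field names here follow the BigBio-style t2t convention and are an assumption, not taken from this PR):

```python
# Sketch: map one parallel sentence pair into an assumed t2t example shape.
def to_t2t_example(idx, src, tgt, src_lang="ind", tgt_lang="jav"):
    return {
        "id": str(idx),
        "text_1": src,          # source-language sentence
        "text_2": tgt,          # target-language sentence
        "text_1_name": src_lang,
        "text_2_name": tgt_lang,
    }

ex = to_t2t_example(0, "Selamat pagi.", "Sugeng enjing.")
```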
- Is there any reason why you decided not to include the monolingual data in the dataloader? Previously, we discussed that it might be included using the ssp schema. (cc: What do you think, @MJonibek?)
I also think it would be great to add the ssp schema for the monolingual datasets.
Is there any reason why you decided not to include the monolingual data in the dataloader? Previously, we discussed that it might be included using the ssp schema. (cc: What do you think, @MJonibek?)
My bad! Yeah, it skipped my mind, entirely my fault. I'll implement it while also fixing the styling issues you have mentioned @holylovenia.
Will ping you guys later when I fix it, ETA: Wednesday
Requesting re-review: @holylovenia @MJonibek
Update log: SSP task implemented for monolingual data. Changes requested by @holylovenia were implemented!
Tests ran successfully:
python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py --schema=T2T --subset_id=indonesiannmt_ind_jav
python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py --schema=SSP --subset_id=indonesiannmt_ind
@holylovenia, I've done the formatting myself; we can merge.
Closes #338
Checklist
- Name of the dataloader script follows seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py (please use only lowercase and underscore for dataset naming).
- Values are provided for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
- _info(), _split_generators() and _generate_examples() are implemented in the dataloader script.
- The BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
- The dataset loads with the datasets.load_dataset function.
- The dataloader passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/indonesiannmt/indonesiannmt.py.