SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
65 stars Β· 57 forks

Closes #364 | Implement dataloader for LTI langID Corpus #405

Closed Β· ssun32 closed this pull request 5 months ago

ssun32 commented 8 months ago

Closes #364


ssun32 commented 7 months ago

> Hi @ssun32, thank you for your contribution. I haven't done an in-depth check, but I noticed that you have a subset per language. However, we only need a source schema and a seacrowd schema, each returning all of the data, because the task is language identification.
>
> cc: Please CMIIW, @MJonibek.

Hello @holylovenia, thank you so much for taking the time to review this pull request. The issue with having a source schema and a seacrowd schema that load all languages is that the user would have little control over which languages to load. For example, if a user only needs the "Malay" and "Tamil" subsets, they must write a filter function over the entire dataset of 100+ languages to pick out the Malay and Tamil instances. It might be better to allow the user to load the Malay subset and the Tamil subset independently and concatenate them.

Would it be better to modify the dataloader such that `source` and `seacrowd` return all languages, while `source_[lang_id]` and `seacrowd_[lang_id]` return the per-language subsets?
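
To make the two usage patterns concrete, here is a minimal sketch using the Hugging Face `datasets` API; the script path, config names, and 3-letter language codes below are illustrative assumptions rather than the final interface of this dataloader.

```python
from datasets import concatenate_datasets, load_dataset

_SCRIPT = "seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py"  # illustrative path

# Option 1: load the aggregate config and filter by language label afterwards.
all_data = load_dataset(_SCRIPT, name="lti_langid_corpus_seacrowd_text", split="train")
msa_tam = all_data.filter(lambda ex: ex["label"] in {"zsm", "tam"})

# Option 2: load hypothetical per-language configs and concatenate them.
msa = load_dataset(_SCRIPT, name="lti_langid_corpus_zsm_seacrowd_text", split="train")
tam = load_dataset(_SCRIPT, name="lti_langid_corpus_tam_seacrowd_text", split="train")
msa_tam = concatenate_datasets([msa, tam])
```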

holylovenia commented 7 months ago

> > Hi @ssun32, thank you for your contribution. I haven't done an in-depth check, but I noticed that you have a subset per language. However, we only need a source schema and a seacrowd schema, each returning all of the data, because the task is language identification. cc: Please CMIIW, @MJonibek.
>
> Hello @holylovenia, thank you so much for taking the time to review this pull request. The issue with having a source schema and a seacrowd schema that load all languages is that the user would have little control over which languages to load. For example, if a user only needs the "Malay" and "Tamil" subsets, they must write a filter function over the entire dataset of 100+ languages to pick out the Malay and Tamil instances. It might be better to allow the user to load the Malay subset and the Tamil subset independently and concatenate them.
>
> Would it be better to modify the dataloader such that `source` and `seacrowd` return all languages, while `source_[lang_id]` and `seacrowd_[lang_id]` return the per-language subsets?

I see what you mean. Do you suppose that splitting a LID dataset based on its language labels (i.e., `{_DATASETNAME}_{lang}_seacrowd_text` and `{_DATASETNAME}_{lang}_source` subsets) has any merit? If yes, then we can leave it be as long as the `{_DATASETNAME}_seacrowd_text` and `{_DATASETNAME}_source` subsets return all of the data.

MJonibek commented 7 months ago

Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, 0 examples are generated.

holylovenia commented 7 months ago

> Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, 0 examples are generated.

Wait @MJonibek, now I've been enlightened, and after thinking for a while I quite understand @ssun32's point... If we have per-language subsets, it would be easier to have different subset compositions.

For example, if I only want the languages of Indonesia, I can just use `{_DATASETNAME}_ind_seacrowd_text`, `{_DATASETNAME}_sun_seacrowd_text`, and `{_DATASETNAME}_jav_seacrowd_text` without including `{_DATASETNAME}_mya_seacrowd_text` and `{_DATASETNAME}_ceb_seacrowd_text`.

Now I'm wondering if we should have per-language subsets for all of the existing and the incoming LID dataloaders. πŸ˜‚

Do you think my logic makes sense, @MJonibek @ssun32? Or is there any other idea you'd suggest?

cc: I'm tagging @sabilmakbar in this discussion loop since we were discussing the same thing on Discord.

ssun32 commented 7 months ago

> Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, 0 examples are generated.

The issue with this dataset is that it requires running a complicated 500+ line shell script to preprocess the data, and I took a naive approach and ran that script via `subprocess.call`, hoping that it would support most operating systems other than Windows. However, I just realized some of the binaries are not that cross-platform compatible, e.g., the `subsample` binary only works on my Ubuntu server but not on my Mac.

@holylovenia @MJonibek Should we convert this to a local dataset and ask users to run the preprocessing script manually instead?
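
For reference, the naive approach described above roughly amounts to shelling out to the preprocessing script during data generation; a minimal sketch of that pattern follows (the script name and arguments are illustrative, not the actual files in this PR):

```python
import subprocess
from pathlib import Path

def run_preprocessing(data_dir: str) -> Path:
    """Run the (hypothetical) preprocessing shell script and return the output directory."""
    out_dir = Path(data_dir) / "processed"
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the script, or one of the
    # platform-specific binaries it invokes (e.g. `subsample`), fails.
    subprocess.run(["bash", "preprocess_lti_langid.sh", data_dir, str(out_dir)], check=True)
    return out_dir
```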

holylovenia commented 7 months ago

> > Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, 0 examples are generated.
>
> The issue with this dataset is that it requires running a complicated 500+ line shell script to preprocess the data, and I took a naive approach and ran that script via `subprocess.call`, hoping that it would support most operating systems other than Windows. However, I just realized some of the binaries are not that cross-platform compatible, e.g., the `subsample` binary only works on my Ubuntu server but not on my Mac.
>
> @holylovenia @MJonibek Should we convert this to a local dataset and ask users to run the preprocessing script manually instead?

@ssun32 May I know what kind of preprocessing is needed? Is it downloading the data files, building an index, or something else?

holylovenia commented 7 months ago

> > Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, 0 examples are generated.
>
> Wait @MJonibek, now I've been enlightened, and after thinking for a while I quite understand @ssun32's point... If we have per-language subsets, it would be easier to have different subset compositions.
>
> For example, if I only want the languages of Indonesia, I can just use `{_DATASETNAME}_ind_seacrowd_text`, `{_DATASETNAME}_sun_seacrowd_text`, and `{_DATASETNAME}_jav_seacrowd_text` without including `{_DATASETNAME}_mya_seacrowd_text` and `{_DATASETNAME}_ceb_seacrowd_text`.
>
> Now I'm wondering if we should have per-language subsets for all of the existing and the incoming LID dataloaders. πŸ˜‚
>
> Do you think my logic makes sense, @MJonibek @ssun32? Or is there any other idea you'd suggest?
>
> cc: I'm tagging @sabilmakbar in this discussion loop since we were discussing the same thing on Discord.

Regarding this, I hope you don't mind waiting for a little more time, @ssun32. We're going to have an internal discussion on what subsets LID dataloaders should consist of. I'll get back to you soon.

holylovenia commented 7 months ago

> > > Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, 0 examples are generated.
> >
> > Wait @MJonibek, now I've been enlightened, and after thinking for a while I quite understand @ssun32's point... If we have per-language subsets, it would be easier to have different subset compositions.
> >
> > For example, if I only want the languages of Indonesia, I can just use `{_DATASETNAME}_ind_seacrowd_text`, `{_DATASETNAME}_sun_seacrowd_text`, and `{_DATASETNAME}_jav_seacrowd_text` without including `{_DATASETNAME}_mya_seacrowd_text` and `{_DATASETNAME}_ceb_seacrowd_text`.
> >
> > Now I'm wondering if we should have per-language subsets for all of the existing and the incoming LID dataloaders. πŸ˜‚ Do you think my logic makes sense, @MJonibek @ssun32? Or is there any other idea you'd suggest?
> >
> > cc: I'm tagging @sabilmakbar in this discussion loop since we were discussing the same thing on Discord.
>
> Regarding this, I hope you don't mind waiting for a little more time, @ssun32. We're going to have an internal discussion on what subsets LID dataloaders should consist of. I'll get back to you soon.

Hi @ssun32, for LID dataloaders, we agreed upon having an aggregate source schema and an aggregate seacrowd schema, each returning all of the data. No need for per-language subsets. If future users need to filter some language labels, we will let them do it after loading the data.

We will check the other LID dataloaders to ensure that the language labels are also in the same format as used here, i.e., 3-letter ISO codes.
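
As a rough illustration of what the aggregate seacrowd text config implies, here is a sketch using plain `datasets` feature types and an illustrative subset of ISO 639-3 labels (SEACrowd's own schema helpers and the full label list may differ):

```python
import datasets

# Illustrative subset of 3-letter ISO 639-3 labels; the real dataloader
# enumerates every language present in the corpus.
_LANGUAGES = ["ceb", "ind", "jav", "mya", "sun", "tam", "zsm"]

features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "text": datasets.Value("string"),
        "label": datasets.ClassLabel(names=_LANGUAGES),
    }
)
```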

cc: @MJonibek @SamuelCahyawijaya @sabilmakbar

holylovenia commented 6 months ago

Hi @ssun32, please let us know if you need any help for this PR! πŸ‘

holylovenia commented 6 months ago

Hi @ssun32, we from SEACrowd are currently looking forward to using your dataloader in our experiments. Do you think it will be possible for you to address the questions and/or requested changes? πŸ˜‚ We aim to merge it by the end of this week.

MJonibek commented 6 months ago

@holylovenia I agree, additional information in `_DESCRIPTION` is a good idea.

holylovenia commented 6 months ago

Hi @ssun32, please let us know if you need any help for this PR! πŸ‘

ssun32 commented 5 months ago

> Hi @ssun32, please let us know if you need any help for this PR! πŸ‘

@holylovenia, I added new commits to address the requested changes. I replaced both `\u200b` and `\u200d` with a space.
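
A minimal sketch of that replacement (the helper name is illustrative; the actual commit may implement it inline):

```python
def replace_zero_width(text: str) -> str:
    """Replace zero-width space (U+200B) and zero-width joiner (U+200D) with a regular space."""
    return text.replace("\u200b", " ").replace("\u200d", " ")

assert replace_zero_width("foo\u200bbar\u200dbaz") == "foo bar baz"
```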