Closed ssun32 closed 5 months ago
Hi @ssun32, thank you for your contribution. I haven't done an in-depth check, but I noticed that you have a subset per language. However, we only need a `source` schema and a `seacrowd` schema to facilitate returning all data, because the task is language identification. cc: Please CMIIW, @MJonibek.
Hello @holylovenia, thank you so much for taking the time to review this pull request. The issue with having only a `source` schema and a `seacrowd` schema that load all languages is that the user would have little control over which languages to load. For example, if a user only needs the "Malay" and "Tamil" subsets, they must write a filter function over the entire dataset with 100+ languages to pick out the Malay and Tamil instances. It might be better to allow the user to load the Malay and Tamil subsets independently and concatenate them.

Would it be better to modify the dataloader so that `source` and `seacrowd` return all languages, while `source_[lang_id]` and `seacrowd_[lang_id]` return the subsets?
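To make the trade-off concrete, here is a plain-Python sketch of the two access patterns (the `(text, lang)` row layout and the language codes are illustrative assumptions, not the actual dataloader schema): filtering one aggregate dataset versus concatenating independently loaded per-language subsets.

```python
# Toy corpus standing in for the full 100+ language dataset; the
# (text, lang) layout is an assumption for illustration only.
corpus = [
    ("Apa khabar?", "zsm"),
    ("Vanakkam", "tam"),
    ("Kumusta", "ceb"),
    ("Sawasdee", "tha"),
]

wanted = {"zsm", "tam"}  # the user only needs Malay and Tamil

# Pattern A: one aggregate subset; the user filters afterwards over
# every language in the corpus.
filtered = [row for row in corpus if row[1] in wanted]

# Pattern B: per-language subsets loaded independently, then concatenated
# (what source_[lang_id] / seacrowd_[lang_id] configs would enable).
per_lang = {lang: [r for r in corpus if r[1] == lang] for lang in wanted}
combined = per_lang["zsm"] + per_lang["tam"]

# Both patterns yield the same rows; B just avoids making the user
# scan unrelated languages.
assert sorted(filtered) == sorted(combined)
```

Both routes end with the same data; the question is only where the selection happens.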
I see what you mean. Do you suppose that splitting an LID dataset based on its language labels (i.e., `{_DATASETNAME}_{lang}_seacrowd_text` and `{_DATASETNAME}_{lang}_source` subsets) has any merit? If yes, then we can leave it be as long as the `{_DATASETNAME}_seacrowd_text` and `{_DATASETNAME}_source` subsets return all of the data.
Hello @ssun32, I agree with @holylovenia. I understand your point, but writing the filter function is not a big problem. Also, while using `load_dataset('seacrowd/sea_datasets/lti_langid_corpus/lti_langid_corpus.py', name='lti_langid_corpus_source')`, I get 0 examples generated.
Wait @MJonibek, now I've been enlightened and I quite understand @ssun32's point after thinking for a while... If we have per-language subsets, it would be easier to have different subset compositions.

For example, if I only want languages of Indonesia, I can just use `{_DATASETNAME}_ind_seacrowd_text`, `{_DATASETNAME}_sun_seacrowd_text`, and `{_DATASETNAME}_jav_seacrowd_text` without including `{_DATASETNAME}_mya_seacrowd_text` and `{_DATASETNAME}_ceb_seacrowd_text`.

Now I'm wondering if we should have per-language subsets for all of the existing and the incoming LID dataloaders.
Do you think my logic makes sense, @MJonibek @ssun32? Or is there any other idea you'd suggest?
cc: I'm tagging @sabilmakbar in this discussion loop since we were discussing the same thing on Discord.
The issue with this dataset is that it requires running a complicated 500+ line shell script to preprocess the data, and I took a naive approach of running that script via `subprocess.call`, hoping that it would support most operating systems other than Windows. However, I just realized some of the binaries are not that cross-platform compatible; e.g., the `subsample` binary only works on my Ubuntu server but not on my Mac.

@holylovenia @MJonibek Should we convert this to a local dataset and ask the users to run the preprocessing script manually instead?
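Whichever way the decision goes, a hedged sketch of guarding the subprocess call so the failure is explicit rather than a silent 0-example run; the function names, arguments, and the Linux-only assumption are mine, inferred from the `subsample` behaviour described above.

```python
import subprocess
import sys

# Platforms where the bundled preprocessing binaries are assumed to work,
# based on the observation that `subsample` runs on Ubuntu but not macOS.
_SUPPORTED_PREFIXES = ("linux",)


def platform_supported(platform: str = sys.platform) -> bool:
    """Return True if the preprocessing binaries are expected to run here."""
    return platform.startswith(_SUPPORTED_PREFIXES)


def run_preprocessing(script_path: str, *args: str) -> None:
    """Run the (hypothetical) preprocessing shell script, failing loudly."""
    if not platform_supported():
        raise OSError(
            "Preprocessing binaries are only known to work on Linux; "
            "please run the script manually and point the dataloader at "
            "its output."
        )
    # check=True raises on a non-zero exit status instead of letting the
    # loader later generate 0 examples from missing files.
    subprocess.run(["bash", script_path, *args], check=True)
```

Failing fast with a clear message would also explain MJonibek's "0 examples" result if the script silently failed on his machine.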
@ssun32 May I know what kind of preprocessing is needed? Is it to download the data files, to index them, or something else?
Regarding this, I hope you don't mind waiting for a little more time, @ssun32. We're going to have an internal discussion on what subsets LID dataloaders should consist of. I'll get back to you soon.
Hi @ssun32, for LID dataloaders, we agreed upon having an aggregate `source` schema and an aggregate `seacrowd` schema to facilitate returning all data. No need for per-language subsets. If future users need to filter some language labels, we will let them do it after loading the data.
We will check the other LID dataloaders to ensure that the language labels are also in the same format as used here, i.e., 3-letter ISO code.
cc: @MJonibek @SamuelCahyawijaya @sabilmakbar
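Since the label format is to be standardised on 3-letter ISO 639-3 codes, a minimal normalisation sketch; the mapping below covers only a few SEA languages for illustration, and a real dataloader would carry the full table for its 100+ languages.

```python
# Partial 2-letter (ISO 639-1) to 3-letter (ISO 639-3) mapping; this is an
# assumed illustrative subset, not the dataloader's actual table.
ISO_639_3 = {
    "id": "ind", "ms": "msa", "jv": "jav", "su": "sun",
    "my": "mya", "ta": "tam", "th": "tha", "vi": "vie",
}


def normalize_label(label: str) -> str:
    """Return a lowercase 3-letter code; pass through codes already 3 letters."""
    label = label.lower()
    return label if len(label) == 3 else ISO_639_3[label]
```

A check like this in each LID dataloader would keep the labels consistent across datasets, as discussed.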
Hi @ssun32, please let us know if you need any help with this PR!
Hi @ssun32, we from SEACrowd are currently looking forward to using your dataloader in our experiments. Do you think it will be possible for you to address the questions and/or requested changes? We aim to merge it by the end of this week.
@holylovenia I agree, additional information in `_DESCRIPTION` is a good idea.
@holylovenia, I've added new commits to address the requested changes. I replaced both `\u200b` and `\u200d` with a space.
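For reference, that zero-width character cleanup might look like the following sketch; the function name is mine, and only the `\u200b`/`\u200d`-to-space replacement comes from the comment above.

```python
import re

# Replace zero-width space (U+200B) and zero-width joiner (U+200D) with a
# plain space, per the fix described in the commits.
_ZERO_WIDTH = re.compile("[\u200b\u200d]")


def clean_text(text: str) -> str:
    return _ZERO_WIDTH.sub(" ", text)
```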
Closes #364
Checklist

- Dataloader script created at `seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables filled in.
- `_info()`, `_split_generators()`, and `_generate_examples()` implemented in the dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
- Dataloader works with the `datasets.load_dataset` function.
- Tests pass: `python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.