SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #10 | Create beaye_lexicon dataset loader #320

Closed · joanitolopo closed this pull request 7 months ago

joanitolopo commented 9 months ago

Closes #10


danjohnvelasco commented 8 months ago

Hi @joanitolopo 👋, thanks for the PR! Below are my requested changes and queries/concerns.

Requested changes:

For my queries/concerns:

  1. I've read that this is a lexicon containing translations between Indonesian, English, and Beaye words. However, in the current implementation, I only see a list of ind words. Since this is a lexicon with parallel translations, shouldn't we also include the translations and not just the ind words?
  2. There are 2 subsets in the source dataset: lexicon.xlsx (for ind <-> day) and english.xlsx (for eng <-> day). If eng <-> day is of interest for SEACrowd, it should also be implemented as a separate subset. The subsets will have different config names (e.g. beaye_lexicon_ind_day_source and beaye_lexicon_eng_day_source). You might want to modify how the current code creates BUILDER_CONFIGS at line 59; see the sketch after this list for one way that could look.
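
For illustration, the suggested two-config split could look roughly like the sketch below. This is a hedged sketch, not the PR's code: the config names and pair list are assumptions, and SEACrowdConfig, _DATASETNAME, and _SOURCE_VERSION are taken from the snippet shared later in this thread.

```python
# Hedged sketch (not the PR's actual code): one source config per language
# pair, assuming SEACrowdConfig, _DATASETNAME, _SOURCE_VERSION, and the
# datasets module are in scope as elsewhere in the dataloader.
BUILDER_CONFIGS = [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_{pair}_source",
        version=datasets.Version(_SOURCE_VERSION),
        description=f"Beaye lexicon ({pair}) with source schema",
        schema="source",
        subset_id=f"{_DATASETNAME}_{pair}",
    )
    for pair in ["ind_day", "eng_day"]
]
```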

Please do let me know if you have any questions or disagreements. We can discuss it here. Thanks!

joanitolopo commented 8 months ago

Hi @danjohnvelasco 👋, thanks for the requested changes and queries/concerns.

Requested changes:

> Test suite fails on current code. I encourage you to run the test suite on your side via: `python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/beaye_lexicon/beaye_lexicon.py`

Since the dataloader doesn't implement a SEACrowd schema, we need to use the optional argument `--subset_id`. So the test invocation should look like this: `python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/beaye_lexicon/beaye_lexicon.py --subset_id beaye_lexicon_ind`. I also hit an error before, and the suggestion was to use this argument (https://github.com/SEACrowd/seacrowd-datahub/issues/10#issuecomment-1878055738).

Queries/concerns:

> I've read that this is a lexicon containing translations between Indonesian, English, and Beaye words. However, in the current implementation, I only see a list of ind words. Since this is a lexicon with parallel translations, shouldn't we also include the translations and not just the ind words?

I am not sure whether a translation task applies to this dataloader since it only contains word pairs. If it does apply, that means it would be a translation task and we would also need to implement the SEACrowd schema, right? What's your take on this? @danjohnvelasco @holylovenia @sabilmakbar

> There are 2 subsets in the source dataset: lexicon.xlsx (for ind <-> day) and english.xlsx (for eng <-> day). If eng <-> day is of interest for SEACrowd, it should also be implemented as a separate subset. The subsets will have different config names (e.g. beaye_lexicon_ind_day_source and beaye_lexicon_eng_day_source). You might want to modify how the current code creates BUILDER_CONFIGS at line 59.

Unfortunately, the two source datasets are not parallel with each other. lexicon.xlsx contains cleaned data that was selected and evaluated by linguistic experts, while english.xlsx contains extensive English definitions of Beaye words.

Thanks!

holylovenia commented 8 months ago

> I've read that this is a lexicon containing translations between Indonesian, English, and Beaye words. However, in the current implementation, I only see a list of ind words. Since this is a lexicon with parallel translations, shouldn't we also include the translations and not just the ind words?

> I am not sure whether a translation task applies to this dataloader since it only contains word pairs. If it does apply, that means it would be a translation task and we would also need to implement the SEACrowd schema, right? What's your take on this? @danjohnvelasco @holylovenia @sabilmakbar

I think it's fine not to have the t2t schema since we don't usually use word pairs for Tasks.MACHINE_TRANSLATION.

danjohnvelasco commented 8 months ago

Hi @joanitolopo

In that case, maybe only the source schema should be implemented. For lexicon.xlsx, the dataloader should expose two features: ind and day.
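
For illustration only, here is a minimal sketch of what those two source-schema features could look like with the Hugging Face datasets API; the column names ind and day follow the discussion above, and the id field is a hypothetical row identifier.

```python
import datasets

# Hedged sketch: expose the two lexicon columns as plain string features.
# "ind" and "day" follow the discussion; "id" is a hypothetical identifier.
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "ind": datasets.Value("string"),  # Indonesian word
        "day": datasets.Value("string"),  # Beaye word
    }
)
```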

I updated the requested changes below:

Please let me know if you have any questions. Thanks!

joanitolopo commented 8 months ago

Hi @danjohnvelasco

I have already pushed the requested changes.


I added code to handle both source datasets (lexicon.xlsx and english.xlsx):

```python
# The existing configs for lexicon.xlsx (elided above) are kept, and the
# english.xlsx ("ext") configs are appended via list concatenation.
.....
.....
+ [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_ext_{lang}_source",
        version=datasets.Version(_SOURCE_VERSION),
        description="beaye lexicon with source schema for extensive definition of beaye language",
        schema="source",
        subset_id="beaye_lexicon",
    )
    for lang in _LANGUAGES if lang != "ind"
]
```

I use the term "ext" in the config name to differentiate these configs from the lexicon.xlsx ones.
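
As a rough illustration of how the two source files might be selected per config (a hypothetical helper, not the PR's actual code):

```python
# Hypothetical helper: pick the source file from the config name.
# "_ext_" marks the english.xlsx ("extensive definition") configs,
# mirroring the naming in the snippet above.
def _get_source_filename(config_name: str) -> str:
    return "english.xlsx" if "_ext_" in config_name else "lexicon.xlsx"
```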

holylovenia commented 8 months ago

Hi @danjohnvelasco, could you please check whether @joanitolopo's changes address your suggestions? Thanks!

danjohnvelasco commented 7 months ago

LGTM!