SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #10 | Create beaye_lexicon dataset loader #320

Closed · joanitolopo closed this pull request 7 months ago

joanitolopo commented 9 months ago

Closes #10


danjohnvelasco commented 8 months ago

Hi @joanitolopo 👋, thanks for the PR! Below are my requested changes and queries/concerns.

Requested changes:

For my queries/concerns:

  1. I've read that this is a lexicon containing translations between Indonesian, English, and Beaye words. However, in the current implementation, I only see a list of ind words. Since this is a lexicon with parallel translations, shouldn't we also include the translations and not just the ind words?
  2. There are 2 subsets in the source dataset: lexicon.xlsx (for ind <-> day) and english.xlsx (for eng <-> day). If eng <-> day is of interest for SEACrowd, it should also be implemented as a separate subset. The subsets will have different config names (e.g. beaye_lexicon_ind_day_source and beaye_lexicon_eng_day_source). You might want to modify how the current code creates BUILDER_CONFIGS at line 59; see the sketch after this list for one way that could look.
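
For illustration, the suggested two-config split could look roughly like the sketch below. This is a hedged sketch, not the PR's code: the config names and pair list are assumptions, and SEACrowdConfig, _DATASETNAME, and _SOURCE_VERSION are taken from the snippet shared later in this thread.

```python
# Hedged sketch (not the PR's actual code): one source config per language
# pair, assuming SEACrowdConfig, _DATASETNAME, _SOURCE_VERSION, and the
# datasets module are in scope as elsewhere in the dataloader.
BUILDER_CONFIGS = [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_{pair}_source",
        version=datasets.Version(_SOURCE_VERSION),
        description=f"Beaye lexicon ({pair}) with source schema",
        schema="source",
        subset_id=f"{_DATASETNAME}_{pair}",
    )
    for pair in ["ind_day", "eng_day"]
]
```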

Please do let me know if you have any questions or disagreements. We can discuss it here. Thanks!

joanitolopo commented 8 months ago

Hi @danjohnvelasco 👋, thanks for the requested changes and queries/concerns.

Requested changes:

> Test suite fails on current code. I encourage you to run the test suite on your side via: `python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/beaye_lexicon/beaye_lexicon.py`

Since the dataloader doesn't implement a SEACrowd schema, we need to use the optional argument `--subset_id`. So the test invocation should look like this: `python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/beaye_lexicon/beaye_lexicon.py --subset_id beaye_lexicon_ind`. I also hit an error before, and the suggestion was to use this argument (https://github.com/SEACrowd/seacrowd-datahub/issues/10#issuecomment-1878055738).

Queries/concerns:

> I've read that this is a lexicon containing translations between Indonesian, English, and Beaye words. However, in the current implementation, I only see a list of ind words. Since this is a lexicon with parallel translations, shouldn't we also include the translations and not just the ind words?

I am not sure whether a translation task applies to this dataloader since it only contains word pairs. If it does apply, that means it would be a translation task and we would also need to implement the SEACrowd schema, right? What's your take on this? @danjohnvelasco @holylovenia @sabilmakbar

> There are 2 subsets in the source dataset: lexicon.xlsx (for ind <-> day) and english.xlsx (for eng <-> day). If eng <-> day is of interest for SEACrowd, it should also be implemented as a separate subset. The subsets will have different config names (e.g. beaye_lexicon_ind_day_source and beaye_lexicon_eng_day_source). You might want to modify how the current code creates BUILDER_CONFIGS at line 59.

Unfortunately, the two source datasets are not parallel with each other. lexicon.xlsx contains cleaned data that was selected and evaluated by linguistic experts, while english.xlsx contains extensive English definitions of Beaye words.

Thanks!

holylovenia commented 8 months ago

> I've read that this is a lexicon containing translations between Indonesian, English, and Beaye words. However, in the current implementation, I only see a list of ind words. Since this is a lexicon with parallel translations, shouldn't we also include the translations and not just the ind words?

> I am not sure whether a translation task applies to this dataloader since it only contains word pairs. If it does apply, that means it would be a translation task and we would also need to implement the SEACrowd schema, right? What's your take on this? @danjohnvelasco @holylovenia @sabilmakbar

I think it's fine not to have the t2t schema since we don't usually use word pairs for Tasks.MACHINE_TRANSLATION.

danjohnvelasco commented 8 months ago

Hi @joanitolopo

In that case, maybe only the source schema should be implemented. For lexicon.xlsx, the dataloader should expose two features: ind and day.
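
For illustration only, here is a minimal sketch of what those two source-schema features could look like with the Hugging Face datasets API; the column names ind and day follow the discussion above, and the id field is a hypothetical row identifier.

```python
import datasets

# Hedged sketch: expose the two lexicon columns as plain string features.
# "ind" and "day" follow the discussion; "id" is a hypothetical identifier.
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "ind": datasets.Value("string"),  # Indonesian word
        "day": datasets.Value("string"),  # Beaye word
    }
)
```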

I updated the requested changes below:

Please let me know if you have any questions. Thanks!

joanitolopo commented 8 months ago

Hi @danjohnvelasco

I have already pushed the requested changes.


I added code to handle both source datasets (lexicon.xlsx and english.xlsx):

```python
# The existing configs for lexicon.xlsx (elided above) are kept, and the
# english.xlsx ("ext") configs are appended via list concatenation.
.....
.....
+ [
    SEACrowdConfig(
        name=f"{_DATASETNAME}_ext_{lang}_source",
        version=datasets.Version(_SOURCE_VERSION),
        description="beaye lexicon with source schema for extensive definition of beaye language",
        schema="source",
        subset_id="beaye_lexicon",
    )
    for lang in _LANGUAGES if lang != "ind"
]
```

I use the term "ext" in the config name to differentiate these configs from the lexicon.xlsx ones.
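
As a rough illustration of how the two source files might be selected per config (a hypothetical helper, not the PR's actual code):

```python
# Hypothetical helper: pick the source file from the config name.
# "_ext_" marks the english.xlsx ("extensive definition") configs,
# mirroring the naming in the snippet above.
def _get_source_filename(config_name: str) -> str:
    return "english.xlsx" if "_ext_" in config_name else "lexicon.xlsx"
```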

holylovenia commented 8 months ago

Hi @danjohnvelasco, could you please check whether @joanitolopo's changes address your suggestions? Thanks!

danjohnvelasco commented 7 months ago

LGTM!