Closes #449 | Add/Update Dataloader Thai LOTUS

sabilmakbar commented 2 months ago

Closes #449

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/tha_lotus/tha_lotus.py (please use only lowercase letters and underscores for dataset folder naming, as mentioned in the dataset issue) and its __init__.py within the {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in the dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm that the dataloader script works with the datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with:
1. python -m tests.test_seacrowd seacrowd/sea_datasets/tha_lotus/tha_lotus.py --subset_id tha_lotus_closetalk_clean
2. python -m tests.test_seacrowd seacrowd/sea_datasets/tha_lotus/tha_lotus.py --subset_id tha_lotus_closetalk_office
3. python -m tests.test_seacrowd seacrowd/sea_datasets/tha_lotus/tha_lotus.py --subset_id tha_lotus_unidrection_clean
4. python -m tests.test_seacrowd seacrowd/sea_datasets/tha_lotus/tha_lotus.py --subset_id tha_lotus_unidrection_office
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

holylovenia commented 1 month ago

python -m tests.test_seacrowd seacrowd/sea_datasets/tha_lotus/tha_lotus.py --subset_id tha_lotus_closetalk_clean

Traceback (most recent call last):
  File "/Users/faridadilazuarda/miniconda3/envs/env-seacrowd/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/Users/faridadilazuarda/miniconda3/envs/env-seacrowd/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/Users/faridadilazuarda/Documents/GitHub/seacrowd-datahub/tests/test_seacrowd.py", line 14, in <module>
    from seacrowd.utils.constants import Tasks, TASK_TO_SCHEMA, VALID_TASKS, VALID_SCHEMAS, SCHEMA_TO_FEATURES, TASK_TO_FEATURES
  File "/Users/faridadilazuarda/Documents/GitHub/seacrowd-datahub/seacrowd/__init__.py", line 1, in <module>
    from .utils.constants import Tasks
  File "/Users/faridadilazuarda/Documents/GitHub/seacrowd-datahub/seacrowd/utils/constants.py", line 40, in <module>
    class Tasks(Enum):
  File "/Users/faridadilazuarda/Documents/GitHub/seacrowd-datahub/seacrowd/utils/constants.py", line 131, in Tasks
    OPTICAL_CHARACTER_RECOGNITION = "OCR"
  File "/Users/faridadilazuarda/miniconda3/envs/env-seacrowd/lib/python3.10/enum.py", line 134, in __setitem__
    raise TypeError('Attempted to reuse key: %r' % key)
TypeError: Attempted to reuse key: 'OPTICAL_CHARACTER_RECOGNITION'

Hello, any idea what is causing this error? @sabilmakbar @holylovenia

Can you pull from master and see whether the error persists, @faridlazuarda? Previously, constants.py had duplicated lines for OPTICAL_CHARACTER_RECOGNITION, which is what triggered the error for me.
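
For reference, here is a minimal sketch of how a duplicated member line in an Enum body produces this kind of TypeError. It is not the actual contents of seacrowd/utils/constants.py: the two-line class body and the helper name define_tasks_with_duplicate are assumptions made for illustration only. The exact wording of the message depends on the Python version, so the sketch prints it instead of hard-coding it.

```python
from enum import Enum


def define_tasks_with_duplicate():
    """Build an Enum whose body assigns the same member name twice.

    Hypothetical stand-in for a constants module with a duplicated line.
    """
    class Tasks(Enum):
        OPTICAL_CHARACTER_RECOGNITION = "OCR"
        OPTICAL_CHARACTER_RECOGNITION = "OCR"  # duplicated line, rejected by Enum

    return Tasks


try:
    define_tasks_with_duplicate()
    error_message = None
except TypeError as exc:
    # Enum refuses to rebind a member name while the class body is executing.
    error_message = str(exc)

print(error_message)
```

Deleting one of the two identical lines lets the class definition succeed, which matches the suggestion to pull the fixed constants.py from master.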

sabilmakbar commented 1 month ago

okay, thanks for the reviews, @holylovenia and @faridlazuarda

SEACrowd / seacrowd-datahub

Closes #449 | Add/Update Dataloader Thai LOTUS #655
