Closes #179 | Implement `indo_story_cloze` dataloader

chenxwh commented 9 months ago

Closes #179.

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[x] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py.
[x] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

chenxwh commented 8 months ago

Thanks for the comments! The modifications are incorporated into the latest commit.

danjohnvelasco commented 8 months ago

The code works as expected in a Jupyter notebook.

However, running the test suite via `python -m tests.test_seacrowd seacrowd/sea_datasets/indo_story_cloze/indo_story_cloze.py` returns the following error: `UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4415: character maps to <undefined>`. I'm not sure whether this happens only on my side.

This can be solved by specifying the encoding when you open the CSV file. For example:

data = csv.DictReader(open(filepath[split], newline="", encoding="utf-8"))
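A slightly fuller sketch of the same idea, in case it helps. It assumes the error comes from a platform default encoding such as cp1252 (common on Windows), in which byte 0x9d is undefined. The `read_rows` helper and the sample file are hypothetical and are not part of the dataloader.

```python
import csv
import os
import tempfile


def read_rows(path):
    """Read a CSV file into a list of dicts, decoding it explicitly as UTF-8."""
    # Hypothetical helper: the context manager also closes the file handle.
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


# Hypothetical sample file containing curly quotes; in UTF-8 the closing
# quote U+201D is encoded as the bytes E2 80 9D.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["sentence"])
    writer.writerow(["Dia berkata, \u201cHalo\u201d"])

# Reading with an explicit UTF-8 encoding round-trips the text.
rows = read_rows(sample_path)

# Reading the same file as cp1252 fails, because 0x9d has no mapping there.
try:
    with open(sample_path, newline="", encoding="cp1252") as f:
        list(csv.DictReader(f))
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
```

Passing `encoding="utf-8"` makes the result independent of the machine's locale, which would explain why the script runs in one environment and fails in another.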

Other than this, I have no more issues with the code. Thank you for your work :)

chenxwh commented 8 months ago

Thank you, added those!

ljvmiranda921 commented 8 months ago

LGTM! Any thoughts on the changes, @danjohnvelasco ?

chenxwh commented 7 months ago

Can this be closed and merged now?

danjohnvelasco commented 7 months ago

LGTM! Merging this now.

SEACrowd / seacrowd-datahub

Closes #179 | Implement `indo_story_cloze` dataloader #323
