SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #179 | Implement `indo_story_cloze` dataloader #323

Closed chenxwh closed 7 months ago

chenxwh commented 9 months ago

Closes #179.


chenxwh commented 8 months ago

Thanks for the comments! The modifications are incorporated into the latest commit.

danjohnvelasco commented 8 months ago

The code works as expected in a Jupyter notebook.

However, running the test suite via `python -m tests.test_seacrowd seacrowd/sea_datasets/indo_story_cloze/indo_story_cloze.py` returns the following error: `UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 4415: character maps to <undefined>`. I'm not sure if this only happens on my side.

This can be solved by specifying the encoding when you open the CSV file. For example:

data = csv.DictReader(open(filepath[split], newline="", encoding="utf-8"))
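
A slightly fuller sketch of the same fix, for illustration only. It assumes the failure comes from a platform default encoding such as cp1252 (which the 'charmap' message suggests, though the thread does not confirm it). The helper name `read_rows`, the sample file, and the column names `id` and `sentence` are hypothetical and not taken from the dataloader. The sample row contains a right curly quote (U+201D), whose UTF-8 encoding includes the byte 0x9d that cp1252 cannot decode.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of dicts, decoding with an explicit encoding."""
    # The context manager closes the file handle once all rows are read.
    with open(path, newline="", encoding=encoding) as f:
        return list(csv.DictReader(f))


# Hypothetical sample data: U+201D encodes to bytes E2 80 9D in UTF-8.
sample_sentence = "Dia berkata, \u201cHalo\u201d."
sample_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(sample_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "sentence"])
    writer.writeheader()
    writer.writerow({"id": "0", "sentence": sample_sentence})

# Reading with an explicit UTF-8 encoding works on any platform.
rows = read_rows(sample_path)

# Reading the same file as cp1252 reproduces the kind of error reported above.
try:
    read_rows(sample_path, encoding="cp1252")
    cp1252_failed = False
except UnicodeDecodeError:
    cp1252_failed = True
```

Passing `encoding="utf-8"` explicitly removes the dependence on the machine's locale, which would explain why the code could run in one environment and fail in another.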

Other than this, I have no more issues with the code. Thank you for your work :)

chenxwh commented 8 months ago

Thank you, added those!

ljvmiranda921 commented 8 months ago

LGTM! Any thoughts on the changes, @danjohnvelasco ?

chenxwh commented 7 months ago

Can this be closed and merged now?

danjohnvelasco commented 7 months ago

LGTM! Merging this now.