Closes #571 | Add/Update Dataloader UP2.0

fhudi commented 2 months ago

Closes #571

Checkbox

[x] Confirm that this PR is linked to the dataset issue.
[x] Create the dataloader script seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py (please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its __init__.py within {my_dataset} folder.
[x] Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _LOCAL, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _SEACROWD_VERSION variables.
[x] Implement _info(), _split_generators() and _generate_examples() in dataloader script.
[ ] Make sure that the BUILDER_CONFIGS class attribute is a list with at least one SEACrowdConfig for the source schema and one for a seacrowd schema.
[x] Confirm dataloader script works with datasets.load_dataset function.
[x] Confirm that your dataloader script passes the test suite run with python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py or python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}.
[ ] If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

Currently implemented as source-only, as there seems to be no matching task. (Based on the information provided in the card)

Issue was initially raised in the issue thread #571.

khelli07 commented 1 month ago

Unfortunately the test did not pass on my end. Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

khelli07 commented 1 month ago

Changed utils/common_parser.py > load_ud_data

dataset_raw = parse(open(filepath).read())

to

with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())

But please confirm if these result are correct or not

Ran through loader if __name__ == "__main__":
A test sample:

holylovenia commented 1 month ago

Hi @SamuelCahyawijaya @ljvmiranda921 @khelli07 and @fhudi, I won't merge this dataloader as-is due to the error reported by @khelli07. Let's wait until @fhudi addresses @ljvmiranda921 and @khelli07's feedback.

PS: Sorry I messed up by assigning this dataloader to 3 people. 🫠 I will adjust the reviewing scores later.

fhudi commented 1 month ago

Unfortunately the test did not pass on my end. Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

@khelli07 Note that this is source-only dataloader.

As mentioned in the reviewing checklist point 3, you need to execute test script using the one that is appropriate under the test folder. Please use the one for source_only.

You can rever to the example of executable test script on how to use it.

P.S.: @holylovenia I can't find any readme file referring to how to test source-only dataset. Maybe lost when migrating from NusaCrowd 🥲.

fhudi commented 1 month ago

Changed utils/common_parser.py > load_ud_data
dataset_raw = parse(open(filepath).read())
to
with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())
But please confirm if these result are correct or not

@khelli07 Thanks for the suggestion. LGTM. But may I know the concise reasoning on these changes?

khelli07 commented 1 month ago

Hi, the change is to avoid this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

It appears when I ran datasets.load_dataset too.

After I include that change, I successfully ran datasets.load_dataset and python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/up2/up2.py --subset_id up2.

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

fhudi commented 1 month ago

Hi, the change is to avoid this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

Could you try with the following combination and see whether it is solved with any of the followings:

open(filepath, encoding='utf-8')
open(filepath, errors='replace')

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

Are you using WSL?

khelli07 commented 1 month ago

Nope, I use Windows without WSL. Let me try it.

khelli07 commented 1 month ago

Ok just finished testing, both of them works. So you might be able to choose either one.

holylovenia commented 1 month ago

Hi @fhudi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Has everything been addressed now? Can we merge this PR?

ljvmiranda921 commented 1 month ago

I think this is just waiting for @khelli07 's review. But from the last comment it seems that it has been resolved 🤔 . I was able to run this so I think it's mostly a platform-specific issue. Dataloader works though so I think it's safe to merge!

fhudi commented 1 month ago

@khelli07 Sorry for the late reply. Honestly, adding the parameter errors='replace' might change the overall behaviour. As this function is used by other DataLoader as well, I am a bit skeptical in including this change. Also, based on your reply, the root cause does not seem to be from the additional parameters. However, I have made the adjustment as seen from the commit. cc: @holylovenia @ljvmiranda921

fhudi commented 1 month ago

@khelli07 @ljvmiranda921 I believe this is save for merging?

sabilmakbar commented 1 month ago

open(filepath, encoding='utf-8') open(filepath, errors='replace')

Sorry for jumping it late. I think this is a platform-based issue where, for non-Linux/MacOS, the encoding doesn't default to utf-8. Hence, the first one is preferable.

sabilmakbar commented 1 month ago

Since both reviewers have given the approver (and the dataloader assignee best not to merge it themselves even though it has been approved), I'll proceed with the merging.

fyi @khelli07 @ljvmiranda921 @SamuelCahyawijaya

SEACrowd / seacrowd-datahub

Closes #571 | Add/Update Dataloader UP2.0 #660

Checkbox