SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0
55 stars 54 forks source link

Closes #571 | Add/Update Dataloader UP2.0 #660

Closed fhudi closed 1 month ago

fhudi commented 2 months ago

Closes #571

Checkbox


Currently implemented as source-only, as there seems to be no matching task. (Based on the information provided in the card)

Issue was initially raised in the issue thread #571.

khelli07 commented 1 month ago

Unfortunately the test did not pass on my end. Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

image

khelli07 commented 1 month ago

Changed utils/common_parser.py > load_ud_data

dataset_raw = parse(open(filepath).read())

to

with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())

But please confirm if these result are correct or not

  1. Ran through loader if __name__ == "__main__": image

  2. A test sample: image

holylovenia commented 1 month ago

Hi @SamuelCahyawijaya @ljvmiranda921 @khelli07 and @fhudi, I won't merge this dataloader as-is due to the error reported by @khelli07. Let's wait until @fhudi addresses @ljvmiranda921 and @khelli07's feedback.

PS: Sorry I messed up by assigning this dataloader to 3 people. 🫠 I will adjust the reviewing scores later.

fhudi commented 1 month ago

Unfortunately the test did not pass on my end. Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2

@khelli07 Note that this is source-only dataloader.

As mentioned in the reviewing checklist point 3, you need to execute test script using the one that is appropriate under the test folder. Please use the one for source_only.

You can rever to the example of executable test script on how to use it.


P.S.: @holylovenia I can't find any readme file referring to how to test source-only dataset. Maybe lost when migrating from NusaCrowd 🥲.

fhudi commented 1 month ago

Changed utils/common_parser.py > load_ud_data

dataset_raw = parse(open(filepath).read())

to

with open(filepath, encoding='utf-8', errors='replace') as file:
    dataset_raw = parse(file.read())

But please confirm if these result are correct or not

@khelli07 Thanks for the suggestion. LGTM. But may I know the concise reasoning on these changes?

khelli07 commented 1 month ago

Hi, the change is to avoid this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

It appears when I ran datasets.load_dataset too.

After I include that change, I successfully ran datasets.load_dataset and python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/up2/up2.py --subset_id up2.

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

fhudi commented 1 month ago

Hi, the change is to avoid this error: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>

Could you try with the following combination and see whether it is solved with any of the followings:

  1. open(filepath, encoding='utf-8')
  2. open(filepath, errors='replace')

I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.

Are you using WSL?

khelli07 commented 1 month ago

Nope, I use Windows without WSL. Let me try it.

khelli07 commented 1 month ago

Ok just finished testing, both of them works. So you might be able to choose either one.

holylovenia commented 1 month ago

Hi @fhudi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

Has everything been addressed now? Can we merge this PR?

ljvmiranda921 commented 1 month ago

I think this is just waiting for @khelli07 's review. But from the last comment it seems that it has been resolved 🤔 . I was able to run this so I think it's mostly a platform-specific issue. Dataloader works though so I think it's safe to merge!

fhudi commented 1 month ago

@khelli07 Sorry for the late reply. Honestly, adding the parameter errors='replace' might change the overall behaviour. As this function is used by other DataLoader as well, I am a bit skeptical in including this change. Also, based on your reply, the root cause does not seem to be from the additional parameters. However, I have made the adjustment as seen from the commit. cc: @holylovenia @ljvmiranda921

fhudi commented 1 month ago

@khelli07 @ljvmiranda921 I believe this is save for merging?

sabilmakbar commented 1 month ago

open(filepath, encoding='utf-8') open(filepath, errors='replace')

Sorry for jumping it late. I think this is a platform-based issue where, for non-Linux/MacOS, the encoding doesn't default to utf-8. Hence, the first one is preferable.

sabilmakbar commented 1 month ago

Since both reviewers have given the approver (and the dataloader assignee best not to merge it themselves even though it has been approved), I'll proceed with the merging.

fyi @khelli07 @ljvmiranda921 @SamuelCahyawijaya