Closed fhudi closed 1 month ago
Unfortunately the test did not pass on my end.
Command I ran: python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2
Changed utils/common_parser.py > load_ud_data
dataset_raw = parse(open(filepath).read())
to
with open(filepath, encoding='utf-8', errors='replace') as file:
dataset_raw = parse(file.read())
But please confirm if these result are correct or not
Ran through loader if __name__ == "__main__"
:
A test sample:
Hi @SamuelCahyawijaya @ljvmiranda921 @khelli07 and @fhudi, I won't merge this dataloader as-is due to the error reported by @khelli07. Let's wait until @fhudi addresses @ljvmiranda921 and @khelli07's feedback.
PS: Sorry I messed up by assigning this dataloader to 3 people. 🫠 I will adjust the reviewing scores later.
Unfortunately the test did not pass on my end. Command I ran:
python -m tests.test_seacrowd seacrowd/sea_datasets/up2/up2.py --subset_id up2
@khelli07 Note that this is source-only dataloader.
As mentioned in the reviewing checklist point 3, you need to execute test script using the one that is appropriate under the test folder. Please use the one for source_only.
You can rever to the example of executable test script on how to use it.
P.S.: @holylovenia I can't find any readme file referring to how to test source-only dataset. Maybe lost when migrating from NusaCrowd 🥲.
Changed
utils/common_parser.py > load_ud_data
dataset_raw = parse(open(filepath).read())
to
with open(filepath, encoding='utf-8', errors='replace') as file: dataset_raw = parse(file.read())
But please confirm if these result are correct or not
@khelli07 Thanks for the suggestion. LGTM. But may I know the concise reasoning on these changes?
Hi, the change is to avoid this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>
It appears when I ran datasets.load_dataset
too.
After I include that change, I successfully ran datasets.load_dataset
and python -m tests.test_seacrowd_source_only seacrowd/sea_datasets/up2/up2.py --subset_id up2
.
I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.
Hi, the change is to avoid this error:
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 282814: character maps to <undefined>
Could you try with the following combination and see whether it is solved with any of the followings:
open(filepath, encoding='utf-8')
open(filepath, errors='replace')
I'm not sure if this is particular to Windows only or not. After the changes, the code might need to be tested on other OS too.
Are you using WSL?
Nope, I use Windows without WSL. Let me try it.
Ok just finished testing, both of them works. So you might be able to choose either one.
Hi @fhudi @khelli07, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
Has everything been addressed now? Can we merge this PR?
I think this is just waiting for @khelli07 's review. But from the last comment it seems that it has been resolved 🤔 . I was able to run this so I think it's mostly a platform-specific issue. Dataloader works though so I think it's safe to merge!
@khelli07
Sorry for the late reply.
Honestly, adding the parameter errors='replace'
might change the overall behaviour.
As this function is used by other DataLoader as well, I am a bit skeptical in including this change.
Also, based on your reply, the root cause does not seem to be from the additional parameters.
However, I have made the adjustment as seen from the commit.
cc: @holylovenia @ljvmiranda921
@khelli07 @ljvmiranda921 I believe this is save for merging?
open(filepath, encoding='utf-8') open(filepath, errors='replace')
Sorry for jumping it late. I think this is a platform-based issue where, for non-Linux/MacOS, the encoding doesn't default to utf-8
. Hence, the first one is preferable.
Since both reviewers have given the approver (and the dataloader assignee best not to merge it themselves even though it has been approved), I'll proceed with the merging.
fyi @khelli07 @ljvmiranda921 @SamuelCahyawijaya
Closes #571
Checkbox
seacrowd/sea_datasets/{my_dataset}/{my_dataset}.py
(please use only lowercase and underscore for dataset folder naming, as mentioned in dataset issue) and its__init__.py
within{my_dataset}
folder._CITATION
,_DATASETNAME
,_DESCRIPTION
,_HOMEPAGE
,_LICENSE
,_LOCAL
,_URLs
,_SUPPORTED_TASKS
,_SOURCE_VERSION
, and_SEACROWD_VERSION
variables._info()
,_split_generators()
and_generate_examples()
in dataloader script.BUILDER_CONFIGS
class attribute is a list with at least oneSEACrowdConfig
for the source schema and one for a seacrowd schema.datasets.load_dataset
function.python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py
orpython -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py --subset_id {subset_name_without_source_or_seacrowd_suffix}
.Currently implemented as source-only, as there seems to be no matching task. (Based on the information provided in the card)
Issue was initially raised in the issue thread #571.