Open akhdanfadh opened 3 months ago
I've run the `make check_file`, please double-check.
I am getting the error message `KeyError: '$'` when trying to load the dataset. Please advise.
@raileymontalan could you give your test result?
Hi @akhdanfadh, I am using a MacBook, so the issue could be related to this. Please see the error message here:
(env-seacrowd) raileymontalan@Raileys-MacBook-Pro-2023 seacrowd-datahub % python -m tests.test_seacrowd seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py --subset_id="nus_sms_corpus_eng"
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py', schema=None, subset_id='nus_sms_corpus_eng', data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
INFO:__main__:self.SUBSET_ID: nus_sms_corpus_eng
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.nus_sms_corpus.nus_sms_corpus
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.SELF_SUPERVISED_PRETRAINING: 'SSP'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SSP'}
INFO:__main__:schemas_to_check: {'SSP'}
INFO:__main__:Checking load_dataset with config name nus_sms_corpus_eng_source
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:2483: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
warnings.warn(
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for nus_sms_corpus contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
warnings.warn(
Generating train split: 0 examples [00:01, ? examples/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1743, in _prepare_split_single
example = self.info.features.encode_example(record) if self.info.features is not None else record
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1878, in encode_example
return encode_nested_example(self, example)
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
{
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
{
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
{
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in <dictcomp>
{
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in zip_dict
yield key, tuple(d[key] for d in dicts)
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
yield key, tuple(d[key] for d in dicts)
KeyError: '$'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/raileymontalan/Documents/seacrowd-datahub/tests/test_seacrowd.py", line 134, in setUp
self.dataset_source = datasets.load_dataset(
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
builder_instance.download_and_prepare(
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
self._download_and_prepare(
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1767, in _download_and_prepare
super()._download_and_prepare(
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1605, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1762, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset
----------------------------------------------------------------------
Ran 1 test in 3.052s
FAILED (errors=1)
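For anyone reading the traceback: the last frames are in `zip_dict`, which walks the keys of the feature schema and of the example together and indexes both with each key. The snippet below is a simplified stand-in written for illustration (it is not the library's code, and the schema and records are made up), showing how a key that exists on only one side ends in the same `KeyError`:

```python
from itertools import chain


def zip_dict_sketch(*dicts):
    """Simplified stand-in: yield each key with the value from every dict."""
    seen = set()
    for key in chain(*dicts):
        if key in seen:
            continue
        seen.add(key)
        # Raises KeyError if any of the dicts lacks this key.
        yield key, tuple(d[key] for d in dicts)


# Hypothetical schema and records, only for illustration.
schema = {"@id": "string", "text": "string"}
matching = {"@id": "1", "text": "hi"}
with_stray_key = {"@id": "2", "text": "yo", "$": "stray"}

pairs = dict(zip_dict_sketch(schema, matching))

try:
    dict(zip_dict_sketch(schema, with_stray_key))
    missing_key = None
except KeyError as exc:
    missing_key = exc.args[0]

print(pairs)
print("missing key:", missing_key)
```

If this is what happens here, the example and the declared features disagree about where `$` appears, which would be a property of the data and the loader rather than of the operating system.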
@raileymontalan I'm not sure it's a MacBook issue, since I was able to test the code on both Ubuntu and macOS (see image below). Since the error is a KeyError, I'm guessing it is about Python itself(?), or something in your environment.
Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server.
Hi @raileymontalan, a friendly reminder to review once you have the time. 👍
Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.
cc: @akhdanfadh
Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried on Mac, it gave the same error as yours, but I managed to run it without any issues on the server.
Do you have different versions of `datasets` on the Mac vs. the server? That was probably the case.
On my end, the generated data has the `$` key generated iteratively, which is a bit unexpected given the feature list.
Probably adding a condition to create the `$` cols only if `element.text` is available (not `None`) is the best workaround for now.
Update: works for the `eng` subset, but I'm still looking for the cause for the `cmn` subset.
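A minimal sketch of that workaround, assuming the loader turns XML elements into nested dicts with attributes under `@name` keys and element text under `$`. The helper name `element_to_dict` and the sample XML are hypothetical and not taken from the dataloader:

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Hypothetical helper: convert an XML element into a nested dict."""
    record = {f"@{name}": value for name, value in element.attrib.items()}
    # Only create the "$" entry when the element actually carries text.
    if element.text is not None:
        record["$"] = element.text
    for child in element:
        record[child.tag] = element_to_dict(child)
    return record


# Made-up sample; <message> and <srcNumber/> have no text of their own.
sample = '<message id="1"><text>hello</text><source><srcNumber/></source></message>'
converted = element_to_dict(ET.fromstring(sample))
print(converted)
```

With this guard, elements without text no longer get a `$` entry. Whether that matches the declared features for every subset still has to be checked against the actual data.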
Still getting the same issues as before when testing on Mac. My `datasets` version is 2.16.1.
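To compare the Mac and server environments side by side, something like the following could be run on both machines. It is only a sketch; the function name `environment_report` is made up for this thread, and it relies on the standard-library `platform` and `importlib.metadata` modules:

```python
import platform
from importlib import metadata


def environment_report(packages=("datasets",)):
    """Collect OS, Python version, and installed package versions."""
    report = {"os": platform.system(), "python": platform.python_version()}
    for name in packages:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = None
    return report


print(environment_report())
```

Posting the output from both machines would show whether the `datasets` or Python versions differ.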
Hi @akhdanfadh, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️
Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪
Thanks again!
PS: If the issue still persists on macOS and we cannot find a workaround, should we just wrap it up and add a note that it's only usable for Linux in the `_DESCRIPTION`?
cc: @raileymontalan @sabilmakbar
Closes #221
I implemented one config per language/subset. Thus, configs will look like this: `nus_sms_corpus_eng_source`, `nus_sms_corpus_cmn_seacrowd_ssp`, etc. When testing, pass `nus_sms_corpus_<subset>` to the `--subset_id` parameter.
Checkbox
`seacrowd/sea_datasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
`_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_SEACROWD_VERSION` variables.
`_info()`, `_split_generators()` and `_generate_examples()` in dataloader script.
`BUILDER_CONFIGS` class attribute is a list with at least one `SEACrowdConfig` for the source schema and one for a seacrowd schema.
`datasets.load_dataset` function.
`python -m tests.test_seacrowd seacrowd/sea_datasets/<my_dataset>/<my_dataset>.py`.
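To illustrate the config naming described at the top of this PR description, here is a small sketch that builds the names from a dataset name, subsets, and schemas. The subset list only contains the two subsets mentioned in this thread, and the helper names are hypothetical rather than taken from the dataloader:

```python
_DATASETNAME = "nus_sms_corpus"
_SUBSETS = ("eng", "cmn")  # subsets mentioned in this thread
_SCHEMAS = ("source", "seacrowd_ssp")


def config_names(dataset=_DATASETNAME, subsets=_SUBSETS, schemas=_SCHEMAS):
    """One config name per (subset, schema) pair."""
    return [f"{dataset}_{subset}_{schema}" for subset in subsets for schema in schemas]


def subset_id(subset, dataset=_DATASETNAME):
    """Value to pass to --subset_id when running the test suite."""
    return f"{dataset}_{subset}"


names = config_names()
print(names)
print(subset_id("eng"))
```

The value returned by `subset_id("eng")` is the same `nus_sms_corpus_eng` that appears in the test command earlier in the thread.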