SEACrowd / seacrowd-datahub

A collaborative project to collect datasets in SEA languages, SEA regions, or SEA cultures.
Apache License 2.0

Closes #221 | Add Dataloader NUS SMS Corpus #596

Open akhdanfadh opened 3 months ago

akhdanfadh commented 3 months ago

Closes #221

I implemented one config per language/subset. Thus, configs will look like this: nus_sms_corpus_eng_source, nus_sms_corpus_cmn_seacrowd_ssp, etc. When testing, pass nus_sms_corpus_<subset> to the --subset_id parameter.
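As a rough illustration of the naming scheme described above, the config names can be enumerated as below. This is only a sketch: the subset list contains just the two subsets mentioned in this thread (eng, cmn), and the helper name config_names is hypothetical rather than taken from the dataloader.

```python
# Sketch of the "one config per language/subset" naming scheme.
# Assumption: only the subsets named in this thread are listed here.
_DATASETNAME = "nus_sms_corpus"
_SUBSETS = ["eng", "cmn"]
_SCHEMAS = ["source", "seacrowd_ssp"]


def config_names(dataset=_DATASETNAME, subsets=_SUBSETS, schemas=_SCHEMAS):
    """Return every <dataset>_<subset>_<schema> config name."""
    return [f"{dataset}_{subset}_{schema}" for subset in subsets for schema in schemas]


names = config_names()
for name in names:
    print(name)
```

The value passed to --subset_id is the same name without the schema suffix, for example nus_sms_corpus_eng.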

akhdanfadh commented 2 months ago

I've run the make check_file, please double-check.

I am getting the error message KeyError: '$' when trying to load the dataset. Please advise.

@raileymontalan could you give your test result?

raileymontalan commented 2 months ago

Hi @akhdanfadh, I am using a MacBook, so the issue could be related to this. Please see the error message here:

(env-seacrowd) raileymontalan@Raileys-MacBook-Pro-2023 seacrowd-datahub % python -m tests.test_seacrowd seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py --subset_id="nus_sms_corpus_eng"
INFO:__main__:args: Namespace(path='seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py', schema=None, subset_id='nus_sms_corpus_eng', data_dir=None, use_auth_token=None)
INFO:__main__:self.PATH: seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
INFO:__main__:self.SUBSET_ID: nus_sms_corpus_eng
INFO:__main__:self.SCHEMA: None
INFO:__main__:self.DATA_DIR: None
INFO:__main__:Checking for _SUPPORTED_TASKS ...
module seacrowd.sea_datasets.nus_sms_corpus.nus_sms_corpus
INFO:__main__:Found _SUPPORTED_TASKS=[<Tasks.SELF_SUPERVISED_PRETRAINING: 'SSP'>]
INFO:__main__:_SUPPORTED_TASKS implies _MAPPED_SCHEMAS={'SSP'}
INFO:__main__:schemas_to_check: {'SSP'}
INFO:__main__:Checking load_dataset with config name nus_sms_corpus_eng_source
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:2483: FutureWarning: 'use_auth_token' was deprecated in favor of 'token' in version 2.14.0 and will be removed in 3.0.0.
You can remove this warning by passing 'token=<use_auth_token>' instead.
  warnings.warn(
/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py:922: FutureWarning: The repository for nus_sms_corpus contains custom code which must be executed to correctly load the dataset. You can inspect the repository content at seacrowd/sea_datasets/nus_sms_corpus/nus_sms_corpus.py
You can avoid this message in future by passing the argument `trust_remote_code=True`.
Passing `trust_remote_code=True` will be mandatory to load this dataset from the next major release of `datasets`.
  warnings.warn(
Generating train split: 0 examples [00:01, ? examples/s]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1743, in _prepare_split_single
    example = self.info.features.encode_example(record) if self.info.features is not None else record
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1878, in encode_example
    return encode_nested_example(self, example)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1244, in <dictcomp>
    k: encode_nested_example(sub_schema, sub_obj, level=level + 1)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in encode_nested_example
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/features/features.py", line 1243, in <dictcomp>
    {
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in zip_dict
    yield key, tuple(d[key] for d in dicts)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/utils/py_utils.py", line 326, in <genexpr>
    yield key, tuple(d[key] for d in dicts)
KeyError: '$'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/raileymontalan/Documents/seacrowd-datahub/tests/test_seacrowd.py", line 134, in setUp
    self.dataset_source = datasets.load_dataset(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/load.py", line 2549, in load_dataset
    builder_instance.download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1005, in download_and_prepare
    self._download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1767, in _download_and_prepare
    super()._download_and_prepare(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1100, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1605, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/Users/raileymontalan/miniconda3/envs/env-seacrowd/lib/python3.10/site-packages/datasets/builder.py", line 1762, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

----------------------------------------------------------------------
Ran 1 test in 3.052s

FAILED (errors=1)
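
One plausible reading of the traceback above, sketched under assumptions: the failing line indexes every dict with each key it encounters, so a key that exists in the generated record but not in the declared features (or the reverse) raises KeyError. The function zip_dict_sketch below is a simplified stand-in written for illustration, not the library's actual helper, and the schema and record dicts are made-up examples.

```python
def zip_dict_sketch(*dicts):
    """Simplified stand-in: yield each key with its value from every dict."""
    seen = set()
    for source in dicts:
        for key in source:
            if key not in seen:
                seen.add(key)
                # Raises KeyError if any dict lacks this key.
                yield key, tuple(d[key] for d in dicts)


schema = {"text": "string"}             # hypothetical declared features
record = {"text": "hello", "$": None}   # hypothetical record with an extra "$" key

try:
    list(zip_dict_sketch(schema, record))
    missing_key = None
except KeyError as err:
    missing_key = err.args[0]

print(missing_key)
```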

akhdanfadh commented 2 months ago

@raileymontalan I'm not sure the MacBook is the issue, since I was able to test the code on both Ubuntu and macOS (see image below). Since the error is a KeyError, I'm guessing it comes from Python itself(?) or from something in your environment.

image

holylovenia commented 1 month ago

Hi @raileymontalan, can you try running it on a Linux-based OS? When I tried it on a Mac, it gave the same error as yours, but it ran without any issues on the server.

holylovenia commented 1 month ago

Hi @raileymontalan, a friendly reminder to review once you have the time. 👍

holylovenia commented 1 month ago

Hi @raileymontalan, I would like to let you know that we plan to finalize the calculation of the open contributions (e.g., dataloader implementations) in 31 hours, so it'd be great if we could wrap up the reviewing and merge this PR before then.

cc: @akhdanfadh

sabilmakbar commented 1 month ago

Do you have different versions of datasets on the Mac vs. the server? That is probably the cause.

On my end, the generated data has a $ key that is added iteratively, which the feature list does not expect.

image

Adding a condition so that the $ columns are created only if element.text is available (not None) is probably the best workaround for now.
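
A minimal sketch of that workaround, assuming the dataloader converts XML elements into nested dicts and stores element text under a "$" key. The function element_to_dict, the "@" prefix for attributes, and the sample XML are hypothetical and are not taken from the PR's code.

```python
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an XML element to a nested dict, adding "$" only when text exists."""
    node = {f"@{key}": value for key, value in element.attrib.items()}
    # Workaround: skip the "$" key entirely when the element has no text.
    if element.text is not None:
        node["$"] = element.text
    for child in element:
        node[child.tag] = element_to_dict(child)
    return node


sample = ET.fromstring(
    '<message id="1"><text>hello</text><source><phoneModel/></source></message>'
)
converted = element_to_dict(sample)
print(converted)
```

Elements without text, such as source and phoneModel in the sample, then produce no "$" entry at all.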

sabilmakbar commented 1 month ago

Update: it works for the eng subset, but I'm still looking for the cause in the cmn subset.

raileymontalan commented 1 month ago

Still getting the same issues as before when testing on Mac. My datasets version is 2.16.1
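
To compare the Mac and server environments side by side, a small report like the one below could be run on each machine. This is a sketch: the helper name environment_report is hypothetical, and it relies only on the standard library to look up the installed datasets version.

```python
import platform
import sys
from importlib.metadata import PackageNotFoundError, version


def environment_report(packages=("datasets",)):
    """Collect the OS, Python version, and installed package versions."""
    report = {"os": platform.system(), "python": sys.version.split()[0]}
    for name in packages:
        try:
            report[name] = version(name)
        except PackageNotFoundError:
            # Package is not installed in this environment.
            report[name] = None
    return report


report = environment_report()
print(report)
```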

holylovenia commented 14 hours ago

Hi @akhdanfadh, thank you for contributing to SEACrowd! I would like to let you know that we are still looking forward to completing this PR (and dataloader issues) and maintaining SEACrowd Data Hub. We hope to enable access to as many standardized dataloaders as possible for SEA datasets. ☺️

Feel free to continue the PR whenever you're available, and if you would like to re-assign this dataloader to someone else, just let us know and we can help. 💪

Thanks again!

PS: If the issue still persists on MacOS and we cannot find a workaround, should we just wrap it up and add a note that it's only usable for Linux in the _DESCRIPTION?

cc: @raileymontalan @sabilmakbar