IndoNLP / nusa-crowd

A collaborative project to collect datasets in Indonesian languages.
Apache License 2.0

The dataset kopi_cc with the config name kopi_cc_2022_05-neardup_clean_nusantara_ssp is not complete #337

Open cahya-wirawan opened 1 year ago

cahya-wirawan commented 1 year ago

## Describe the bug

The dataset kopi_cc with the config name kopi_cc_2022_05-neardup_clean_nusantara_ssp appears to be incomplete. Loading fails because the file cleaned_oscar-neardup-000000000019.json.gz, and every file numbered above it, cannot be found. According to the source code, the files should run from cleaned_oscar-neardup-000000000001.json.gz to cleaned_oscar-neardup-000000000035.json.gz. Alternatively, the file list in the loader may need to stop at 18.
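To confirm which of the 35 expected files are actually missing, the check can be sketched as below. This is a minimal sketch, not part of the NusaCrowd loader: the base URL and the 12-digit zero-padded file name are taken from the error message in this report, while `shard_url`, `url_exists`, and `missing_shards` are hypothetical helper names introduced only for illustration.

```python
from typing import Callable, List
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Base URL copied from the FileNotFoundError in this report.
BASE_URL = (
    "https://huggingface.co/datasets/munggok/KoPI-CC/resolve/main/"
    "2022_05/neardup_clean"
)


def shard_url(index: int) -> str:
    """Build the URL of one shard; the index is zero-padded to 12 digits."""
    return f"{BASE_URL}/cleaned_oscar-neardup-{index:012d}.json.gz"


def url_exists(url: str) -> bool:
    """Return False on HTTP 404, True on success; other HTTP errors propagate."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=30):
            return True
    except HTTPError as err:
        if err.code == 404:
            return False
        raise


def missing_shards(
    first: int = 1,
    last: int = 35,
    exists: Callable[[str], bool] = url_exists,
) -> List[int]:
    """List the shard indices in [first, last] whose URL is not found."""
    return [i for i in range(first, last + 1) if not exists(shard_url(i))]
```

Calling `missing_shards()` with the defaults issues one request per file, so it needs network access. The `exists` parameter allows the probe to be replaced, for example when running offline.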

## Steps to reproduce the bug


# Sample code to reproduce the bug
from datasets import load_dataset

ds = load_dataset("./nusacrowd/nusa_datasets/kopi_cc", "kopi_cc_2022_05-neardup_clean_nusantara_ssp")

## Expected results
The dataset loads without errors, with every data file listed in the loader available for download.

## Actual results
Downloading and preparing dataset kopi_cc/kopi_cc_2022_05-neardup_clean_nusantara_ssp to /home/cahya/.cache/huggingface/datasets/kopi_cc/kopi_cc_2022_05-neardup_clean_nusantara_ssp/1.0.0/c65901d6126a9e2fc45208767b8f69ba70a4c12a2093f7399fe420bae78c0c05...
Downloading data files:   0%|          | 0/35 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cahya/Work/miniconda3/envs/transformers/lib/python3.9/site-packages/datasets/load.py", line 1757, in load_dataset
    builder_instance.download_and_prepare(
  ...

  File "/home/cahya/Work/miniconda3/envs/transformers/lib/python3.9/site-packages/datasets/utils/py_utils.py", line 346, in _single_map_nested
    return function(data_struct)
  File "/home/cahya/Work/miniconda3/envs/transformers/lib/python3.9/site-packages/datasets/download/download_manager.py", line 357, in _download
    return cached_path(url_or_filename, download_config=download_config)
  File "/home/cahya/Work/miniconda3/envs/transformers/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 183, in cached_path
    output_path = get_from_cache(
  File "/home/cahya/Work/miniconda3/envs/transformers/lib/python3.9/site-packages/datasets/utils/file_utils.py", line 530, in get_from_cache
    raise FileNotFoundError(f"Couldn't find file at {url}")
FileNotFoundError: Couldn't find file at https://huggingface.co/datasets/munggok/KoPI-CC/resolve/main/2022_05/neardup_clean/cleaned_oscar-neardup-000000000019.json.gz
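Until the missing files are uploaded, one possible workaround is to load only the shards that exist, bypassing the NusaCrowd loader. The sketch below rests on assumptions. The upper bound of 18 is only the guess made in this report, not a verified count. `available_shard_urls` and `load_available_shards` are hypothetical helper names. The generic `json` builder returns the raw records rather than the `nusantara_ssp` schema.

```python
from typing import List

# Base URL copied from the FileNotFoundError above.
BASE_URL = (
    "https://huggingface.co/datasets/munggok/KoPI-CC/resolve/main/"
    "2022_05/neardup_clean"
)


def available_shard_urls(last_available: int = 18) -> List[str]:
    """URLs of shards 1..last_available (18 is an assumption from the report)."""
    return [
        f"{BASE_URL}/cleaned_oscar-neardup-{i:012d}.json.gz"
        for i in range(1, last_available + 1)
    ]


def load_available_shards(last_available: int = 18):
    """Load the existing shards with the generic JSON builder of `datasets`."""
    # Imported lazily so the URL helper works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(
        "json",
        data_files=available_shard_urls(last_available),
        split="train",
    )
```

`load_available_shards()` downloads every listed shard, so it is only practical with enough disk space and bandwidth.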

## Environment info
- `datasets` version: 2.8.1.dev0
- Platform: Linux-5.13.0-1027-gcp-x86_64-with-glibc2.31
- Python version: 3.9.15
- PyArrow version: 8.0.0
- Pandas version: 1.5.1

SamuelCahyawijaya commented 1 year ago

@cahya-wirawan, sorry for the late reply. Tagging Pak @munggok to help solve this issue.

acul3 commented 1 year ago

Hello Mas Cahya @cahya-wirawan, sorry this took so long 😆

I will update it tomorrow with an additional new snapshot.