AMAAI-Lab / MidiCaps

A large-scale dataset of caption-annotated MIDI files.
MIT License

Hugging Face Dataset seems to be corrupted :( #3

Closed asigalov61 closed 2 months ago

asigalov61 commented 3 months ago

Hey @dorienh @elchico1990 @Dapwner @ismirsubmission198

I wanted to try MidiCaps today, but the dataset's JSON files seem to be corrupted. Here is the code and the traceback:

mc_dataset = load_dataset("amaai-lab/MidiCaps")

Generating train split: 168385/0 [00:02<00:00, 195311.74 examples/s]
Failed to load JSON from file '/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
ERROR:datasets.packaged_modules.json.json:Failed to load JSON from file '/root/.cache/huggingface/datasets/downloads/3307e000d26ff30aa307ac62029c7e215e9691a75780dc2714afd1493a96e2f9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    152                                 ) as f:
--> 153                                     df = pd.read_json(f, dtype_backend="pyarrow")
    154                             except ValueError:

16 frames
/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, precise_float, date_unit, encoding, encoding_errors, lines, chunksize, compression, nrows, storage_options, dtype_backend, engine)
    783     else:
--> 784         return json_reader.read()
    785 

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in read(self)
    974                 else:
--> 975                     obj = self._get_object_parser(self.data)
    976                 if self.dtype_backend is not lib.no_default:

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in _get_object_parser(self, json)
   1000         if typ == "frame":
-> 1001             obj = FrameParser(json, **kwargs).parse()
   1002 

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in parse(self)
   1133     def parse(self):
-> 1134         self._parse()
   1135 

/usr/local/lib/python3.10/dist-packages/pandas/io/json/_json.py in _parse(self)
   1319             self.obj = DataFrame(
-> 1320                 loads(json, precise_float=self.precise_float), dtype=None
   1321             )

ValueError: Trailing data

During handling of the above exception, another exception occurred:

ArrowInvalid                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1996                 _time = time.time()
-> 1997                 for _, table in generator:
   1998                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    155                                 logger.error(f"Failed to load JSON from file '{file}' with error {type(e)}: {e}")
--> 156                                 raise e
    157                             if df.columns.tolist() == [0]:

/usr/local/lib/python3.10/dist-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    129                                 try:
--> 130                                     pa_table = paj.read_json(
    131                                         io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)

/usr/local/lib/python3.10/dist-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/usr/local/lib/python3.10/dist-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: Column(/genre/[]/[]) changed from string to number in row 0

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
<ipython-input-12-661da53cceac> in <cell line: 1>()
----> 1 mc_dataset = load_dataset("amaai-lab/MidiCaps")

/usr/local/lib/python3.10/dist-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, trust_remote_code, **config_kwargs)
   2614 
   2615     # Download and prepare data
-> 2616     builder_instance.download_and_prepare(
   2617         download_config=download_config,
   2618         download_mode=download_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
   1027                         if num_proc is not None:
   1028                             prepare_split_kwargs["num_proc"] = num_proc
-> 1029                         self._download_and_prepare(
   1030                             dl_manager=dl_manager,
   1031                             verification_mode=verification_mode,

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
   1122             try:
   1123                 # Prepare split will record examples associated to the split
-> 1124                 self._prepare_split(split_generator, **prepare_split_kwargs)
   1125             except OSError as e:
   1126                 raise OSError(

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1882             job_id = 0
   1883             with pbar:
-> 1884                 for job_id, done, content in self._prepare_split_single(
   1885                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1886                 ):

/usr/local/lib/python3.10/dist-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   2038             if isinstance(e, DatasetGenerationError):
   2039                 raise
-> 2040             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   2041 
   2042         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset

I would really appreciate it if you could fix this error soon :)

Sincerely,

Alex
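
The ArrowInvalid message in the traceback above points at the path /genre/[]/[], i.e. the elements of the inner lists of the genre field, and says their type switched from string to number. The following is a minimal sketch of how one might confirm that on a local copy of the file using only the standard library. It assumes the file is in JSON Lines form (one object per line); the helper names and the sample record, including its "location" key, are hypothetical and not taken from the dataset.

```python
import json
from collections import defaultdict


def leaf_types(value, path=""):
    """Yield (path, type_name) for every scalar leaf of a decoded JSON value.

    List elements are reported under '[]', mirroring the path style used in
    the Arrow error message (e.g. '/genre/[]/[]').
    """
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_types(child, f"{path}/{key}")
    elif isinstance(value, list):
        for child in value:
            yield from leaf_types(child, f"{path}/[]")
    else:
        yield path, type(value).__name__


def find_mixed_columns(lines):
    """Return {path: sorted type names} for paths that hold more than one type."""
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        for path, type_name in leaf_types(json.loads(line)):
            seen[path].add(type_name)
    return {path: sorted(types) for path, types in seen.items() if len(types) > 1}


# Hypothetical record shaped the way the error message suggests:
# each inner list of "genre" mixes a string with a number.
sample = ['{"location": "a.mid", "genre": [["electronic", 0.9], ["pop", 0.1]]}']
print(find_mixed_columns(sample))
```

To check a real file, pass an open file handle instead of the sample list, since iterating a text file yields its lines. Note that this sketch also flags harmless mixes such as int alongside float, or null alongside a value, so the report is a starting point rather than a verdict.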

elchico1990 commented 3 months ago

Hi @asigalov61, thanks for raising this issue. We are looking into the error. In the meantime, if you need the dataset right away, please download it manually. Thanks.

asigalov61 commented 3 months ago

@elchico1990 You are welcome and I appreciate your fast response.

Where can I manually download the dataset? I tried downloading the JSON files from Hugging Face, but they also seem to be corrupted. Is there an alternative download link?

Thank you,

Alex.
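
For anyone stuck on a local copy that fails the same way, one possible workaround is to rewrite the mixed-type pairs as objects so that every column has a single type. This is only a sketch under assumptions: it supposes genre is a list of [label, score] pairs, which is inferred from the error message and not confirmed in this thread, and the function names and the "label"/"score" keys are hypothetical. It is not a description of how the maintainers fixed the file.

```python
import json


def normalize_pairs(pairs):
    """Turn [[label, score], ...] into [{"label": ..., "score": ...}, ...].

    Assumes every item is a two-element pair; raises ValueError otherwise.
    """
    return [{"label": str(label), "score": float(score)} for label, score in pairs]


def normalize_record(record, pair_fields=("genre",)):
    """Return a copy of record with each listed pair field rewritten as objects."""
    fixed = dict(record)
    for field in pair_fields:
        if field in fixed:
            fixed[field] = normalize_pairs(fixed[field])
    return fixed


def normalize_jsonl(lines, pair_fields=("genre",)):
    """Yield one normalized JSON string per non-blank input line."""
    for line in lines:
        if line.strip():
            yield json.dumps(normalize_record(json.loads(line), pair_fields))


# Hypothetical input line, shaped the way the error message suggests.
sample = ['{"genre": [["electronic", 0.9], ["pop", 0.1]]}']
fixed_lines = list(normalize_jsonl(sample))
print(fixed_lines[0])
```

Writing the yielded lines to a new file, one per line, gives a JSON Lines file in which the label and score values each keep one type, which is the property the Arrow JSON reader complained about. Other fields with the same pair layout, if any exist, could be listed in pair_fields.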

Dapwner commented 2 months ago

Hi, we updated the .json file, so the dataset should now load through load_dataset("amaai-lab/MidiCaps").

In the process, we merged all three versions of our .json files into a single file. Give it a try!

Closing the issue.

asigalov61 commented 2 months ago

@Dapwner Yes, thank you for your support! :) Everything seems to work fine now :)

I will try it on a sentence transformer implementation and let you know the results :)

Thanks again

Alex