Closed: sufengniu closed this issue 1 year ago.
I found I was using the Hugging Face datasets command incorrectly. The following is a working version that first downloads the data locally and then loads it:
oig_data = load_dataset("json", data_files="your/data/dir")
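A slightly fuller sketch of the same approach, assuming the OIG .jsonl files have already been downloaded into a local folder (the glob path below is a placeholder, not a real location):

from datasets import load_dataset

# Point the generic "json" builder at the locally downloaded files;
# data_files also accepts glob patterns, so one pattern covers all of them.
oig_data = load_dataset("json", data_files="path/to/OIG/*.jsonl")
print(oig_data["train"][0])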
I would like to re-open the issue, since I found that even the above approach eventually hits this error:
ValueError: Couldn't cast
text: string
metadata: struct<source: string>
child 0, source: string
to
{'text': Value(dtype='string', id=None)}
because column names don't match
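One possible workaround, sketched below under the assumption that the mismatch comes from some files carrying a metadata struct column while others only have text: load the files one at a time, drop the extra columns, and concatenate. The file names are placeholders.

from datasets import load_dataset, concatenate_datasets

files = ["unified_chip2.jsonl", "unified_flan.jsonl"]  # placeholder file names
parts = []
for f in files:
    ds = load_dataset("json", data_files=f, split="train")
    # Keep only the shared "text" column so every part ends up with the same schema.
    extra = [c for c in ds.column_names if c != "text"]
    parts.append(ds.remove_columns(extra))
oig_data = concatenate_datasets(parts)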
I'm facing the same issue. Did you figure it out?
This works:
ds = datasets.load_dataset("laion/OIG", use_auth_token=True, streaming=True)
for x in ds:
    print(x)
    break
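A variant of the same streaming call, assuming the examples live under the train split, that iterates records directly instead of the dict of splits:

import datasets

ds = datasets.load_dataset("laion/OIG", use_auth_token=True, streaming=True)
for example in ds["train"]:  # iterate the streaming split itself, not the split names
    print(example)
    break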
Hi all - I am pushing new versions of the datasets with the metadata field. Also files should have both a text and metadata field. If there are still problems please let me know, and in particular which files. As an aside, if you load using HF datasets, it might try to load all of the jsonl files, which you may not want. Instead, you may wish to download individual files and load via the json method.
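A hedged sketch of that suggestion, using the Hub's raw resolve URL so only one file is fetched (the specific file chosen here is just an example):

from datasets import load_dataset

# Load a single OIG file through the generic "json" builder; any other
# OIG jsonl file name can be substituted in the URL below.
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
ds = load_dataset("json", data_files=url, split="train", use_auth_token=True)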
The issue still persists.
I tried to download unified_flan.jsonl.gz:
It seems the data is not in .gz format.
datasets.load_dataset("json", data_files=["unified_flan.jsonl.gz"])
fails with DatasetGenerationError: An error occurred while generating the dataset
gunzip unified_flan.jsonl.gz complains that the data is not in gzip format.
file unified_flan.jsonl.gz
outputs
unified_flan.jsonl.gz: HTML document, UTF-8 Unicode text, with very long lines
How do I extract it?
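The file output above suggests the download was an HTML page (for example the repository's web view) rather than the raw LFS object, so there is nothing to gunzip. A minimal sketch, assuming the file on the Hub really is unified_flan.jsonl.gz, that fetches the actual LFS file via huggingface_hub and loads it (datasets decompresses .gz transparently):

from huggingface_hub import hf_hub_download
from datasets import load_dataset

# Resolve the LFS pointer and download the real gzip file into the local cache.
path = hf_hub_download(repo_id="laion/OIG", filename="unified_flan.jsonl.gz", repo_type="dataset")
ds = load_dataset("json", data_files=path, split="train")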
I still get a similar error, but after switching to the new HF datasets the error changed to:
Generating train split: 4046401 examples [03:53, 4568.92 examples/s]
Failed to read file '/scratch1/jiajinn/cache/huggingface/datasets/downloads/93711c64efd043b64d8ef390c31794f36218c1678bb107e8c7e6057b123d2ef9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a closing quotation mark in string. in row 37
ArrowInvalid: JSON parse error: Missing a closing quotation mark in string. in row 37
The above exception was the direct cause of the following exception:
DatasetGenerationError Traceback (most recent call last)
My guess is that something might be wrong in one of the data columns?
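One way to check that guess, sketched under the assumption that the file is plain JSON Lines with one object per line (the file name is a placeholder):

import gzip
import json

bad_rows = []
with gzip.open("unified_flan.jsonl.gz", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as e:
            bad_rows.append((i, str(e)))
print(bad_rows[:10])  # inspect the first few offending rows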
I am also using the latest HF datasets. I downloaded only unified_flan.jsonl.gz with wget and was loading that.
ds = datasets.load_dataset("json", data_files=["unified_flan.jsonl.gz"], cache_dir="/mnt/huggingface_cache/")
The complete error I got was:
Failed to read file 'unified_flan.jsonl.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0
JSONDecodeError                           Traceback (most recent call last)
/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    151                     with open(file, encoding="utf-8") as f:
--> 152                         dataset = json.load(f)
    153             except json.JSONDecodeError:

/anaconda/envs/nlp/lib/python3.7/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    295         parse_float=parse_float, parse_int=parse_int,
--> 296         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    297

/anaconda/envs/nlp/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:

/anaconda/envs/nlp/lib/python3.7/json/decoder.py in decode(self, s, _w)
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()

/anaconda/envs/nlp/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
During handling of the above exception, another exception occurred:
ArrowInvalid                              Traceback (most recent call last)
/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1859                 _time = time.time()
-> 1860                 for _, table in generator:
   1861                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    154                 logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
--> 155                 raise e
    156             # If possible, parse the file as a list of json objects and exit the loop

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    131                         pa_table = paj.read_json(
--> 132                             io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
    133                         )
/anaconda/envs/nlp/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()
/anaconda/envs/nlp/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()
/anaconda/envs/nlp/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()
ArrowInvalid: JSON parse error: Invalid value. in row 0
The above exception was the direct cause of the following exception:
DatasetGenerationError                    Traceback (most recent call last)
/tmp/ipykernel_47840/842076803.py in <module>
----> 1 ds = datasets.load_dataset("json", data_files=["unified_flan.jsonl.gz"], cache_dir="/mnt/huggingface_cache/")

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, **config_kwargs)
   1785         verification_mode=verification_mode,
   1786         try_from_hf_gcs=try_from_hf_gcs,
-> 1787         num_proc=num_proc,
   1788     )
   1789

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    874                         verification_mode=verification_mode,
    875                         **prepare_split_kwargs,
--> 876                         **download_and_prepare_kwargs,
    877                     )
    878             # Sync info

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    965             try:
    966                 # Prepare split will record examples associated to the split
--> 967                 self._prepare_split(split_generator, **prepare_split_kwargs)
    968             except OSError as e:
    969                 raise OSError(

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1748             with pbar:
   1749                 for job_id, done, content in self._prepare_split_single(
-> 1750                     gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1751                 ):
   1752                     if done:

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1890             if isinstance(e, SchemaInferenceError) and e.context is not None:
   1891                 e = e.context
-> 1892             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1893
   1894         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)
DatasetGenerationError: An error occurred while generating the dataset
Hello, I tried to load the OIG dataset via:
oig_data = load_dataset("laion/OIG")
however, the code throws an error while loading the data. I guess maybe some columns do not match across the different datasets. Has anyone else encountered this error, and do you know how to overcome it? Thank you.