LAION-AI / Open-Instruction-Generalist

Open Instruction Generalist is an assistant trained on a massive collection of synthetic instructions to perform many millions of tasks.
Apache License 2.0

huggingface dataset load OIG failed due to data column issue #9

Closed. sufengniu closed this issue 1 year ago.

sufengniu commented 1 year ago

Hello, I tried to load the OIG dataset via oig_data = load_dataset("laion/OIG"); however, the code raises an error while loading the data:

ValueError: Couldn't cast
text: string
to
{'text': Value(dtype='string', id=None), 'meta': {'source': Value(dtype='string', id=None)}}
because column names don't match

I guess some columns may not match across the underlying files. Has anyone else encountered this error, and do you know how to work around it? Thank you.

sufengniu commented 1 year ago

I found I was using the Hugging Face datasets command incorrectly. The following version works: first download the data locally, then oig_data = load_dataset("json", data_files="your/data/dir")
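
For reference, a minimal sketch of that workaround (the local path and file name below are placeholders; loading one file at a time also avoids casting files with different columns into a single schema):

from datasets import load_dataset

# hypothetical local path to one previously downloaded OIG shard
oig_data = load_dataset("json", data_files="your/data/dir/unified_chip2.jsonl", split="train")
print(oig_data[0])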

sufengniu commented 1 year ago

I would like to re-open the issue, since even the above approach eventually hits this error:

ValueError: Couldn't cast
text: string
metadata: struct<source: string>
  child 0, source: string
to
{'text': Value(dtype='string', id=None)}
because column names don't match
ari9dam commented 1 year ago

I'm facing the same issue. Did you figure it out?

ari9dam commented 1 year ago

This works:

import datasets

ds = datasets.load_dataset("laion/OIG", use_auth_token=True, streaming=True)
# iterate over the streamed "train" split rather than the DatasetDict itself
for x in ds["train"]:
    print(x)
    break

huu4ontocord commented 1 year ago

Hi all - I am pushing new versions of the datasets with the metadata field, so all files should have both a text and a metadata field. If there are still problems please let me know, and in particular which files are affected. As an aside, if you load using HF datasets, it may try to load all of the jsonl files, which you may not want. Instead, you may wish to download individual files and load them via the json method.
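
For anyone following that route, here is a minimal sketch (the file name is only an example; pick whichever OIG files you need):

from huggingface_hub import hf_hub_download
from datasets import load_dataset

# fetch a single file from the laion/OIG dataset repo instead of the whole dataset
path = hf_hub_download(repo_id="laion/OIG", filename="unified_chip2.jsonl", repo_type="dataset")
ds = load_dataset("json", data_files=path, split="train")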

ari9dam commented 1 year ago

The issue still persists. I tried to download unified_flan.jsonl.gz, but the data does not seem to be in .gz format. datasets.load_dataset("json", data_files=["unified_flan.jsonl.gz"]) fails with DatasetGenerationError: An error occurred while generating the dataset, and gunzip unified_flan.jsonl.gz complains that the data is not in .gz format.

Running file unified_flan.jsonl.gz outputs unified_flan.jsonl.gz: HTML document, UTF-8 Unicode text, with very long lines. How do I extract it?
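
Since file reports an HTML document, the wget download may have fetched the file's web page rather than the raw data. A quick check (a hedged sketch; the path is simply whatever you downloaded) is to look at the first two bytes, since gzip files start with 0x1f 0x8b, and if it is not gzip, re-download through the Hub client:

# check whether the download is actually gzip data
with open("unified_flan.jsonl.gz", "rb") as f:
    magic = f.read(2)
print("gzip" if magic == b"\x1f\x8b" else "not gzip (likely an HTML page)")

# re-download the raw file via the Hub client (assumes the file exists in laion/OIG)
from huggingface_hub import hf_hub_download
path = hf_hub_download(repo_id="laion/OIG", filename="unified_flan.jsonl.gz", repo_type="dataset")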

sufengniu commented 1 year ago

I still see a similar error; after updating to the new HF datasets, the error changed to:

Generating train split: 4046401 examples [03:53, 4568.92 examples/s]
Failed to read file '/scratch1/jiajinn/cache/huggingface/datasets/downloads/93711c64efd043b64d8ef390c31794f36218c1678bb107e8c7e6057b123d2ef9' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Missing a closing quotation mark in string. in row 37

ArrowInvalid: JSON parse error: Missing a closing quotation mark in string. in row 37

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)

My guess is that something might be malformed in one of the data rows?
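
If it helps to narrow this down, a small sketch like the one below can scan a downloaded jsonl shard line by line and report which rows fail to parse (the path is hypothetical, and note that pyarrow's "row 37" is relative to the block it was reading, not necessarily the global line number):

import json

path = "path/to/shard.jsonl"  # hypothetical local path to the failing file
with open(path, encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            json.loads(line)
        except json.JSONDecodeError as err:
            print(f"line {lineno}: {err}")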

ari9dam commented 1 year ago

I am also using the latest HF datasets. I downloaded only unified_flan.jsonl.gz with wget and was loading that: ds = datasets.load_dataset("json", data_files=["unified_flan.jsonl.gz"], cache_dir="/mnt/huggingface_cache/"). The complete error I got was:

Failed to read file 'unified_flan.jsonl.gz' with error <class 'pyarrow.lib.ArrowInvalid'>: JSON parse error: Invalid value. in row 0

JSONDecodeError                           Traceback (most recent call last)
/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    151                     with open(file, encoding="utf-8") as f:
--> 152                         dataset = json.load(f)
    153                 except json.JSONDecodeError:

/anaconda/envs/nlp/lib/python3.7/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    295         parse_float=parse_float, parse_int=parse_int,
--> 296         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    297

/anaconda/envs/nlp/lib/python3.7/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    347             parse_constant is None and object_pairs_hook is None and not kw):
--> 348         return _default_decoder.decode(s)
    349     if cls is None:

/anaconda/envs/nlp/lib/python3.7/json/decoder.py in decode(self, s, _w)
    336         """
--> 337         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338         end = _w(s, end).end()

/anaconda/envs/nlp/lib/python3.7/json/decoder.py in raw_decode(self, s, idx)
    354         except StopIteration as err:
--> 355             raise JSONDecodeError("Expecting value", s, err.value) from None
    356         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

ArrowInvalid                              Traceback (most recent call last)
/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1859                 _time = time.time()
-> 1860                 for _, table in generator:
   1861                     if max_shard_size is not None and writer._num_bytes > max_shard_size:

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    154                         logger.error(f"Failed to read file '{file}' with error {type(e)}: {e}")
--> 155                         raise e
    156                 # If possible, parse the file as a list of json objects and exit the loop

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/packaged_modules/json/json.py in _generate_tables(self, files)
    131                             pa_table = paj.read_json(
--> 132                                 io.BytesIO(batch), read_options=paj.ReadOptions(block_size=block_size)
    133                             )

/anaconda/envs/nlp/lib/python3.7/site-packages/pyarrow/_json.pyx in pyarrow._json.read_json()

/anaconda/envs/nlp/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.pyarrow_internal_check_status()

/anaconda/envs/nlp/lib/python3.7/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

ArrowInvalid: JSON parse error: Invalid value. in row 0

The above exception was the direct cause of the following exception:

DatasetGenerationError                    Traceback (most recent call last)
/tmp/ipykernel_47840/842076803.py in <module>
----> 1 ds = datasets.load_dataset("json", data_files=["unified_flan.jsonl.gz"], cache_dir="/mnt/huggingface_cache/")

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/load.py in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, use_auth_token, task, streaming, num_proc, **config_kwargs)
   1785             verification_mode=verification_mode,
   1786             try_from_hf_gcs=try_from_hf_gcs,
-> 1787             num_proc=num_proc,
   1788         )
   1789

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in download_and_prepare(self, output_dir, download_config, download_mode, verification_mode, ignore_verifications, try_from_hf_gcs, dl_manager, base_path, use_auth_token, file_format, max_shard_size, num_proc, storage_options, **download_and_prepare_kwargs)
    874             verification_mode=verification_mode,
    875             **prepare_split_kwargs,
--> 876             **download_and_prepare_kwargs,
    877         )
    878         # Sync info

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _download_and_prepare(self, dl_manager, verification_mode, **prepare_split_kwargs)
    965             try:
    966                 # Prepare split will record examples associated to the split
--> 967                 self._prepare_split(split_generator, **prepare_split_kwargs)
    968             except OSError as e:
    969                 raise OSError(

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _prepare_split(self, split_generator, file_format, num_proc, max_shard_size)
   1748         with pbar:
   1749             for job_id, done, content in self._prepare_split_single(
-> 1750                 gen_kwargs=gen_kwargs, job_id=job_id, **_prepare_split_args
   1751             ):
   1752                 if done:

/anaconda/envs/nlp/lib/python3.7/site-packages/datasets/builder.py in _prepare_split_single(self, gen_kwargs, fpath, file_format, max_shard_size, job_id)
   1890             if isinstance(e, SchemaInferenceError) and e.context is not None:
   1891                 e = e.context
-> 1892             raise DatasetGenerationError("An error occurred while generating the dataset") from e
   1893
   1894         yield job_id, True, (total_num_examples, total_num_bytes, writer._features, num_shards, shard_lengths)

DatasetGenerationError: An error occurred while generating the dataset