LAION-AI / Open-Instruction-Generalist

Open Instruction Generalist is an assistant trained on massive synthetic instructions to perform many millions of tasks
Apache License 2.0

Issue loading OIG from Huggingface Hub #11

Closed conceptofmind closed 4 months ago

conceptofmind commented 1 year ago

Hi all,

Thanks for the awesome work.

I am receiving this error when trying to load the OIG dataset from Huggingface:

    raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 2 column 1 (char 1302)

During handling of the above exception, another exception occurred:

    pa_table = paj.read_json(
  File "pyarrow/_json.pyx", line 259, in pyarrow._json.read_json
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: JSON parse error: Missing a comma or '}' after an object member. in row 10

The above exception was the direct cause of the following exception:

    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Any input would be greatly appreciated.

Thank you,

Enrico

huu4ontocord commented 1 year ago

Can you tell me which sub dataset? What was the exact code you ran?

conceptofmind commented 1 year ago

> Can you tell me which sub dataset? What was the exact code you ran?

Hi,

The error did not clearly indicate which subset was causing the issue. From some testing my peers and I did, unified_p3.jsonl.gz seems to be part of the problem, but errors still occur even after removing that file.

The dataset is from the Huggingface hub: https://huggingface.co/datasets/laion/OIG/tree/main

The code is pretty standard:

from datasets import load_dataset
dataset = load_dataset('laion/OIG', split='train')

Thank you,

Enrico

conceptofmind commented 1 year ago

This error also occurs if you remove unified_p3:

    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
struct<labels: list<item: string>, source: string>
to
{'source': Value(dtype='string', id=None)}

The above exception was the direct cause of the following exception:

    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

So there may be a few different files that have issues.
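
The TypeError above suggests that some records carry a nested struct with both `labels` and `source`, while the schema inferred from earlier records only has `source`. A rough sketch of how one might check which key sets appear under a nested field is below. The field name `metadata` is an assumption (the traceback does not name the field), and `field_key_signatures` is a hypothetical helper, not part of `datasets`.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def field_key_signatures(
    records: Iterable[Dict[str, Any]],
    field: str = "metadata",  # assumed field name; not confirmed by the traceback
) -> Counter:
    """Count the distinct sets of keys found under a nested dict field.

    More than one signature means the records do not share one schema,
    which is the kind of mismatch that can trigger a cast error.
    """
    signatures: Counter = Counter()
    for record in records:
        value = record.get(field)
        if isinstance(value, dict):
            # Sort the keys so that key order does not create extra signatures.
            signature = tuple(sorted(value))
        else:
            # Record the type of a missing or non-dict value instead.
            signature = ("<" + type(value).__name__ + ">",)
        signatures[signature] += 1
    return signatures


# Illustrative records only; these are not taken from the OIG files.
sample = [
    {"text": "a", "metadata": {"source": "x"}},
    {"text": "b", "metadata": {"source": "y", "labels": ["l"]}},
]
print(field_key_signatures(sample))
```

Running this per file and comparing the resulting signatures would show which files deviate from the rest.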

huu4ontocord commented 1 year ago

OK, thank you. We will check p3 and see if we can track it down. p3 is huge, so I actually don't use load_dataset; I load it using json.
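
For anyone wanting to try the json route, here is a minimal sketch, assuming a locally downloaded `.jsonl.gz` file with one JSON object per line. `iter_jsonl_gz` is a hypothetical helper written for illustration and is not necessarily how the maintainers load the data. It streams the file rather than holding it in memory, and it reports lines that fail to parse, which may help locate rows like the one pyarrow complained about.

```python
import gzip
import json
from typing import Any, Iterator, Optional, Tuple


def iter_jsonl_gz(path: str) -> Iterator[Tuple[int, Optional[Any], Optional[str]]]:
    """Stream a gzipped JSON Lines file.

    Yields (line_number, record, error) tuples. For a line that parses,
    error is None; for a malformed line, record is None and error holds
    the decoder's message. Blank lines are skipped.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue
            try:
                yield line_number, json.loads(line), None
            except json.JSONDecodeError as err:
                yield line_number, None, str(err)


def report_bad_lines(path: str) -> int:
    """Print every malformed line in the file and return how many were found."""
    bad = 0
    for line_number, record, error in iter_jsonl_gz(path):
        if error is not None:
            bad += 1
            print(f"{path}: line {line_number}: {error}")
    return bad
```

Calling `report_bad_lines("unified_p3.jsonl.gz")` on a local copy would list any lines the standard json module cannot parse. The file path here is only an example.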