HaozheZhao / MIC

MMICL, a state-of-the-art VLM with in-context learning (ICL) ability, PKU
335 stars 15 forks

failed in datasets.load_dataset #14

Open kq-chen opened 1 year ago

kq-chen commented 1 year ago

Thank you for your excellent work! I'd like to express my gratitude for your efforts in contributing open-source data and models. I ran into a minor issue when loading the dataset from Hugging Face. When I ran the following code:

import datasets

datasets.logging.set_verbosity(datasets.logging.INFO)
ds = datasets.load_dataset('BleachNick/MIC_full')

I received the following error message:

Resolving data files: 100%|██████████| 19/19 [00:00<00:00, 42.63it/s]
Using custom data configuration default-125ac711f175b504
Loading Dataset Infos from D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\packaged_modules\json
Generating dataset mic_full (C:/Users/ckq/.cache/huggingface/datasets/BleachNick___mic_full/default-125ac711f175b504/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96)
Downloading and preparing dataset mic_full/default to C:/Users/ckq/.cache/huggingface/datasets/BleachNick___mic_full/default-125ac711f175b504/0.0.0/8bb11242116d547c741b2e8a1f18598ffdd40a1d4f2a2872c7a28b697434bc96...
Dataset not on Hf google storage. Downloading and preparing it from source
Downloading data files: 100%|██████████| 3/3 [00:00<00:00, 35.29it/s]
Downloading took 0.0 min
Checksum Computation took 0.0 min
Extracting data files: 100%|██████████| 3/3 [00:00<00:00, 35.94it/s]
Generating train split
Generating train split: 617211 examples [00:01, 370867.67 examples/s]
Traceback (most recent call last):
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\arrow_writer.py", line 572, in write_table
    pa_table = table_cast(pa_table, self._schema)
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\table.py", line 2328, in table_cast
    return cast_table_to_schema(table, schema)
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\table.py", line 2287, in cast_table_to_schema
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\table.py", line 2287, in <listcomp>
    arrays = [cast_array_to_feature(table[name], feature) for name, feature in features.items()]
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\table.py", line 1831, in wrapper
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\table.py", line 1831, in <listcomp>
    return pa.chunked_array([func(chunk, *args, **kwargs) for chunk in array.chunks])
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\table.py", line 2144, in cast_array_to_feature
    raise TypeError(f"Couldn't cast array of type\n{array.type}\nto\n{feature}")
TypeError: Couldn't cast array of type
string
to
Sequence(feature=Value(dtype='string', id=None), length=-1, id=None)
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\IPython\core\interactiveshell.py", line 3505, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-2-cb0601f63add>", line 1, in <module>
    runfile('D:\\home\\code\\minimal-working-example\\src\\mwe_datasets\\main_mic_full.py', wdir='D:\\home\\code\\minimal-working-example\\src\\mwe_datasets')
  File "D:\Program Files\JetBrains\PyCharm 2023.2.1\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "D:\Program Files\JetBrains\PyCharm 2023.2.1\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "D:\home\code\minimal-working-example\src\mwe_datasets\main_mic_full.py", line 4, in <module>
    ds = datasets.load_dataset('BleachNick/MIC_full')
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "D:\ProgramData\anaconda3\envs\meft\lib\site-packages\datasets\builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

Is there any way to resolve this issue or get more information on how to handle it? Your assistance would be greatly appreciated. Thank you!

kq-chen commented 1 year ago

I suspect it may be related to the format of certain JSONL files. It seems that input_image is expected to be a list of strings, but in some annotation files it is a plain string.
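The mismatch can be seen with two hypothetical records (the paths below are illustrative, not copied from the actual files): the JSON builder infers one Arrow schema for the whole split, so a plain string cannot be cast to the `Sequence(string)` feature inferred from the other files.

```python
import json

# Two hypothetical JSONL records illustrating the mismatch: image tasks store
# input_image as a list of paths, while the video tasks store a single path string.
list_row = json.loads('{"input_image": ["./data/coco/img1.jpg"]}')
string_row = json.loads('{"input_image": "./data/msrvtt/video1.mp4"}')

# Mixing these two shapes in one split triggers the cast error above.
print(type(list_row["input_image"]).__name__)    # list
print(type(string_row["input_image"]).__name__)  # str
```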

Some code to check the annotation files:

# Download the dataset first:
#   huggingface_hub.snapshot_download('BleachNick/MIC_full', repo_type='dataset')
import json
import os.path as osp
from glob import glob
from pathlib import Path

data_jsonl_root = Path(r'datasets--BleachNick--MIC_full/snapshots/499162c4f0a3f919f0a417918d71aab51280db84/data_jsonl')
for file in glob(str(data_jsonl_root / '**/*.jsonl'), recursive=True):
    with open(file, 'r', encoding='utf-8') as f:
        # Only the first record of each file is inspected.
        obj = json.loads(f.readline())
        if not isinstance(obj['input_image'], list):
            print(f"{osp.relpath(file, data_jsonl_root)} input_image: {obj['input_image']}")

and I got this output:

video_captioning/msrvtt/test.jsonl input_image: ./data/msrvtt/TestVideo/video7960.mp4
video_captioning/msrvtt/train.jsonl input_image: ./data/msrvtt/TrainValVideo/video6315.mp4
video_captioning/msrvtt/val.jsonl input_image: ./data/msrvtt/TrainValVideo/video6968.mp4
video_qa/ivqa/test.jsonl input_image: ./data/ivqa/howto100mqa_nointersec_uniform_sampled/DlUvSkgMaLY_46_60.webm
video_qa/ivqa/train.jsonl input_image: ./data/ivqa/howto100mqa_nointersec_uniform_sampled/Pbwim2GdyNg_187_213.webm
video_qa/ivqa/val.jsonl input_image: ./data/ivqa/howto100mqa_nointersec_uniform_sampled/S5LTrh8v0N4_402_422.webm
video_qa/msrvttqa/test.jsonl input_image: ./data/msrvtt/TestVideo/video7010.mp4
video_qa/msrvttqa/train.jsonl input_image: ./data/msrvtt/TrainValVideo/video4321.mp4
video_qa/msrvttqa/val.jsonl input_image: ./data/msrvtt/TrainValVideo/video6513.mp4
video_qa/mvsd/test.jsonl input_image: ./data/mvsd/video/jfrrO5K_vKM_55_65.avi
video_qa/mvsd/train.jsonl input_image: ./data/mvsd/video/4PcL6-mjRNk_11_18.avi
video_qa/mvsd/val.jsonl input_image: ./data/mvsd/video/bQJQGoJF7_k_162_169.avi
visual_dialog/llava/train.jsonl input_image: ./data/coco/train2014/COCO_train2014_/000000197959.jpg
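A possible workaround (a hedged sketch, not an official fix) is to normalize the downloaded JSONL files in place before loading, wrapping any bare-string `input_image` into a one-element list so every file matches the inferred schema. `normalize_jsonl` is a hypothetical helper name:

```python
import json
from pathlib import Path

def normalize_jsonl(path: Path) -> int:
    """Wrap string-valued `input_image` fields in a one-element list.

    Returns the number of rows that were changed.
    """
    changed = 0
    lines = []
    for line in path.read_text(encoding='utf-8').splitlines():
        obj = json.loads(line)
        if isinstance(obj.get('input_image'), str):
            obj['input_image'] = [obj['input_image']]
            changed += 1
        lines.append(json.dumps(obj, ensure_ascii=False))
    if changed:
        path.write_text('\n'.join(lines) + '\n', encoding='utf-8')
    return changed

# Usage sketch (the snapshot path is hypothetical; adjust to your cache):
# root = Path('datasets--BleachNick--MIC_full/snapshots/<hash>/data_jsonl')
# for f in root.rglob('*.jsonl'):
#     n = normalize_jsonl(f)
#     if n:
#         print(f'{f}: fixed {n} rows')
```

After normalizing, loading the local files with the `json` builder should no longer hit the cast error; whether the video entries should instead carry a different field name is something the maintainers would need to confirm.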