Open kq-chen opened 1 year ago
I suspect it may be related to the format of certain JSONL files. It seems that input_image is expected to be a list of strings, but in some annotation files, it is marked as a single string.
Here is some code to check the annotation files:
# Download the dataset first with:
# huggingface_hub.snapshot_download('BleachNick/MIC_full', repo_type='dataset')
from glob import glob
import os.path as osp
from pathlib import Path
import json

data_jsonl_root = Path(r'datasets--BleachNick--MIC_full/snapshots/499162c4f0a3f919f0a417918d71aab51280db84/data_jsonl')
for file in glob(str(data_jsonl_root / '**/*.jsonl'), recursive=True):
    with open(file, 'r') as f:
        for line in f:
            obj = json.loads(line)
            # Report the first record whose input_image is not a list,
            # then move on to the next file.
            if not isinstance(obj['input_image'], list):
                print(f"{osp.relpath(file, data_jsonl_root)} input_image: {obj['input_image']}")
                break
Running it gave the following output:
video_captioning/msrvtt/test.jsonl input_image: ./data/msrvtt/TestVideo/video7960.mp4
video_captioning/msrvtt/train.jsonl input_image: ./data/msrvtt/TrainValVideo/video6315.mp4
video_captioning/msrvtt/val.jsonl input_image: ./data/msrvtt/TrainValVideo/video6968.mp4
video_qa/ivqa/test.jsonl input_image: ./data/ivqa/howto100mqa_nointersec_uniform_sampled/DlUvSkgMaLY_46_60.webm
video_qa/ivqa/train.jsonl input_image: ./data/ivqa/howto100mqa_nointersec_uniform_sampled/Pbwim2GdyNg_187_213.webm
video_qa/ivqa/val.jsonl input_image: ./data/ivqa/howto100mqa_nointersec_uniform_sampled/S5LTrh8v0N4_402_422.webm
video_qa/msrvttqa/test.jsonl input_image: ./data/msrvtt/TestVideo/video7010.mp4
video_qa/msrvttqa/train.jsonl input_image: ./data/msrvtt/TrainValVideo/video4321.mp4
video_qa/msrvttqa/val.jsonl input_image: ./data/msrvtt/TrainValVideo/video6513.mp4
video_qa/mvsd/test.jsonl input_image: ./data/mvsd/video/jfrrO5K_vKM_55_65.avi
video_qa/mvsd/train.jsonl input_image: ./data/mvsd/video/4PcL6-mjRNk_11_18.avi
video_qa/mvsd/val.jsonl input_image: ./data/mvsd/video/bQJQGoJF7_k_162_169.avi
visual_dialog/llava/train.jsonl input_image: ./data/coco/train2014/COCO_train2014_/000000197959.jpg
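If the mixed types are indeed the cause, one possible workaround is to rewrite the affected files so that input_image is always a list. The following is only a minimal sketch, not a confirmed fix: it assumes that a bare string stands for a single image or video path, and the helper names normalize_input_image and normalize_jsonl are hypothetical, not part of the dataset's tooling.

```python
import json
from pathlib import Path


def normalize_input_image(obj: dict) -> dict:
    """Return obj with 'input_image' as a list; obj itself is returned if unchanged."""
    value = obj.get('input_image')
    if isinstance(value, str):
        # Assumption: a bare string means exactly one image or video path.
        return {**obj, 'input_image': [value]}
    return obj


def normalize_jsonl(src: Path, dst: Path) -> int:
    """Write a normalized copy of src to dst and return how many records changed."""
    changed = 0
    with open(src, 'r', encoding='utf-8') as fin, \
            open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            obj = json.loads(line)
            fixed = normalize_input_image(obj)
            if fixed is not obj:
                changed += 1
            fout.write(json.dumps(fixed, ensure_ascii=False) + '\n')
    return changed
```

It could be applied to each file reported above, for example normalize_jsonl(Path('video_qa/ivqa/train.jsonl'), Path('train.fixed.jsonl')), where the output path is a placeholder. Writing to a separate file keeps the original snapshot untouched in case the string form turns out to be intentional.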
Thank you for your excellent work, and for your contributions to open-source data and models! I ran into a minor issue when loading a dataset from Hugging Face. When I used the following code
I received the following error message:
Is there any way to resolve this issue or get more information on how to handle it? Your assistance would be greatly appreciated. Thank you!