Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License
3.56k stars, 241 forks

CGD_instructions references images that aren't there in the parquets #329

Open StrangeTcy opened 8 months ago

StrangeTcy commented 8 months ago

The file CGD_instructions.json under https://huggingface.co/datasets/pufanyi/MIMICIT/tree/main/data/CGD/ has entries with references like "image_ids": ["CGD_IMG_0000000000069568", "CGD_IMG_000000328270"], "rel_ins_ids": ["CGD_INS_000001"], but https://huggingface.co/datasets/pufanyi/MIMICIT/tree/main/data/CGD/CGD_frames.parquet apparently doesn't have any CGD_INS_ rows. Granted, there are newer folders like https://huggingface.co/datasets/pufanyi/MIMICIT/tree/main/CGD, but these contain only parquets and no JSON files with instructions.
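A mismatch like this can be checked mechanically. Below is a minimal sketch, assuming the instruction entries have already been read into a dict keyed by instruction ID (each with an "image_ids" list) and that the IDs stored in the parquet are available as a plain list. The function name `find_missing_image_ids`, the key "CGD_INS_000000", and the sample data are hypothetical and only illustrate the check; the real file layout may differ.

```python
import json


def find_missing_image_ids(instructions, available_ids):
    """Map each instruction ID to the image IDs it references
    that do not appear in available_ids."""
    available = set(available_ids)
    missing = {}
    for ins_id, entry in instructions.items():
        # Entries without an "image_ids" field are treated as referencing nothing.
        absent = [img for img in entry.get("image_ids", []) if img not in available]
        if absent:
            missing[ins_id] = absent
    return missing


# Hypothetical sample shaped like the snippet quoted above; the key
# "CGD_INS_000000" and the list of frame IDs are assumptions for illustration.
sample_instructions = {
    "CGD_INS_000000": {
        "image_ids": ["CGD_IMG_0000000000069568", "CGD_IMG_000000328270"],
        "rel_ins_ids": ["CGD_INS_000001"],
    },
}
sample_frame_ids = ["CGD_IMG_000000328270"]

report = find_missing_image_ids(sample_instructions, sample_frame_ids)
print(json.dumps(report, indent=2))
```

Running the same function over the full instruction file and the full set of IDs from the parquet would give a concrete list of dangling references to attach to the report.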

Luodian commented 8 months ago

We changed the structure of the HF dataset to parquet files (each one holds a DataFrame object), as suggested by HF staff, so that the dataset is visible to viewers.

It should be accessed via:

from datasets import load_dataset
data = load_dataset("pufanyi/MIMICIT", "CGD", use_auth_token=True)

@pufanyi Would you mind addressing this question in more detail?