Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

mismatch the size of datasets #318

Closed LinB203 closed 7 months ago

LinB203 commented 7 months ago

As mentioned in the paper, the MIMIC-IT dataset has 2.2M instruction QA pairs. But I have downloaded all the x_instruction.json files from Hugging Face, and the total number of instruction QA pairs is only 1,171k. Am I missing anything?

- VST: 32k image-qa
- LA: 256k image-qa
- SN: 6k image-qa
- SD: 16k image-qa
- CGD: 141k image-qa
- E4D: 527k video-qa
- DC: 56k video-qa
- TVC: 137k video-qa

In short: 451k image-qa plus 720k video-qa, which is 1,171k QA pairs in total.
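
The tally above can be double-checked with a quick sketch (counts in thousands, taken directly from the list above):

```python
# Per-subset instruction-QA counts (in thousands), from the list above.
image_qa = {"VST": 32, "LA": 256, "SN": 6, "SD": 16, "CGD": 141}
video_qa = {"E4D": 527, "DC": 56, "TVC": 137}

image_total = sum(image_qa.values())  # 451k image-qa
video_total = sum(video_qa.values())  # 720k video-qa
print(image_total, video_total, image_total + video_total)  # 451 720 1171
```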

Luodian commented 7 months ago

The E4D size is incorrect. I think it's because we have four parts and may have only uploaded the first part. Let me prepare and upload the rest of the parts accordingly. But it may take a few days since they are pretty large.

LinB203 commented 7 months ago

> The E4D size is incorrect. I think it's because we have four parts and may have only uploaded the first part. Let me prepare and upload the rest of the parts accordingly. But it may take a few days since they are pretty large.

Thanks for your patient reply.

LinB203 commented 7 months ago

> The E4D size is incorrect. I think it's because we have four parts and may have only uploaded the first part. Let me prepare and upload the rest of the parts accordingly. But it may take a few days since they are pretty large.

That would be much appreciated if it could be uploaded to Hugging Face; the OneDrive link is too unstable.

Luodian commented 7 months ago

> That would be much appreciated if it could be uploaded to Hugging Face; the OneDrive link is too unstable.

> The E4D size is incorrect. I think it's because we have four parts and may have only uploaded the first part. Let me prepare and upload the rest of the parts accordingly. But it may take a few days since they are pretty large.

Sure, let me do so. But uploading to HF is also troublesome; sometimes the git process just fails when a file exceeds 100 GB...

LinB203 commented 7 months ago

> That would be much appreciated if it could be uploaded to Hugging Face; the OneDrive link is too unstable.

> The E4D size is incorrect. I think it's because we have four parts and may have only uploaded the first part. Let me prepare and upload the rest of the parts accordingly. But it may take a few days since they are pretty large.

> Sure, let me do so. But uploading to HF is also troublesome; sometimes the git process just fails when a file exceeds 100 GB...

I think splitting into multiple parts like DC does is a good solution! Thanks again for your generous contribution.
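
The splitting idea can be sketched with a small stdlib helper (this is an illustrative sketch, not the script the maintainers used; the chunk size is an arbitrary choice, and the parts can later be re-joined by concatenating them in order):

```python
def split_file(path, chunk_bytes=1024 ** 3):
    """Split `path` into numbered parts (path.part000, path.part001, ...)
    of at most `chunk_bytes` each, and return the part paths in order."""
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_bytes)
            if not chunk:
                break
            part_path = f"{path}.part{index:03d}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts
```

Rejoining is just reading the parts back in order and concatenating the bytes (or `cat file.part* > file` on the command line).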

pufanyi commented 7 months ago

Hello! We are now uploading DC and E4D to the Hugging Face dataset; it is expected to be completed within 24 hours. You can check this code to learn how to load the dataset (instruction.json + images.parquet) and verify the integrity of the whole dataset. Thank you!

LinB203 commented 7 months ago

That's awesome!

Luodian commented 7 months ago

E4D is updated here: https://huggingface.co/datasets/pufanyi/MIMICIT/tree/main/data/E4D

LinB203 commented 7 months ago

Thanks for your great contribution.

Li-Qingyun commented 6 months ago

It seems that VST is also mismatched, although the VST parquets were updated 7 days ago.

Luodian commented 6 months ago

> It seems that VST is also mismatched, although the VST parquets were updated 7 days ago.

oh? Doesn't it have 32,893 instruction pairs?

Li-Qingyun commented 6 months ago

> It seems that VST is also mismatched, although the VST parquets were updated 7 days ago.

> Oh? Doesn't it have 32,893 instruction pairs?

I used this code to validate; the VST parquet is missing some entries.

Li-Qingyun commented 6 months ago

> It seems that VST is also mismatched, although the VST parquets were updated 7 days ago.

> Oh? Doesn't it have 32,893 instruction pairs?

Indeed, there are 32,893 samples, but 9,027 missing_image_ids.
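
The missing-ID count can be reproduced with a set difference between the IDs referenced by the instructions and the IDs actually present in the parquet shards (a sketch; the `image_ids` field name is an assumption based on the MIMIC-IT instruction format):

```python
def find_missing_image_ids(instructions, available_ids):
    """Return IDs referenced by instruction entries but absent from the image store.

    `instructions`: mapping of instruction ID -> entry dict with an "image_ids" list
    (field name assumed; check your x_instruction.json schema).
    `available_ids`: iterable of image IDs present in the parquet files.
    """
    referenced = set()
    for entry in instructions.values():
        referenced.update(entry.get("image_ids", []))
    return sorted(referenced - set(available_ids))
```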

Li-Qingyun commented 6 months ago

By the way, some sizes seem mismatched with those in the paper, which makes it inconvenient to check the integrity of the whole dataset.

Luodian commented 6 months ago

The size mismatch may come from our iteratively cleaning the dataset after submission. We will update the paper later once the numbers are fully confirmed.

Let me check the VST's missing image_ids then.

Li-Qingyun commented 6 months ago

> The size mismatch may come from our iteratively cleaning the dataset after submission. We will update the paper later once the numbers are fully confirmed.
>
> Let me check the VST's missing image_ids then.

Okay, thank you very much!

I'll also check my downloaded files again!

I've checked E4D, CGD, SD, SN, and TVC, and I'm re-downloading VST and LA to check again. LA fails to load.

Li-Qingyun commented 6 months ago

@Luodian Hi, loading the LA parquet hits an error. I've downloaded it twice. (screenshot attached)

LinB203 commented 6 months ago

> It seems that VST is also mismatched, although the VST parquets were updated 7 days ago.

> Oh? Doesn't it have 32,893 instruction pairs?

> Indeed, there are 32,893 samples, but 9,027 missing_image_ids.

I've encountered missing files as well, but as far as I can remember there were only a few (maybe a few thousand), so I've ignored them.

Luodian commented 6 months ago

> @Luodian Hi, loading the LA parquet hits an error. I've downloaded it twice.

Please take a look at this issue; loading it iteratively would address the error.

Li-Qingyun commented 6 months ago

Which issue? 555

Li-Qingyun commented 6 months ago

I used this:

```python
import logging

import dask.dataframe as dd
import pandas as pd

logger = logging.getLogger(__name__)


def load_parquet_file(file_path_list):
    image_dfs = []
    for file_path in file_path_list:
        logger.info(f"Loading parquet file: {file_path}")
        image_df = dd.read_parquet(file_path, engine="pyarrow").compute()
        image_dfs.append(image_df)
    logger.info(f"Concatenating parquet files: {file_path_list}")
    image_df = pd.concat(image_dfs)
    return image_df
```

Luodian commented 6 months ago

This: https://github.com/Luodian/Otter/issues/320

Li-Qingyun commented 6 months ago

> This: #320

Okkkk, I see.

Li-Qingyun commented 6 months ago

@Luodian Thanks very, very much, LA is okay now. Only VST still lacks some samples.

Again, thanks for your brilliant work, the repo, and your help, which has taught me a lot!