Data issues - Githubissues

zcczhang commented 1 year ago

Hi, thanks for the amazing work and for releasing MIMIC-IT! There seem to be a few issues:

For LLaVA-In-Context, the meta link here seems to be missing; I assume it is supposed to point to the LAxxx_train.json files? I may be misunderstanding, but it seems that the code here does not strip the LAxx_INS prefix (e.g. with cur_image_id.split('_')[-1] for LACONV, LACR_I2I, etc.), so the LAxx_INS_ prefix is unexpectedly included when reading COCO images. There are also some cases with keys like coco/train2017/000000033471_2.jpg, for which no _2 image can be found.
For TV caption, the image ids in TVC_instructions.json do not seem to correspond to the ids in the converted TVC.json. There are repeated patterns, e.g. TVC_IMG_castle_s07e09_seg02_clip_02_castle_s07e09_seg02_clip_02_00009 or TVC_IMG_s04e13_seg01_clip_00_bbt_s04e13_seg01, so the ids need to be rekeyed with r'(TVC_IMG)_(.+?_clip_[0-9]+)_(.+?_clip_[0-9]+)_([0-9]+)' in both cases.
For spot-the-difference, the [:5] here is probably unintended; otherwise only 5 examples are used?
There is a typo here; it seems it should be video.VisualStoryTelling.
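The two id fixes mentioned above could be sketched roughly as follows. This is only an illustration, not the repository's code: the function names are made up, the choice to keep the second clip name when rekeying is an assumption (the thread does not state the target key format), and the LLaVA key in the usage note is a hypothetical example of the LAxx_INS_ prefix.

```python
import re

# Pattern quoted in the comment above for the duplicated TVC image ids.
TVC_ID_PATTERN = re.compile(
    r'(TVC_IMG)_(.+?_clip_[0-9]+)_(.+?_clip_[0-9]+)_([0-9]+)'
)


def rekey_tvc_image_id(image_id):
    """Return a de-duplicated TVC id, or None if the id does not match."""
    match = TVC_ID_PATTERN.fullmatch(image_id)
    if match is None:
        return None
    prefix, _first_clip, second_clip, frame = match.groups()
    # Assumption: keep only the second clip name plus the frame number.
    # The real target format must be checked against the converted TVC.json.
    return f"{prefix}_{second_clip}_{frame}"


def coco_id_from_llava_key(cur_image_id):
    """Drop everything before the last underscore, as suggested in the comment."""
    # Note: a key that ends in an extra suffix such as "_2" would yield
    # only that suffix, so such keys need separate handling.
    return cur_image_id.split('_')[-1]
```

For instance, rekey_tvc_image_id applied to the castle id quoted above gives TVC_IMG_castle_s07e09_seg02_clip_02_00009 under this assumption, and coco_id_from_llava_key('LACONV_INS_000000033471') (a hypothetical key) gives 000000033471.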

For other datasets, it would be great to release the processed x.json files (I noticed the egg version is coming soon), since some datasets are too old to acquire or process and some video datasets are large. Thank you!

Luodian commented 1 year ago

Thanks for bringing up these issues.

These seem related to the convert-it process, right? The current convert-it cannot generate image_ids that correspond to the ids in xx_instructions.json and xx_train.json.

We first converted our xx.json for internal use, and then went back and wrote "convert-it" to help users obtain xx.json from public datasets. However, there seem to be some issues with the IDs in this conversion process. We are investigating and appreciate your patience while we address the problem.
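A quick way to see how far the ids diverge is to list the image ids that an instructions file references but the converted image file lacks. The sketch below is a guess at such a check, not part of convert-it. It assumes the instructions JSON has a top-level "data" mapping whose records carry an "image_ids" list, and that the converted xx.json maps image ids to encoded images. Both layouts are assumptions, and the function name is hypothetical.

```python
def find_missing_image_ids(instructions, images):
    """Map each instruction id to the image ids it references that are absent.

    `instructions` and `images` are already-loaded JSON objects (dicts).
    Assumed layout: instructions["data"][ins_id]["image_ids"] is a list of
    keys that should exist in `images`.
    """
    missing = {}
    for ins_id, record in instructions.get("data", {}).items():
        absent = [img for img in record.get("image_ids", []) if img not in images]
        if absent:
            missing[ins_id] = absent
    return missing
```

An empty result would mean every referenced id resolves; any entries point at the ids that need rekeying.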

Updates:

The meta link of LLaVA-In-Context is being updated: meta

zcczhang commented 1 year ago

Saw the PR above; just wondering whether the COCO general difference train and instruction JSON files are available. Thanks!

zcczhang commented 1 year ago

Hi @Luodian, just wondering when the SD (COCO general difference version) instructions and train configs will be ready in the OneDrive folder?

Luodian commented 1 year ago

> Hi @Luodian, just wondering when the SD (COCO general difference version) instructions and train configs will be ready in the OneDrive folder?

Hi, sorry, I didn't see the message last week. The files are already on our side. We will wait for @king159 to do a final check and then expect to release them today.

zcczhang commented 1 year ago

Thanks for the quick response!

Luodian commented 1 year ago

@pufanyi @king159

zcczhang commented 1 year ago

Please let me know when it's ready (and maybe also the E4D egg) so I can download it!

Luodian commented 1 year ago

> Please let me know when it's ready (and maybe also the E4D egg) so I can download it!

Hi, the COCO Difference instructions/train JSON files have been uploaded, and the raw image JSON is uploading now.

zcczhang commented 1 year ago

That sounds great, thanks! I think I have already processed the image JSON file. By the way, will the egg for E4D be available, or is it too large to upload? (Another minor question: I'm not very familiar with OneDrive, but are there any suggestions for downloading directly from the link to a headless server?)

Luodian / Otter

Data issues #172