dvlab-research / MGM

Official repo for "Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models"
Apache License 2.0
3.08k stars, 275 forks

Pretrain data not found in AllaVA #20

Closed lucasjinreal closed 3 months ago

lucasjinreal commented 3 months ago

Hi, the pretraining data uses ALLaVA images from both LAION and VFLAN.

But the image names in the LAION part are completely different from ALLaVA's image filename format.

I tried to find:

465440.jpeg
320609

They are all used in minigemini_pretrain.json but cannot be found in the ALLaVA images folder.

ls -f images | grep 465440
46544031.jpeg
(base) ➜  allava_laion git:(main) ✗ ls -f images | grep 320609     
132060956.jpeg
43206091.jpeg

Why is that?

lucasjinreal commented 3 months ago

How do I actually map these image files?


daicver commented 3 months ago

I've also noticed this issue. It seems that some images were deleted when the ALLaVA dataset was updated. Could the repository owner take a look at the latest ALLaVA dataset and provide relevant information? @yanwei-li

lucasjinreal commented 3 months ago

They were not deleted; the index numbers simply don't match ALLaVA's image names. Besides, I think the image names provided in the Mini-Gemini data are truncated: where is the jpeg suffix? This really shouldn't happen; it is very confusing and misleading for users.

daicver commented 3 months ago

I tried matching the URLs in the Mini-Gemini pretrain dataset against the URLs in the ALLaVA dataset. By that matching, the Mini-Gemini pretrain dataset is missing about 6939 images.
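
A minimal sketch of the URL-based matching described above, for readers who want to reproduce the count. The record layout and the field names `url` and `image` are assumptions for illustration; the actual JSON schemas of the Mini-Gemini and ALLaVA annotation files may differ.

```python
def match_by_url(pretrain_records, allava_records):
    """Map pretrain image names to ALLaVA image paths via a shared URL.

    Both inputs are lists of dicts. The keys "url" and "image" are
    hypothetical; adjust them to the real annotation schema.
    """
    # Index ALLaVA image paths by their source URL.
    url_to_image = {rec["url"]: rec["image"] for rec in allava_records}

    matched = {}   # pretrain image name -> ALLaVA image path
    missing = []   # URLs with no counterpart in the ALLaVA records
    for rec in pretrain_records:
        image = url_to_image.get(rec["url"])
        if image is None:
            missing.append(rec["url"])
        else:
            matched[rec["image"]] = image
    return matched, missing
```

The length of `missing` is the number of pretrain entries that cannot be mapped this way.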

lucasjinreal commented 3 months ago

Some missing images are normal. The question is: do we really have to use the URL to find the correct image id (filename)?

This is ridiculous.

g-h-chen commented 3 months ago

@daicver @lucasjinreal Hi, thanks for using the ALLaVA data. I am from the ALLaVA group. We did make a silent update soon after we released the data, and it seems that the Mini-Gemini project was using the data from before the update. In the original version, the image entries looked like allava_laion/allava_laion_512763 and the image filenames also had no suffix, so the two were mapped correctly. In the current version, we adjusted the annotations and the image filenames together, adding a suffix to both. We will fix this issue soon with the Mini-Gemini team. Stay tuned!
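
Since the update added a suffix to the filenames, one way to check whether a name exists regardless of suffix is an exact lookup by filename stem. This is only a sketch: the helper names `build_stem_index` and `resolve` are hypothetical, and it handles the suffix difference only, not entries whose ids differ or whose images are absent. Unlike `grep 465440`, which also matches `46544031.jpeg`, it compares the whole stem.

```python
from pathlib import Path

def build_stem_index(image_dir):
    """Map filename stems (name without suffix) to the files in image_dir."""
    index = {}
    for path in Path(image_dir).iterdir():
        if path.is_file():
            index.setdefault(path.stem, []).append(path)
    return index

def resolve(index, name):
    """Return files whose stem equals the stem of `name`.

    `name` may be given with or without a suffix, e.g. "512763" or "512763.jpeg".
    An empty list means no file with exactly that stem exists.
    """
    return index.get(Path(name).stem, [])
```

Building the index once and calling `resolve` per annotation entry avoids rescanning the images folder for every name.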

lucasjinreal commented 3 months ago

@g-h-chen Oh, I didn't notice that I had actually downloaded an updated version.

So it looks like Mini-Gemini was using the older data. Just wondering, did the GPT responses also change? Is the newer data a superset of the older one?

If so, we can probably just map the names by URL?

g-h-chen commented 3 months ago

@lucasjinreal

So it looks like Mini-Gemini was using the older data. Just wondering, did the GPT responses also change?

No change in the GPT-4V responses.

Is the newer data a superset of the older one?

No. In short, we only added suffixes to the image filenames and annotation files so that they can be previewed easily.

We can probably just map the names by URL?

Sure, you can do so, but we have also uploaded the images to our repo, which saves you some effort.

lucasjinreal commented 3 months ago

Yes, I downloaded the images.

lucasjinreal commented 3 months ago

Oh, I found that even when mapping by URL, some files still cannot be mapped correctly. Is there any solution?

For example:

/ALLaVA-4V/allava_laion/images/281029 not found

/allava_laion/images/387904

yanwei-li commented 3 months ago

Hi, we are working together with @g-h-chen to align the data and will update the data file soon. Please stay tuned.

lucasjinreal commented 3 months ago

It looks like the situation is not exactly as @g-h-chen described: the new ALLaVA release actually deleted some images that Mini-Gemini used, and I don't know why.

yanwei-li commented 3 months ago

Hi @lucasjinreal @daicver, we have updated the ALLaVA data in our files. Please download them from the original links, Mini-Gemini-Pretrain and Mini-Gemini-Instruction. We will also retrain our model to measure the effect of the data change.

daicver commented 3 months ago

ok, thanks

g-h-chen commented 3 months ago

Dudes, (hopefully) a final comment here:

  1. The id entry of each item for allava_laion (caption and instruction) is not unique. The number of samples is 505588, and the number of unique ids is 484532. However, the contents are not the same for the samples sharing the same id. The reason is that when ALLaVA project started, we tried out some prompting strategies which led to our final version, but we accidentally re-generated those samples when performing large-scale distillation. However, we kept those samples anyway considering the cost.

  2. The Mini-Gemini team and we have updated the aligned data. Sorry for the inconvenience caused!
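
The non-unique ids in point 1 can be checked with a short count. This is a sketch that assumes the annotations are loaded as a list of dicts with an `id` key; the function name `summarize_ids` is hypothetical.

```python
from collections import Counter

def summarize_ids(samples, key="id"):
    """Return (number of samples, number of unique ids, ids that repeat).

    `samples` is assumed to be a list of dicts carrying an id under `key`.
    """
    counts = Counter(sample[key] for sample in samples)
    # Keep only the ids that occur more than once, with their counts.
    duplicated = {i: n for i, n in counts.items() if n > 1}
    return len(samples), len(counts), duplicated
```

Because samples that share an id have different contents, code that builds a dict keyed on `id` alone would silently drop some of them; keeping the samples in a list, or keying on the list position, avoids that.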

lucasjinreal commented 3 months ago

Thank you both for the quick response. Closing, as this has been solved.