WisconsinAIVision / ViP-LLaVA

[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
https://vip-llava.github.io/
Apache License 2.0

Can you provide a ViP-only subset, excluding the original LLaVA mix data? #7

Open lucasjinreal opened 4 months ago

lucasjinreal commented 4 months ago

That way, the model cannot be inadvertently trained on the original LLaVA dataset mixed into the release.

Meanwhile, it looks like the referenced images are missing:

```json
    "id": "vcr-52941",
    "image": "vcr1images/lsmdc_3034_IDES_OF_MARCH/3034_IDES_OF_MARCH_01.27.04.788-01.27.10.308@2.jpg",
    "meta_dir": "./dataset/vcr1images/lsmdc_3034_IDES_OF_MARCH/3034_IDES_OF_MARCH_01.27.04.788-01.27.10.308@2.json",
    "class_names": [
        "person",
```
mu-cai commented 4 months ago

Hello,

Thanks for your interest in our work!

lucasjinreal commented 4 months ago

Hi, is the meta file needed to prepare the training images?

mu-cai commented 4 months ago

The metadata is included in the vcr1images directory, so no need to worry; it is already there.

lucasjinreal commented 4 months ago

I want to use the official LLaVA codebase. Do I just need to add a ViP processor to process the images?

mu-cai commented 4 months ago

Correct. Visual prompt blending is all you need.
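For what it's worth, here is a minimal sketch of what blending a visual prompt into the image could look like before handing it to the standard LLaVA image processor. The helper name, drawing parameters, and ellipse-only prompt are assumptions for illustration, not the repo's actual implementation (ViP-LLaVA also supports arrows, scribbles, masks, etc.):

```python
from PIL import Image, ImageDraw

def blend_visual_prompt(image: Image.Image, bbox, alpha: float = 0.6) -> Image.Image:
    """Alpha-blend a simple visual prompt (a red ellipse around `bbox`)
    onto the image. `bbox` is (x0, y0, x1, y1) in pixel coordinates.
    Hypothetical helper, shown only as a sketch of the idea."""
    base = image.convert("RGBA")
    overlay = base.copy()
    draw = ImageDraw.Draw(overlay)
    draw.ellipse(bbox, outline=(255, 0, 0, 255), width=6)
    # Blend the annotated copy back onto the original so the prompt
    # appears at partial opacity while the rest of the image is unchanged.
    blended = Image.blend(base, overlay, alpha)
    return blended.convert("RGB")

# Usage: blend the prompt first, then run the usual LLaVA preprocessing
# on the resulting image (e.g. CLIP image processor) as you normally would.
img = Image.open("example.jpg")
img_with_prompt = blend_visual_prompt(img, bbox=(100, 80, 260, 240))
```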