Luodian / Otter

🦦 Otter is a multi-modal model based on OpenFlamingo (an open-source version of DeepMind's Flamingo), trained on MIMIC-IT, with improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Question about choosing multi-image input mode and replacing image decoder #279

Open · charlierabea opened this issue 11 months ago

charlierabea commented 11 months ago
Thank you for your interest.

Accomplishing a multi-image response with a single instruction can be easily done by adhering to the dataset format found here: https://github.com/Luodian/Otter/blob/9b34a4467581869c67dae7ea2b970f8e6b201d3c/pipeline/mimicit_utils/mimicit_dataset.py#L432

To achieve this, you may follow these steps:

  1. Format your data following the guidelines provided here: https://github.com/Luodian/Otter/tree/main/mimic-it. Assume the prefix of your instruction id is "MED", like so (see also the sketch after these steps):

    "MED_INS_00001": {
            "instruction":"XXX",
            "answer":"XXX.",
            "image_ids":["XXX",",..."], # The multi-images corresponding to this instruction
            "rel_ins_ids":[], # This value can be []. If you have a multi-round conversation, it should be filled with the instruction ids of the other rounds.
        },
  2. Modify this line from:

    elif cur_train_id.startswith("SD"): 

    to:

    elif cur_train_id.startswith("SD") or cur_train_id.startswith("MED"): 

    This is because your instruction uses the same data format (multi-image, one conversation) as the "Spot-the-difference" data.

  3. Begin tuning Otter on your data by changing your instruction/image/train-config arguments from:

    --mimicit_path="path/to/DC_instruction.json" \
    --images_path="path/to/DC.json" \
    --train_config_path="path/to/DC_train.json" \

    to:

    --mimicit_vt_path="path/to/MED_instruction.json" \
    --images_vt_path="path/to/MED.json" \

    If you have any further inquiries, don't hesitate to reach out via email. We can also add you to our Slack community for more immediate communication.
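Putting steps 1 and 2 together, here is a minimal sketch of writing an instruction file in this format. It is not code from the Otter repo: the file name, the "MED" prefix, and the ids, instruction, and answer strings are placeholders for your own data; only the structure mirrors the MIMIC-IT format above.

    import json

    # Skeleton of a MIMIC-IT-style instruction file using the "MED" prefix.
    # All ids and text below are placeholders.
    instructions = {
        "meta": {"version": "", "time": "", "author": ""},
        "data": {
            "MED_INS_00001": {
                "instruction": "Describe the findings in these slices.",
                "answer": "Placeholder answer.",
                # every id listed here must also appear in MED.json
                "image_ids": ["MED_IMG_1", "MED_IMG_2"],
                # empty unless this instruction is part of a multi-round conversation
                "rel_ins_ids": [],
            }
        },
    }

    with open("MED_instruction.json", "w") as f:
        json.dump(instructions, f, indent=2)

With the "MED" prefix registered as in step 2, this file can then be passed via --mimicit_vt_path as in step 3.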

Originally posted by @ZhangYuanhan-AI in https://github.com/Luodian/Otter/issues/234#issuecomment-1665564520

I was delighted to stumble upon this remarkable project. Thank you for your valuable contribution.

I am now working on a medical image captioning task (multiple slices and one description per patient). Following the comment above, I formatted the training data as MED.json and MED_instruction.json. Here is what the instruction JSON looks like:

    {
        "meta": { "version": "", "time": "", "author": "" },
        "data": {
            "test_INS_00000": {
                "instruction": "",
                "answer": ".\n ",
                "image_ids": [
                    "MED_IMG_1", "MED_IMG_2", "MED_IMG_3", "MED_IMG_4", "MED_IMG_5", "MED_IMG_6",
                    "MED_IMG_7", "MED_IMG_8", "MED_IMG_9", "MED_IMG_10", "MED_IMG_11", "MED_IMG_12",
                    "MED_IMG_13", "MED_IMG_14", "MED_IMG_15", "MED_IMG_16", "MED_IMG_17", "MED_IMG_18",
                    "MED_IMG_19", "MED_IMG_20", "MED_IMG_21", "MED_IMG_22", "MED_IMG_23", "MED_IMG_24"
                ],
                "rel_ins_ids": []
            },
            .....
        }
    }
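As a quick check before training, a short script along these lines can confirm that every image id referenced in the instruction file also has an entry in the images file. The file names are placeholders, and it assumes MED.json maps image ids directly to encoded images, as in the MIMIC-IT image files.

    import json

    # Load the instruction file and the images file (names are placeholders).
    with open("MED_instruction.json") as f:
        instructions = json.load(f)["data"]
    with open("MED.json") as f:
        images = json.load(f)  # assumed layout: {image_id: encoded image, ...}

    # Collect every referenced image id that has no entry in MED.json.
    missing = {
        image_id
        for entry in instructions.values()
        for image_id in entry["image_ids"]
        if image_id not in images
    }
    print(f"{len(instructions)} instructions checked, {len(missing)} missing image ids")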

I'm using the 0817 commit of Otter, and I've successfully generated captions and evaluated them with BLEU and CIDEr. However, I noticed by accident that VQA mode performs on par with SD mode, and that different instructions lead to quite different performance. Does that mean SD mode doesn't suit my training scenario, and that VQA mode can help me test my instructions?
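For context, BLEU and CIDEr scores of this kind can be computed with the pycocoevalcap package along the lines of the sketch below; the ids and captions are placeholders, and this is not necessarily the exact script used here.

    from pycocoevalcap.bleu.bleu import Bleu
    from pycocoevalcap.cider.cider import Cider

    # References and generated captions, keyed by instruction id (placeholders).
    gts = {"test_INS_00000": ["ground-truth report for this patient"]}
    res = {"test_INS_00000": ["generated report for this patient"]}

    bleu_scores, _ = Bleu(4).compute_score(gts, res)  # BLEU-1 through BLEU-4
    cider_score, _ = Cider().compute_score(gts, res)
    print("BLEU-4:", bleu_scores[3], "CIDEr:", cider_score)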

Furthermore, I'm trying to swap in the BiomedCLIP vision encoder, as the LLaVA-Med paper did. However, the 0817 instruction_following.py has no customized_config statement, and adding the customized_config statements from the 0830 commit's instruction_following.py does nothing: the resulting checkpoint config still lists CLIP.

Here is the config.json I created, as the 0830 commit suggests:

    {
        "model_type": "otter",
        "cross_attn_every_n_layers": 4,
        "tie_word_embeddings": false,
        "use_media_placement_augmentation": true,
        "only_attend_previous": true,
        "text_config": {
            "_name_or_path": "luodian/llama-7b-hf",
            "model_type": "llama"
        },
        "vision_config": {
            "_name_or_path": "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224",
            "model_type": "clip_vision_model",
            "hidden_size": 768,
            "intermediate_size": 3072,
            "num_attention_heads": 12,
            "num_hidden_layers": 12,
            "image_size": 224,
            "patch_size": 16
        }
    }
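As a side check, the vision_config block above can be parsed with transformers' CLIPVisionConfig, since it follows the CLIP ViT-B/16 layout; this only validates the dimensions, it does not change which weights actually get loaded.

    from transformers import CLIPVisionConfig

    # Mirror the vision_config block from config.json and check basic consistency.
    vision_config = CLIPVisionConfig(
        hidden_size=768,
        intermediate_size=3072,
        num_attention_heads=12,
        num_hidden_layers=12,
        image_size=224,
        patch_size=16,
    )
    assert vision_config.hidden_size % vision_config.num_attention_heads == 0
    print((vision_config.image_size // vision_config.patch_size) ** 2)  # 196 patches per image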

Looking forward to exploring this topic further and to citing you and your colleagues in any resulting publication!

ZhangYuanhan-AI commented 10 months ago

> Does that mean the SD mode doesn't suit my training scenario, and VQA mode can help me test my instructions?

In your case, where one instruction is paired with multiple images, we recommend using SD mode. Even though SD mode and VQA mode may reach similar performance here, SD mode is the logically appropriate choice for how your data is constructed.

charlierabea commented 10 months ago

Thank you so much for your reply. We'll continue with our SD experiment. Regarding the vision encoder, do you have any solution for replacing it?

ZhangYuanhan-AI commented 10 months ago

Maybe one solution is to inject the parameters of "microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224" into the Otter checkpoint.
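A rough sketch of that idea is below. It is not a tested recipe: the "vision_encoder." prefix, the checkpoint paths, and the commented-out key pair are assumptions, and because open_clip/timm parameter names differ from the transformers CLIPVisionModel names Otter uses, a per-key mapping has to be worked out by inspecting both state dicts.

    import torch
    import open_clip

    # 1. Load the BiomedCLIP image tower via open_clip.
    biomed, _, _ = open_clip.create_model_and_transforms(
        "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
    )
    biomed_sd = biomed.visual.state_dict()

    # 2. Load the Otter checkpoint weights (path and layout are placeholders).
    otter_sd = torch.load("path/to/otter_checkpoint.pt", map_location="cpu")

    # 3. Inspect both sides to work out the key mapping.
    print([k for k in otter_sd if k.startswith("vision_encoder.")][:5])
    print(list(biomed_sd)[:5])

    # 4. Overwrite each mapped tensor (the key pair below is hypothetical),
    #    then save the patched checkpoint for loading into Otter.
    # otter_sd["vision_encoder.embeddings.patch_embedding.weight"] = biomed_sd["trunk.patch_embed.proj.weight"]
    torch.save(otter_sd, "path/to/otter_biomedclip.pt")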