TIGER-AI-Lab / Mantis

Official code for Paper "Mantis: Multi-Image Instruction Tuning"
https://tiger-ai-lab.github.io/Mantis/
Apache License 2.0

Idefics2 full fine-tuning getting RuntimeError: shape mismatch #5

Closed chris-tng closed 3 months ago

chris-tng commented 3 months ago

I'm working on fine-tuning Idefics2 with multiple images per instruction. I followed this script for full fine-tuning: https://github.com/TIGER-AI-Lab/Mantis/blob/89d34077bd87b66eaadc13117add553e3a3d4c0b/mantis/train/scripts/train_idefics2_full.sh

Here is the command:

NCCL_DEBUG=WARN accelerate launch --config_file=./accelerate_configs/accelerate_config_zero3.yaml \
    --machine_rank 0 --main_process_ip 10.29.35.44 --main_process_port 12956 \
    --num_machines 1 --num_processes 8 \
    train_idefics2.py \
    --model_name_or_path HuggingFaceM4/idefics2-8b \
    --data_config_file custom_data_config.yaml \
    --data_format chat \
    --run_name 240523_idefics2_mantis \
    --output_dir 240523_idefics2_mantis \
    --num_train_epochs 3 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "steps" \
    --save_strategy "steps" \
    --save_steps 200 \
    --eval_steps 200 \
    --save_total_limit 5 \
    --learning_rate 2e-5 \
    --weight_decay 0.01 \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --gradient_checkpointing True \
    --dataloader_num_workers 5 \
    --report_to wandb \
    --do_train \
    --lora_enabled False \
    --qlora_enabled False \
    --dora_enabled False \
    --max_seq_len 512 \
    --fp16 \
    --attn_implementation eager

The error I got is:

[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1677, in forward
[rank0]:     inputs_embeds = self.inputs_merger(
[rank0]:   File "/home/user/Mantis/mantis/models/idefics2/modeling_idefics2.py", line 1564, in inputs_merger
[rank0]:     new_inputs_embeds[special_image_token_mask] = reshaped_image_hidden_states
[rank0]: RuntimeError: shape mismatch: value tensor of shape [256, 4096] cannot be broadcast to indexing result of shape [192, 4096]

Any suggestions on how to fix it?

Thanks in advance.

jdf-prog commented 3 months ago

It looks like 192 = 64 * 3 and 256 = 64 * 4. Judging from the error message, there is probably a missing <image> placeholder in the text: the image features cover 4 images, but the text only has placeholder positions for 3. Check your input data and make sure the number of <image> tokens in the text matches the number of images.
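A minimal sketch of that data check, assuming each sample is a dict with a "text" string and an "images" list. Those field names and the `find_mismatched_samples` helper are hypothetical, not part of the Mantis code, so adapt them to your own data format:

```python
IMAGE_PLACEHOLDER = "<image>"  # placeholder string mentioned in the comment above


def find_mismatched_samples(samples):
    """Return (index, n_placeholders, n_images) for every sample whose counts differ.

    Assumes each sample looks like {"text": str, "images": list};
    this layout is hypothetical and may not match your dataset.
    """
    mismatches = []
    for idx, sample in enumerate(samples):
        n_placeholders = sample["text"].count(IMAGE_PLACEHOLDER)
        n_images = len(sample["images"])
        if n_placeholders != n_images:
            mismatches.append((idx, n_placeholders, n_images))
    return mismatches


samples = [
    {"text": "<image> <image> Compare the two pictures.",
     "images": ["a.jpg", "b.jpg"]},
    # Only 3 placeholders for 4 images: this sample should be reported.
    {"text": "<image> <image> <image> Which one differs?",
     "images": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"]},
]

print(find_mismatched_samples(samples))  # [(1, 3, 4)]
```

If this reports nothing for your dataset, the raw data is consistent and the mismatch is more likely introduced later, for example by truncation.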

chris-tng commented 3 months ago

Thank you for your reply, @jdf-prog. That is likely the cause. I have resolved it by using cpu_offload. I think truncation was cutting off tokens, which led to the error. Increasing max_seq_len leads to OOM on its own, so cpu_offload is needed; training now runs fine but is very slow.

Sorry for the newbie question, as I'm new to VLMs. Do you have a rule of thumb for estimating VRAM usage for idefics2-8b with multiple images (I'm using 4 per prompt)?

Also, I see that max_image_size='(1080,1920)'. Do I need to resize images whose resolution is larger than this?
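The truncation hypothesis can be illustrated with a small, self-contained sketch. It assumes 64 image tokens per image (inferred from 256 tokens for 4 images in the error), truncation from the right, and a made-up token layout. `IMAGE_TOKEN_ID`, `build_token_ids`, and `surviving_image_tokens` are hypothetical names for illustration, not the real tokenizer or processor:

```python
TOKENS_PER_IMAGE = 64   # assumption: 256 image tokens / 4 images in the error above
IMAGE_TOKEN_ID = -1     # hypothetical id standing in for the real image token id
TEXT_TOKEN_ID = 0       # hypothetical id standing in for any text token


def build_token_ids(n_images, n_text_tokens_per_turn):
    """Build a toy sequence: each image's tokens followed by some text tokens."""
    ids = []
    for _ in range(n_images):
        ids.extend([IMAGE_TOKEN_ID] * TOKENS_PER_IMAGE)
        ids.extend([TEXT_TOKEN_ID] * n_text_tokens_per_turn)
    return ids


def surviving_image_tokens(token_ids, max_seq_len):
    """Count image tokens left after right-side truncation to max_seq_len."""
    return sum(1 for t in token_ids[:max_seq_len] if t == IMAGE_TOKEN_ID)


token_ids = build_token_ids(n_images=4, n_text_tokens_per_turn=110)
expected = 4 * TOKENS_PER_IMAGE                      # features produced for 4 images
kept = surviving_image_tokens(token_ids, max_seq_len=512)

print(len(token_ids), expected, kept)  # 696 256 192
```

In this toy layout the fourth image starts at position 522, so truncating to 512 tokens drops it entirely. The image features still cover 4 images (256 rows) while only 192 placeholder positions remain, the same pair of numbers as in the traceback.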