Adding notebook for Llava-OneVision on multi-image task

nicokossmann commented 1 month ago

Hey @zucchini-nlp and @NielsRogge👋,

I created a notebook for fine-tuning Llava-OneVision-0.5b-ov-hf on the BLINK Benchmark, based on the notebook of LLAVA-NeXT. This notebook could be helpful for other folks to get an introduction to multi-image tasks with Llava-OneVision. During the implementation, a few questions arose:

How to pass images of size 384x384 only once and not additionally the global patch (especially helpful with multiple images)?
Why do we need the input type f32 instead of f16 when we load the trained model?
And last but not least, do you have any tips on how to reduce the size of the input_ids (I saw there are some interesting parameters like vision_feature_select_strategy or num_image_tokens)?

zucchini-nlp commented 1 month ago

Hey @nicokossmann !

Great, the training should be very similar to llava-next yes. You can also use this library (https://github.com/zjysteven/lmms-finetune) for fine-tuning VLMs. Regarding the questions:

Hmm, I don't think we have an option to disable patching and just pass the base image. Now that I think about it, probably we should support that for multi-image as the paper mentioned something similar to training on base image only. Lemme check if it is possible to easily integrate that or not
I am not sure I get it, in the notebook the model is loaded either in 4-bits or fp16 with FA2. You don't have to load the model in full precision to finetune, neither for inference
For LLaVA-OneVision you can either try pooling, which is already used for video inputs and I'll see how to enable passing only the base image. There are more methods for reducing number of tokens in other models/papers, but those are not available through transformers implementation and would require you to overwrite the forward pass. For ex PixelShuffle is a common method or there is also https://github.com/bfshi/scaling_on_scales. num_image_tokens does not reduce anything and should reflect the actual number of tokens each image will take after ViT backbone. That is used only to add placeholder tokens which are later replaced with actual image embeddings. And vision_feature_select_strategy can help to reduce token count by 1 if you indicate default, that is the case when we remove CLS token from image embeddings

nicokossmann commented 1 month ago

@zucchini-nlp Thanks for your quick response.

Your feedback on the questions was extremely helpful.

With regard to the second question, I orientated myself on the provided notebook. We load the base model with the corresponding adapters for inference.

# Load the base model with adapters on top
model =  LlavaOnevisionForConditionalGeneration.from_pretrained(
    "nicokossmann/Llava-OneVision-blink",
    torch_dtype=torch.float16,
    # torch_dtype=torch.float32,
    quantization_config=quantization_config,
)

However, if I use fp16, I get the error:

Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

I also noticed that I made a spelling mistake, which means that I can no longer train the model because the input_ids has grown to a size of torch.Size([1, 4545]) for 3 images an , which means that I can no longer train it on my current GPU. Therefore the implementation of the base image is even more important.

zucchini-nlp commented 1 month ago

Oh i see, the message is saying your inputs are in fp32 and you prob have to manually casr ipnut to fp16 in data collation/preparation step as inputs.to(torch.float16)

For the base image, noted and I'll add it to my TODO list. If you want to give it a try yourself, please feel free to open a PR and tag me 😄

nicokossmann commented 1 month ago

I believe this is a common issue with the base image in many models 😅

I am currently working with the Phi-3.5-vision-instruct model and have encountered the same issue. Despite being able to set the number of crops via a parameter, I consistently receive pixel_values of shape (4, 2, 3, 336, 336) for four images with size of 336x336 (base image).

zucchini-nlp commented 1 month ago

@nicokossmann I would day it depends on whether the model should be supporting base image only setting, because some models like llava-next are never tuned with only one image. If you want to tune Llava with more freedom for different parameters, I'd recommend to use the official repo (LLaVA-VL) which allows setting any combination of params. Later it can be converted to HF format for inference :)

For Phi-3.5, if you believe the model should support base image only, feel free to open a discussion on the hub. Since the model is trust_remote_code, it is maintained by Microsoft and not our team

nicokossmann commented 3 weeks ago

@zucchini-nlp,

I tried to fix the problem with the base image support, but I've got stuck on an error message that I can't solve

  File "/opt/conda/envs/llava/lib/python3.11/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/llava/lib/python3.11/site-packages/transformers/models/llava_onevision/modeling_llava_onevision.py", line 632, in forward
    inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

I have two base images (384, 384) for the model and get input_ids of shape (1, 1541) and pixel_values of shape (2, 1, 3 384, 384). From the input_ids are 1512 default image ids. I have debugged the code and got the shapes (1, 1541, 896) for the input_embeds, image_feature and for the special_image_mask (2, 729, 896).

Do you have any idea what the error could be?

NielsRogge / Transformers-Tutorials

Adding notebook for Llava-OneVision on multi-image task #470