nicokossmann opened 1 month ago
Hey @nicokossmann!
Great, the training should be very similar to llava-next, yes. You can also use this library (https://github.com/zjysteven/lmms-finetune) for fine-tuning VLMs. Regarding the questions:
`num_image_tokens` does not reduce anything and should reflect the actual number of tokens each image will take after the ViT backbone. That is used only to add placeholder tokens, which are later replaced with actual image embeddings. And `vision_feature_select_strategy` can help to reduce the token count by 1 if you indicate `default`; that is the case when we remove the CLS token from the image embeddings.
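For reference, a minimal sketch of where these two settings live on the processor (the checkpoint name is only an assumption for illustration):

```python
from transformers import AutoProcessor

# Assumed checkpoint, for illustration only
processor = AutoProcessor.from_pretrained("llava-hf/llava-onevision-qwen2-0.5b-ov-hf")

# num_image_tokens only controls how many <image> placeholder tokens are inserted
# into input_ids per image; it should match what the ViT backbone actually outputs.
print(processor.num_image_tokens)

# vision_feature_select_strategy="default" drops the CLS token from the image
# embeddings (one token fewer per image), while "full" keeps all of them.
print(processor.vision_feature_select_strategy)
```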
@zucchini-nlp Thanks for your quick response. Your feedback on the questions was extremely helpful.
With regard to the second question, I followed the provided notebook. We load the base model with the corresponding adapters for inference:
```python
import torch
from transformers import LlavaOnevisionForConditionalGeneration

# Load the base model with adapters on top
# (quantization_config is defined earlier in the notebook)
model = LlavaOnevisionForConditionalGeneration.from_pretrained(
    "nicokossmann/Llava-OneVision-blink",
    torch_dtype=torch.float16,
    # torch_dtype=torch.float32,
    quantization_config=quantization_config,
)
```
However, if I use fp16, I get the error:
```
Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same
```
I also noticed that I made a spelling mistake, which means the `input_ids` have grown to a size of `torch.Size([1, 4545])` for 3 images, so I can no longer train the model on my current GPU. Therefore the implementation of the base image is even more important.
Oh I see, the message is saying your inputs are in fp32, and you probably have to manually cast the input to fp16 in the data collation/preparation step as `inputs.to(torch.float16)`.
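A minimal sketch of that cast in the collator (the `prompt`/`images` field names and the in-scope `processor` are assumptions):

```python
import torch

def collate_fn(examples):
    texts = [ex["prompt"] for ex in examples]    # hypothetical dataset fields
    images = [ex["images"] for ex in examples]
    batch = processor(text=texts, images=images, padding=True, return_tensors="pt")
    # BatchFeature.to(dtype) casts only the floating-point tensors (pixel_values);
    # input_ids and attention_mask keep their integer dtypes.
    return batch.to(torch.float16)
```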
For the base image, noted and I'll add it to my TODO list. If you want to give it a try yourself, please feel free to open a PR and tag me 😄
I believe this is a common issue with the base image in many models 😅
I am currently working with the Phi-3.5-vision-instruct model and have encountered the same issue. Despite being able to set the number of crops via a parameter, I consistently receive `pixel_values` of shape `(4, 2, 3, 336, 336)` for four images of size 336x336 (base image).
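For context, a rough sketch of the setup, following the usage shown on the Phi-3.5-vision-instruct model card (the dummy images, the prompt, and the `num_crops` value are illustrative assumptions):

```python
from PIL import Image
from transformers import AutoProcessor

# num_crops=1 is the setting one would expect to disable extra cropping
processor = AutoProcessor.from_pretrained(
    "microsoft/Phi-3.5-vision-instruct", trust_remote_code=True, num_crops=1
)

images = [Image.new("RGB", (336, 336)) for _ in range(4)]   # four dummy base images
prompt = "<|image_1|>\n<|image_2|>\n<|image_3|>\n<|image_4|>\nCompare the images."

inputs = processor(prompt, images, return_tensors="pt")
print(inputs["pixel_values"].shape)  # reported: torch.Size([4, 2, 3, 336, 336])
```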
@nicokossmann I would say it depends on whether the model should be supporting a base-image-only setting, because some models like llava-next are never tuned with only one image. If you want to tune Llava with more freedom for different parameters, I'd recommend using the official repo (LLaVA-VL), which allows setting any combination of params. Later it can be converted to HF format for inference :)
For Phi-3.5, if you believe the model should support base image only, feel free to open a discussion on the hub. Since the model is `trust_remote_code`, it is maintained by Microsoft and not our team.
@zucchini-nlp,
I tried to fix the problem with the base image support, but I got stuck on an error message that I can't solve:
File "/opt/conda/envs/llava/lib/python3.11/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/llava/lib/python3.11/site-packages/transformers/models/llava_onevision/modeling_llava_onevision.py", line 632, in forward
inputs_embeds = inputs_embeds.masked_scatter(special_image_mask, image_features)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I have two base images (384, 384) for the model and get `input_ids` of shape `(1, 1541)` and `pixel_values` of shape `(2, 1, 3, 384, 384)`. Of the `input_ids`, 1512 are default image token ids.
I have debugged the code and got the shape `(1, 1541, 896)` for `inputs_embeds` and `image_features`, and `(2, 729, 896)` for the `special_image_mask`.
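For context, a toy sketch (not the model code) of the constraint that `masked_scatter` enforces; the numbers are made up just to show the mismatch pattern:

```python
import torch

inputs_embeds = torch.zeros(1, 10, 4)                 # (batch, seq_len, hidden)
mask = torch.zeros(1, 10, 4, dtype=torch.bool)
mask[:, :6, :] = True                                 # 6 placeholder positions -> 24 values to fill
image_features = torch.ones(5, 4)                     # only 5 feature vectors -> 20 values available

# Fails on recent PyTorch versions because the source supplies fewer values
# than the mask selects.
inputs_embeds.masked_scatter(mask, image_features)
```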
Do you have any idea what the error could be?
Hey @zucchini-nlp and @NielsRogge 👋,
I created a notebook for fine-tuning Llava-OneVision-0.5b-ov-hf on the BLINK Benchmark, based on the LLaVA-NeXT notebook. This notebook could be helpful for other folks as an introduction to multi-image tasks with Llava-OneVision. During the implementation, a few questions arose: