LLaVA-VL / LLaVA-NeXT


Weight size mismatch when load_pretrain for llava_onevision (0.5b model) #148

Open zeal-up opened 4 weeks ago

zeal-up commented 4 weeks ago

Hi~ I have recently been trying to use the llava_onevision model, following the onevision tutorial, which seems pretty easy. I ran the program exactly as in the tutorial, with the 0.5b_si model. However, a ValueError is raised when loading the checkpoint. The loading code and the full error log are below.
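Roughly the loading step from the tutorial (a sketch; the local checkpoint path and device settings are specific to my setup, and I pass no extra llava_model_args):

    from llava.model.builder import load_pretrained_model

    # local copy of the 0.5b_si checkpoint (path is specific to my machine)
    pretrained = "/home/docker/.cache/huggingface/manually_hub/llava-onevision-qwen2-0.5b-si"
    model_name = "llava_qwen"  # as in the tutorial
    device_map = "auto"

    # this is the call that raises the ValueError below
    tokenizer, model, image_processor, max_length = load_pretrained_model(
        pretrained, None, model_name, device_map=device_map
    )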

Please install pyav to use video processing functions.                                                                                                                                  
Loaded LLaVA model: /home/docker/.cache/huggingface/manually_hub/llava-onevision-qwen2-0.5b-si                                                                                          
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
You are using a model of type llava to instantiate a model of type llava_qwen. This is not supported for all configurations of models and can yield errors.
Loading vision tower: /home/docker/.cache/huggingface/manually_hub/siglip-so400m-patch14-384                                                                                            
You are using a model of type siglip to instantiate a model of type clip_vision_model. This is not supported for all configurations of models and can yield errors.
Some weights of CLIPVisionModel were not initialized from the model checkpoint at /home/docker/.cache/huggingface/manually_hub/siglip-so400m-patch14-384 and are newly initialized because the shapes did not match:
- vision_model.embeddings.patch_embedding.weight: found shape torch.Size([1152, 3, 14, 14]) in the checkpoint and torch.Size([768, 3, 32, 32]) in the model instantiated
- vision_model.embeddings.position_embedding.weight: found shape torch.Size([729, 1152]) in the checkpoint and torch.Size([50, 768]) in the model instantiated
- vision_model.encoder.layers.0.layer_norm1.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([768]) in the model instantiated
- vision_model.encoder.layers.0.layer_norm1.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([768]) in the model instantiated
- vision_model.encoder.layers.0.layer_norm2.bias: found shape torch.Size([1152]) in the checkpoint and torch.Size([768]) in the model instantiated
- vision_model.encoder.layers.0.layer_norm2.weight: found shape torch.Size([1152]) in the checkpoint and torch.Size([768]) in the model instantiated

...
...
...

Traceback (most recent call last):
  File "/XBrain/xcaption/examples/test_model/llava_onevision.py", line 23, in <module>
    tokenizer, model, image_processor, max_length = load_pretrained_model(pretrained, None, model_name, device_map=device_map)  # Add any other thing you want to pass in llava_model_args
  File "/XBrain/third_party/LLaVA-NeXT-org/llava/model/builder.py", line 224, in load_pretrained_model
    model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 3754, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4214, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/opt/conda/lib/python3.10/site-packages/transformers/modeling_utils.py", line 887, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/opt/conda/lib/python3.10/site-packages/accelerate/utils/modeling.py", line 358, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1152, 3, 14, 14]) in "weight" (which has shape torch.Size([768, 3, 32, 32])), this look incorrect.

It seems that no one has reported this issue yet.

Luodian commented 4 weeks ago

#144

Can you check that discussion and update your code, then try again? It's weird that you are loading with a CLIP model's architecture.

jxgu1016 commented 3 weeks ago

Same issue here.

SorryMaker2022 commented 3 weeks ago

Loading vision tower: /home/docker/.cache/huggingface/manually_hub/siglip-so400m-patch14-384

It seems that you're loading your vision tower from a local path, so it gets loaded as a CLIP vision tower rather than a SigLIP vision tower.

https://github.com/LLaVA-VL/LLaVA-NeXT/blob/e98849102929e1c6304b60b28cca541567b7b643/llava/model/multimodal_encoder/builder.py#L15-L23

Swapping the if and elif predicates may solve the problem (see the sketch below), but note that this modification may have side effects on other vision towers.
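Something along these lines (a sketch of the idea, not the exact upstream builder; class names, constructor keywords, and import paths are from memory of the linked file and may differ slightly):

    import os

    from .clip_encoder import CLIPVisionTower
    from .siglip_encoder import SigLipVisionTower


    def build_vision_tower(vision_tower_cfg, **kwargs):
        vision_tower = getattr(vision_tower_cfg, "mm_vision_tower",
                               getattr(vision_tower_cfg, "vision_tower", None))

        # Check for SigLIP first so that a local path containing "siglip" is not
        # swallowed by the generic "local path exists -> CLIP" branch.
        if "siglip" in vision_tower.lower():
            return SigLipVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
        elif os.path.exists(vision_tower) or vision_tower.startswith("openai") or vision_tower.startswith("laion"):
            return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)

        raise ValueError(f"Unknown vision tower: {vision_tower}")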

JinhuiYE commented 1 week ago

This works well, but I am unsure why generating with output_attentions and return_dict_in_generate causes NaNs and the output ["!"]. There is no such issue with the 0.5B and 72B models.

    return_dict_in_generate = model.generate(
        input_ids,
        images=image_tensors,
        attention_mask=attention_masks,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=True,
        # output_attentions=True, return_dict_in_generate=True,
        **gen_kwargs,
    )
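
For reference, this is roughly how I inspect the result when those two flags are enabled (a sketch; it assumes LLaVA's generate forwards the flags to the standard Hugging Face generate, so the result has .sequences and .attentions):

    import torch

    out = model.generate(
        input_ids,
        images=image_tensors,
        attention_mask=attention_masks,
        pad_token_id=tokenizer.pad_token_id,
        use_cache=True,
        output_attentions=True,
        return_dict_in_generate=True,
        **gen_kwargs,
    )

    # out.sequences holds the generated token ids; out.attentions is a tuple
    # (one entry per generation step) of per-layer attention tensors.
    text = tokenizer.batch_decode(out.sequences, skip_special_tokens=True)
    has_nan = any(torch.isnan(layer).any() for step in out.attentions for layer in step)
    print(text, "| NaNs in attentions:", has_nan)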