WenjunHuang94 / ML-Mamba

ML-Mamba: Efficient Multi-Modal Large Language Model Utilizing Mamba-2

Dinosiglip_vit torch.cat Error: Dimension out of range (expected to be in range of [-2, 1], but got 2) #4

Closed: YongLD closed this issue 1 month ago

YongLD commented 1 month ago
09/10 [07:43:21] INFO | >> [*] Loading from local path `/code/Basemodel/ML-Mamba`  (load.py:52)
                 INFO | >> [*] Found Config =>> Loading & Freezing mlmamba+3b with:  (load.py:87)
                               Vision Backbone =>> dinosiglip-vit-so-384px
                               LLM Backbone    =>> mamba2-2.7b
                               Arch Specifier  =>> no-align+fused-gelu-mlp
                               Checkpoint Path =>> `/code/Basemodel/ML-Mamba/latest-checkpoint.pt`
                 INFO | >> [*] Loading Pretrained LLM mamba2-2.7b via HF Transformers  (load.py:96)
                 INFO | >>     |=> Building empty mamba2 LLM from `state-spaces/mamba2-2.7b`  (base_llm.py:132)
state-spaces/mamba2-2.7b
09/10 [07:43:24] INFO | >>     |=> Loading mamba2 (Fast) Tokenizer via the AutoTokenizer API  (base_llm.py:161)
                 INFO | >> [*] Loading Vision Backbone dinosiglip-vit-so-384px  (load.py:105)
09/10 [07:44:42] INFO | >> Loading pretrained weights from Hugging Face hub (timm/vit_large_patch14_reg4_dinov2.lvd142m)  (_builder.py:186)
                 INFO | >> Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.  (_hub.py:180)
                 INFO | >> Resized position embedding: (37, 37) to (27, 27).  (pos_embed.py:55)
09/10 [07:46:30] INFO | >> Loading pretrained weights from Hugging Face hub (('timm/ViT-SO400M-14-SigLIP-384', 'open_clip_pytorch_model.bin'))  (_builder.py:186)
09/10 [07:46:31] INFO | >> Safe alternative available for 'open_clip_pytorch_model.bin' (as 'open_clip_model.safetensors'). Loading weights using safetensors.  (_hub.py:180)
09/10 [07:46:47] INFO | >> [*] Loading VLM mlmamba+3b from Checkpoint; Freezing Weights 🥶  (load.py:113)
/code/Basemodel/ML-Mamba/latest-checkpoint.pt
Traceback (most recent call last):
  File "/code/ML-Mamba-main/test2.py", line 49, in <module>
    generated_text = vlm.generate(**generate_params)
  File "/root/anaconda3/envs/mamba/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/code/ML-Mamba-main/mlmamba/models/vlms/mlmamba.py", line 628, in generate
    generated_ids = self.mamba_generate(
  File "/code/ML-Mamba-main/mlmamba/models/vlms/mlmamba.py", line 132, in mamba_generate
    return MambaGenerationMixin.generate(self, *args, **kwargs)
  File "/code/ML-Mamba-main/mlmamba/models/mamba/modeling_mamba.py", line 213, in generate
    output = decode(
  File "/root/anaconda3/envs/mamba/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/code/ML-Mamba-main/mlmamba/models/mamba/modeling_mamba.py", line 170, in decode
    scores.append(get_logits(sequences[-1], inference_params))
  File "/code/ML-Mamba-main/mlmamba/models/mamba/modeling_mamba.py", line 129, in get_logits
    logits = model(
  File "/root/anaconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/code/ML-Mamba-main/mlmamba/models/vlms/mlmamba.py", line 395, in forward
    patch_features = self.vision_backbone({k: pixel_values[k][multimodal_indices] for k in pixel_values})  # The result dimension is 2716, where the output features of the image from two image models are merged
  File "/root/anaconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/root/anaconda3/envs/mamba/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/code/ML-Mamba-main/mlmamba/models/backbones/vision/dinosiglip_vit.py", line 143, in forward
    return torch.cat([dino_patches[0], siglip_patches[0]], dim=2)
IndexError: Dimension out of range (expected to be in range of [-2, 1], but got 2)

It seems that there is an issue with the visual module. I haven't made any changes to the code during this process. Could this bug be caused by the model update?

YongLD commented 1 month ago

The bug has been fixed. In ML-Mamba-main/mlmamba/models/backbones/vision/dinosiglip_vit.py, I changed `return torch.cat([dino_patches[0], siglip_patches[0]], dim=2)` to `return torch.cat([dino_patches, siglip_patches], dim=2)`.
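
For anyone hitting the same error, a minimal sketch of why the indexing fails. The shapes below are assumptions for illustration (roughly 1024-dim DINOv2 ViT-L patches and 1152-dim SigLIP SO400M patches), not values taken from the repo; the key point is what `[0]` indexes into:

import torch

B, N = 1, 729  # batch size and patch count (illustrative)

# Presumably, older timm releases returned a tuple/list of intermediate
# features, so dino_patches[0] was the full (B, N, D) tensor and dim=2 was valid:
dino_patches = (torch.randn(B, N, 1024),)
siglip_patches = (torch.randn(B, N, 1152),)
fused = torch.cat([dino_patches[0], siglip_patches[0]], dim=2)  # (B, N, 2176)

# Newer releases hand back the (B, N, D) tensor directly; indexing with [0]
# then strips the batch dimension, leaving a 2-D tensor that has no dim=2:
dino_patches = torch.randn(B, N, 1024)
siglip_patches = torch.randn(B, N, 1152)
# torch.cat([dino_patches[0], siglip_patches[0]], dim=2)  # IndexError
fused = torch.cat([dino_patches, siglip_patches], dim=2)  # works: (B, N, 2176)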

WenjunHuang94 commented 1 month ago

I apologize for the delay in responding. Your modification is indeed correct. I have also added comments in the code to reflect this change. The issue was caused by differences in library versions. Thank you for your understanding.

YongLD commented 1 month ago

One more question: I saw that there are some additional datasets in the project, such as [LVIS-Instruct-4V] and [LRV-Instruct], but I don't find them mentioned in the paper. Did you use these datasets in the fine-tuning stage?

YongLD commented 1 month ago

By the way, the code in scripts/pretrain.py:

dist.init_process_group(backend='nccl')

would be better changed to the following:

if not dist.is_initialized():
    dist.init_process_group(backend='nccl')
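
For reference, a minimal sketch of how that guard could be wrapped as a helper (the function name is illustrative, not from the repo):

import torch.distributed as dist

def init_distributed(backend: str = "nccl") -> None:
    # Guarding on is_initialized() avoids the RuntimeError that
    # init_process_group raises when the default process group has
    # already been set up (e.g. by a launcher or an earlier call).
    if dist.is_available() and not dist.is_initialized():
        dist.init_process_group(backend=backend)
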
WenjunHuang94 commented 1 month ago

> One more question: I saw that there are some additional datasets in the project, such as [LVIS-Instruct-4V] and [LRV-Instruct], but I don't find them mentioned in the paper. Did you use these datasets in the fine-tuning stage?

I only used LLaVA v1.5 for fine-tuning in my ML-Mamba project. Although the source code supports pre-training with LVIS-Instruct-4V and LRV-Instruct datasets, I did not utilize these datasets during the fine-tuning stage.