Thanks for your awesome work!
After configuring the environment according to the repository, I ran the script for clipdino336:
```shell
accelerate launch --num_processes=1 -m lmms_eval --model llava --model_args pretrained="checkpoint/llava_clipdino336_stage2",device_map="cuda" --tasks ok_vqa --batch_size 1 --log_samples --log_samples_suffix llava_clipdino336_stage2 --output_path ./logs/llava_clipdino336_stage2
```
but found that an error occurred:
```
[lmms_eval/models/llava.py:528] ERROR Error Sizes of tensors must match except in dimension 2. Expected size 576 but got size 256 for tensor number 1 in the list. in generating
```
After debugging, I found that the CLIP features have shape [1, 576, 1024] while the DINO features have shape [1, 256, 1024]. The two features cannot be concatenated because their spatial (token) dimensions differ. Is there an error in this part of the code, and could you provide the correct code?
```python
def encode_images(self, images):
    if type(images) is not list:
        image_features = self.get_model().get_vision_tower()(images)
        image_features = self.get_model().mm_projector(image_features)
    else:
        vision_tower = self.get_model().get_vision_tower()
        if type(vision_tower) is nn.ModuleList:
            f_list = []
            for i, v in enumerate(vision_tower):
                image_features = v(images[i])
                f_list.append(image_features)
            import pdb; pdb.set_trace()  # breakpoint I added while debugging
            image_features = torch.cat(f_list, dim=-1)
            image_features = self.get_model().mm_projector(image_features)
    return image_features
```
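For reference, `torch.cat(f_list, dim=-1)` requires every non-channel dimension to match, so the 576-token CLIP grid and the 256-token DINO grid cannot be fused as-is. One common workaround (I don't know whether it matches this repo's intended design) is to resample one tower's patch grid to the other's token count before the channel-wise concat. The sketch below uses a hypothetical helper `align_token_count`, which interpolates the 16×16 DINO grid up to CLIP's 24×24 grid:

```python
import torch
import torch.nn.functional as F

def align_token_count(feat, target_tokens):
    """Resample a [B, N, C] patch-feature map to target_tokens tokens
    by bilinear interpolation over the spatial grid (assumes N is square)."""
    b, n, c = feat.shape
    if n == target_tokens:
        return feat
    side = int(n ** 0.5)             # e.g. 256 tokens -> 16x16 grid
    tgt = int(target_tokens ** 0.5)  # e.g. 576 tokens -> 24x24 grid
    grid = feat.permute(0, 2, 1).reshape(b, c, side, side)
    grid = F.interpolate(grid, size=(tgt, tgt), mode="bilinear", align_corners=False)
    return grid.reshape(b, c, tgt * tgt).permute(0, 2, 1)

# Shapes from the error above: CLIP [1, 576, 1024], DINO [1, 256, 1024]
clip_feat = torch.randn(1, 576, 1024)
dino_feat = torch.randn(1, 256, 1024)
dino_feat = align_token_count(dino_feat, clip_feat.shape[1])  # -> [1, 576, 1024]
fused = torch.cat([clip_feat, dino_feat], dim=-1)             # -> [1, 576, 2048]
```

Note that the mm_projector would then need an input width of 2048, so a mismatch here may also indicate that the wrong checkpoint or vision-tower image resolution is being loaded (DINO producing 256 tokens suggests it ran at 224px rather than 336px).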