bronyayang / Law_of_Vision_Representation_in_MLLMs

Official implementation of the Law of Vision Representation in MLLMs
https://arxiv.org/abs/2408.16357

About the evaluation with clipdino336 #7

Open XpracticeYSKM opened 5 hours ago

XpracticeYSKM commented 5 hours ago

Thanks for your awesome work! After setting up the environment according to the repository, I ran the following script for clipdino336:

    accelerate launch --num_processes=1 -m lmms_eval --model llava --model_args pretrained="checkpoint/llava_clipdino336_stage2",device_map="cuda" --tasks ok_vqa --batch_size 1 --log_samples --log_samples_suffix llava_clipdino336_stage2 --output_path ./logs/llava_clipdino336_stage2

but it fails with the following error:

    [lmms_eval/models/llava.py:528] ERROR Error Sizes of tensors must match except in dimension 2. Expected size 576 but got size 256 for tensor number 1 in the list. in generating

After debugging, I found that the CLIP features have shape [1, 576, 1024] while the DINO features have shape [1, 256, 1024], so the two cannot be concatenated because their spatial (token) dimensions differ. Is there an error in this part of the code, and could you provide the correct code? (A minimal reproduction of the mismatch is sketched after the snippet.)

    def encode_images(self, images):
        if type(images) is not list:
            # Single-tower path: encode once and project.
            image_features = self.get_model().get_vision_tower()(images)
            image_features = self.get_model().mm_projector(image_features)
        else:
            # Multi-tower path (e.g. clipdino336): encode with each tower separately.
            vision_tower = self.get_model().get_vision_tower()
            if type(vision_tower) is nn.ModuleList:
                f_list = []
                for i, v in enumerate(vision_tower):
                    image_features = v(images[i])
                    f_list.append(image_features)
                import pdb; pdb.set_trace()  # breakpoint I added for debugging
                # Channel-wise concat: every tower must return the same number of
                # tokens (dim 1), which fails here (576 vs. 256).
                image_features = torch.cat(f_list, dim=-1)
                image_features = self.get_model().mm_projector(image_features)
        return image_features
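
For reference, here is a minimal standalone sketch of the mismatch and of one possible workaround I was considering: reshaping the DINO tokens back to a 2D grid and bilinearly resizing it to CLIP's 24x24 grid before the channel-wise concat. The shapes are the ones I observed above; `resize_tokens` is only an illustrative helper I wrote, not code from this repo, so the intended fix may well be different (e.g. feeding the DINO tower a different input resolution).

    import torch
    import torch.nn.functional as F

    # Shapes observed while debugging: CLIP-336 yields a 24x24 = 576 token grid,
    # while the DINO tower here yields 16x16 = 256 tokens.
    clip_feat = torch.randn(1, 576, 1024)
    dino_feat = torch.randn(1, 256, 1024)

    # This mirrors the concat in encode_images and raises the reported error,
    # because dim=-1 concatenation needs the token dimension (dim 1) to match:
    # torch.cat([clip_feat, dino_feat], dim=-1)

    def resize_tokens(feat, target_hw):
        """Illustrative helper (not from this repo): resize a square token grid."""
        b, n, c = feat.shape
        hw = int(n ** 0.5)  # assumes a square grid of tokens
        grid = feat.transpose(1, 2).reshape(b, c, hw, hw)
        grid = F.interpolate(grid, size=target_hw, mode="bilinear", align_corners=False)
        return grid.flatten(2).transpose(1, 2)  # back to (b, target_h*target_w, c)

    dino_resized = resize_tokens(dino_feat, target_hw=(24, 24))
    fused = torch.cat([clip_feat, dino_resized], dim=-1)
    print(fused.shape)  # torch.Size([1, 576, 2048])

Is something like this the intended way to align the two feature maps, or does the released checkpoint expect the DINO tower to already output 576 tokens?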