MaverickRen / PixelLM

PixelLM is an effective and efficient LMM for pixel-level reasoning and understanding. PixelLM was accepted at CVPR 2024.
Apache License 2.0

image_features, pre_image_features = vision_tower(images, attention_mask=flatten_vit_attention_mask) #15

Open CauchyFanUpdate opened 2 months ago

CauchyFanUpdate commented 2 months ago

Thank you very much for your work. I have a question about image_features, pre_image_features = vision_tower(images, attention_mask=flatten_vit_attention_mask). What does this line do, and how does it differ from image_features = self.get_model().get_vision_tower()(images) in LLaVA? Additionally, what is the purpose of flatten_vit_attention_mask = torch.cat((torch.ones(flatten_vit_attention_mask.shape[0], 1).to(flatten_vit_attention_mask), flatten_vit_attention_mask), dim=-1)? I would appreciate a clarification.

MaverickRen commented 1 month ago

Because the input resolution of the visual encoder is variable, and I want the visual features processed by the LLM to have a fixed size (16x16), I apply a size transformation to the features produced by the visual encoder. This yields two variables, image_features and pre_image_features: one is used by the LLM for visual content understanding, and the other is passed to the decoder to generate a mask.
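For concreteness, here is a minimal sketch of what such a size transformation could look like. The function name, shapes, and the use of bilinear interpolation are illustrative assumptions, not the actual PixelLM code:

```python
import torch
import torch.nn.functional as F

def split_vision_features(vit_features: torch.Tensor, llm_grid: int = 16):
    """Hypothetical sketch: keep the full-resolution ViT patch features for the
    mask decoder and pool a fixed 16x16 grid of features for the LLM.

    vit_features: (B, H*W, C) patch tokens from the vision tower (CLS removed),
    where H and W vary with the input resolution.
    """
    b, n, c = vit_features.shape
    h = w = int(n ** 0.5)                      # assume a square patch grid
    pre_image_features = vit_features          # full resolution, for the decoder

    grid = vit_features.transpose(1, 2).reshape(b, c, h, w)
    grid = F.interpolate(grid, size=(llm_grid, llm_grid),
                         mode="bilinear", align_corners=False)
    image_features = grid.flatten(2).transpose(1, 2)   # (B, 256, C), for the LLM
    return image_features, pre_image_features
```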

MaverickRen commented 1 month ago

In LISA, the ViT visual encoder crops the image, i.e., it cuts off the long sides. PixelLM does not crop; instead, it resizes the image proportionally, which introduces padding. The ViT visual encoder therefore requires an attention mask, and in the same way, the LLM also needs the corresponding ViT attention mask when processing the visual features.
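To illustrate, here is a rough sketch of how such a ViT attention mask could be built from the padded image geometry. The function name, the patch size, and the interpretation of the prepended 1 as covering the ViT's CLS token are assumptions for illustration, not the actual PixelLM code:

```python
import torch

def build_vit_attention_mask(valid_h: int, valid_w: int,
                             padded_h: int, padded_w: int,
                             patch_size: int = 14) -> torch.Tensor:
    """Hypothetical sketch: mark which ViT patch tokens come from real image
    content versus padding after a proportional resize.

    The image is resized so its long side fits, then padded to
    (padded_h, padded_w); only patches inside (valid_h, valid_w) are attended to.
    """
    gh, gw = padded_h // patch_size, padded_w // patch_size
    vh, vw = -(-valid_h // patch_size), -(-valid_w // patch_size)  # ceil division

    mask = torch.zeros(gh, gw, dtype=torch.long)
    mask[:vh, :vw] = 1                          # 1 = real content, 0 = padding
    flat = mask.flatten().unsqueeze(0)          # (1, num_patches)

    # Assumed purpose of the torch.cat(...) in the question: prepend a 1 so the
    # mask also covers the CLS token the ViT adds, matching the token count.
    return torch.cat((torch.ones(flat.shape[0], 1, dtype=flat.dtype), flat), dim=-1)
```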