CauchyFanUpdate opened this issue 2 months ago · Open
Thank you very much for your work. I have a question about

```python
image_features, pre_image_features = vision_tower(images, attention_mask=flatten_vit_attention_mask)
```

What does this line do, and how does it differ from

```python
image_features = self.get_model().get_vision_tower()(images)
```

in LLaVA? Additionally, what is the purpose of

```python
flatten_vit_attention_mask = torch.cat(
    (torch.ones(flatten_vit_attention_mask.shape[0], 1).to(flatten_vit_attention_mask),
     flatten_vit_attention_mask),
    dim=-1,
)
```

I hope you can help clarify this for me.
Because the resolution of the visual encoder is variable, and I want the visual features processed by the LLM to have a fixed size (16x16), I apply a size transformation to the features produced by the visual encoder. This yields two variables, image_features and pre_image_features: one is used by the LLM for visual content understanding, and the other is passed to the decoder to generate the mask.

In LISA, the ViT visual encoder crops the image, i.e., it cuts off the long side. PixelLM does not crop; instead it resizes the image proportionally, which introduces padding, so the ViT visual encoder requires an attention mask. Likewise, the LLM needs the corresponding ViT attention mask when it processes the visual features.
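
Roughly, the size transformation works like the sketch below: the variable-resolution patch grid is interpolated down to a fixed 16x16 grid for the LLM, while the original features are kept for the decoder. The helper name `resample_patch_features`, the bilinear `F.interpolate`, and the example (24, 24) grid are illustrative only, not PixelLM's actual code:

```python
import torch
import torch.nn.functional as F

def resample_patch_features(patch_tokens: torch.Tensor,
                            grid_hw: tuple[int, int],
                            target: int = 16) -> torch.Tensor:
    """Resample a (B, N, C) patch-token sequence to a fixed target x target grid.

    grid_hw is the (H, W) patch grid of the variable-resolution encoder output.
    """
    b, n, c = patch_tokens.shape
    h, w = grid_hw
    assert h * w == n, "token count must match the patch grid"
    x = patch_tokens.transpose(1, 2).reshape(b, c, h, w)  # (B, C, H, W)
    x = F.interpolate(x, size=(target, target), mode="bilinear", align_corners=False)
    return x.flatten(2).transpose(1, 2)                   # (B, target*target, C)

# Hypothetical usage: encoder_output is a (B, 576, C) tensor from a 24x24 grid.
# pre_image_features = encoder_output                          # kept for the mask decoder
# image_features = resample_patch_features(encoder_output, grid_hw=(24, 24))  # fixed 16x16 for the LLM
```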
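
As for the `torch.cat` line you quoted: it prepends a column of ones to the flattened patch mask, presumably so that the class (CLS) token which CLIP's ViT places in front of the patch sequence is always attendable, while only the padded patch positions are masked out. A minimal sketch under that assumption (the helper `build_vit_attention_mask`, the patch size of 14, and the example shapes are hypothetical):

```python
import torch

def build_vit_attention_mask(valid_hw: tuple[int, int],
                             padded_hw: tuple[int, int],
                             patch: int = 14) -> torch.Tensor:
    """Mark ViT patch tokens as real image content (1) or padding (0).

    valid_hw: (H, W) of the proportionally resized image before padding.
    padded_hw: (H, W) of the square, padded input actually fed to the ViT.
    """
    gh, gw = padded_hw[0] // patch, padded_hw[1] // patch
    mask = torch.zeros(gh, gw)
    # Round up so a partially covered patch still counts as valid.
    vh = (valid_hw[0] + patch - 1) // patch
    vw = (valid_hw[1] + patch - 1) // patch
    mask[:vh, :vw] = 1
    return mask.flatten()  # (gh * gw,)

# Example: a 336x224 image padded to 336x336 -> (1, 576) flattened patch mask.
flatten_vit_attention_mask = torch.stack(
    [build_vit_attention_mask((336, 224), (336, 336))]
)

# Prepend one always-valid position for the CLS token, as in the quoted line:
flatten_vit_attention_mask = torch.cat(
    (torch.ones(flatten_vit_attention_mask.shape[0], 1).to(flatten_vit_attention_mask),
     flatten_vit_attention_mask),
    dim=-1,
)  # -> (1, 577): CLS slot + 576 patch slots
```

This also explains the difference from LLaVA's `image_features = self.get_model().get_vision_tower()(images)`: since LLaVA crops to a fixed square, every patch token is valid and no mask is needed, whereas PixelLM's padded inputs require passing the mask through both the ViT and the LLM.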