**[Open]** lntzm opened this issue 4 months ago
Thanks for your contribution. In the original LLaVA-UHD, `self.pos_embed` in `class Resampler` is frozen, though they mistakenly make it trainable again in `train.py` when they fine-tune the whole `mm_projector`. In your code, I notice that you make it trainable directly. I wonder whether it should be frozen, and how that affects performance.

Moreover, LLaVA-UHD adds a position embedding to `q` when computing attention. Is it necessary for your code to add a downsampled position embedding to `q`?
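For concreteness, here is a minimal sketch of the pitfall in PyTorch (the module and argument names are illustrative, not the exact LLaVA-UHD code): freezing the parameter at construction is silently undone if a training script later re-enables gradients for the whole projector.

```python
import torch
import torch.nn as nn

class Resampler(nn.Module):
    def __init__(self, num_queries: int, embed_dim: int):
        super().__init__()
        # Position embedding intended to stay fixed during fine-tuning.
        self.pos_embed = nn.Parameter(
            torch.zeros(num_queries, embed_dim), requires_grad=False
        )

resampler = Resampler(num_queries=64, embed_dim=1024)

# The mistake described above: unfreezing the whole mm_projector like this
# silently makes the "frozen" pos_embed trainable again.
for p in resampler.parameters():
    p.requires_grad = True

# To actually keep it frozen, re-disable its gradient after the blanket unfreeze:
resampler.pos_embed.requires_grad = False
```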
Regarding freezing `self.pos_embed`: I am currently running a model with this configuration, and if I verify it is a better design, I will update the code in this repo.

As for the second question, the role of `q` here is to 'downsample' `v`, so the resampler acts more like a learnable pooling strategy. That is why LLaVA-UHD adds a position embedding to `q`, and the position embedding of `k` is interpolated from that of `q`, providing a downsampling prior.
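If it helps, here is a minimal sketch of that design as I understand it (the shapes, helper names, and the choice of bicubic interpolation are my assumptions, not the actual LLaVA-UHD implementation): learnable queries pool the visual tokens via cross-attention, `q` carries its own position embedding, and the position embedding for `k` is interpolated up from `q`'s to match the input grid.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledResampler(nn.Module):
    """Cross-attention pooling: learnable queries 'downsample' the visual tokens."""

    def __init__(self, num_queries: int, embed_dim: int, num_heads: int = 8):
        super().__init__()
        self.grid = math.isqrt(num_queries)  # e.g. 64 queries -> an 8x8 grid
        self.query = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        # Position embedding for q, defined on the (small) query grid.
        self.pos_embed = nn.Parameter(torch.randn(num_queries, embed_dim) * 0.02)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def k_pos_embed(self, hw: int) -> torch.Tensor:
        # Interpolate q's position embedding up to the input grid (hw x hw),
        # so k's embedding is derived from q's: the "downsampling prior".
        pe = self.pos_embed.reshape(1, self.grid, self.grid, -1).permute(0, 3, 1, 2)
        pe = F.interpolate(pe, size=(hw, hw), mode="bicubic", align_corners=False)
        return pe.flatten(2).permute(0, 2, 1)  # (1, hw*hw, dim)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (batch, num_patches, dim); num_patches assumed to form a square grid
        b, n, _ = v.shape
        hw = math.isqrt(n)
        q = (self.query + self.pos_embed).unsqueeze(0).expand(b, -1, -1)
        k = v + self.k_pos_embed(hw)  # position embedding added to keys only
        out, _ = self.attn(q, k, v)   # values stay position-free
        return out                    # (batch, num_queries, dim)

# Example: 576 patch tokens (24x24) pooled down to 64 tokens (8x8).
# out = PooledResampler(num_queries=64, embed_dim=1024)(torch.randn(2, 576, 1024))
```

Adding the interpolated embedding to `k` but not to `v` keeps the pooled outputs position-free while still telling each query which region of the input grid it should summarize.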