ParadoxZW / LLaVA-UHD-Better

A bug-free and improved implementation of LLaVA-UHD, based on the code from the official repo
Apache License 2.0

Position Embedding in class Resampler #5

Open lntzm opened 3 months ago

lntzm commented 3 months ago

Thanks for your contribution. In the original LLaVA-UHD, self.pos_embed in class Resampler is frozen, though they mistakenly make it trainable again in train.py when they fine-tune the whole mm_projector. In your code, I notice that you directly make it trainable. I wonder whether it should be frozen, and how that affects performance.
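The pitfall described above can be illustrated with a minimal sketch (the class and attribute names here are hypothetical stand-ins, not the actual LLaVA-UHD code): a blanket `requires_grad_(True)` pass over a whole module, as a training script might do for the mm_projector, silently undoes a freeze set in `__init__`.

```python
import torch
import torch.nn as nn

# Hypothetical minimal resampler; `query` and `pos_embed` are illustrative.
class Resampler(nn.Module):
    def __init__(self, num_queries=64, dim=1024):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(num_queries, dim))
        self.pos_embed = nn.Parameter(torch.zeros(576, dim))
        self.pos_embed.requires_grad_(False)  # intended: frozen

resampler = Resampler()

# The pitfall: unfreezing the whole module re-enables pos_embed too.
for p in resampler.parameters():
    p.requires_grad_(True)
assert resampler.pos_embed.requires_grad  # the freeze was silently undone

# One way to keep it frozen: re-freeze by name after the blanket pass.
for name, p in resampler.named_parameters():
    if name == "pos_embed":
        p.requires_grad_(False)
assert not resampler.pos_embed.requires_grad
```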

Moreover, LLaVA-UHD adds position embedding for q when calculating attention. Is it necessary for your code to add downsampled position embedding for q?

ParadoxZW commented 3 months ago
  1. I did consider freezing self.pos_embed. I am currently running a model with this configuration. If I verify it is a better design, I will update the code in this repo.
  2. q is a learnable query. It is essentially a set of model weights, so I see no reason to do an additional calculation on q before it attends to the image features.
lntzm commented 3 months ago
  1. Looking forward to your results.
  2. I guess q here is meant to 'downsample' v, so it acts more like a learnable pooling strategy. That is why LLaVA-UHD adds a position embedding to q, and the position embedding of k is interpolated from that of q, providing a downsampling prior.
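The "learnable pooling" view above can be sketched as follows. This is an illustrative toy, not the actual LLaVA-UHD code: shapes, names, and the interpolation direction (k's positions interpolated from q's grid, per this comment's description) are assumptions, and random tensors stand in for real 2D sinusoidal embeddings.

```python
import torch
import torch.nn.functional as F

def make_pos_embeds(dim=64, q_side=8, kv_side=24):
    # Stand-in for a 2D position embedding on the query grid.
    q_pos = torch.randn(q_side, q_side, dim)
    # k's positions are interpolated from q's grid up to the image-feature
    # grid, tying each query to a region of features (the downsample prior).
    k_pos = F.interpolate(
        q_pos.permute(2, 0, 1).unsqueeze(0),              # (1, dim, q, q)
        size=(kv_side, kv_side), mode="bicubic", align_corners=False,
    ).squeeze(0).permute(1, 2, 0)                         # (kv, kv, dim)
    return q_pos.flatten(0, 1), k_pos.flatten(0, 1)       # (q*q, dim), (kv*kv, dim)

def resample(query, feats, q_pos, k_pos):
    # query: (Nq, dim) learnable queries; feats: (Nk, dim) image features.
    q = query + q_pos   # position embedding added to q, as discussed above
    k = feats + k_pos   # interpolated position embedding on k
    attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
    return attn @ feats  # v carries no position embedding

q_pos, k_pos = make_pos_embeds()
out = resample(torch.randn(64, 64), torch.randn(576, 64), q_pos, k_pos)
assert out.shape == (64, 64)  # 576 features pooled down to 64 tokens
```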