Gasso21 opened 7 months ago
Hello, I have a question while reading your paper. In the paper, you mention the use of a Query Transformer and Learnable Latent Vectors. Upon closer examination, it appears that the Query Transformer's weights consist of the learnable latent vectors plus a (Perceiver Attention, Feed-Forward) pair repeated 4 times.
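For context, my understanding of that structure is roughly the following sketch. All names and sizes here (`QueryTransformer`, `num_latents`, `depth=4`, `dim`) are my own guesses for illustration and not necessarily your implementation:

```python
import torch
import torch.nn as nn

class QueryTransformer(nn.Module):
    """Sketch of a Perceiver-style query transformer: learnable latent
    queries cross-attend to the CLIP image tokens, with the
    (attention, feed-forward) pair stacked 4 times, matching the
    repeated weights described above. Names/sizes are assumptions."""
    def __init__(self, dim=1024, depth=4, num_latents=16, heads=8):
        super().__init__()
        # the "Learnable Latent Vectors", trained jointly with the blocks below
        self.latents = nn.Parameter(torch.randn(num_latents, dim) / dim**0.5)
        self.layers = nn.ModuleList()
        for _ in range(depth):  # Perceiver Attn + FF, repeated 4 times
            self.layers.append(nn.ModuleList([
                nn.MultiheadAttention(dim, heads, batch_first=True),
                nn.Sequential(nn.LayerNorm(dim),
                              nn.Linear(dim, 4 * dim),
                              nn.GELU(),
                              nn.Linear(4 * dim, dim)),
            ]))

    def forward(self, image_tokens):  # (B, N, dim) tokens from the CLIP encoder
        b = image_tokens.size(0)
        x = self.latents.unsqueeze(0).expand(b, -1, -1)
        for attn, ff in self.layers:
            # latent queries attend to the image tokens (cross-attention)
            x = x + attn(x, image_tokens, image_tokens, need_weights=False)[0]
            x = x + ff(x)
        return x  # (B, num_latents, dim) conditioning tokens
```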
My two assumptions are:
- The Query Transformer was trained separately, to enhance the image's details after the image passes through FrozenOpenCLIPImageEmbedderV2 and then through image_proj_stage_config.
- It was fine-tuned jointly with the Spatial Attn layers (with the Temp Attn layers frozen), without any separate training stage (see the freezing sketch after my questions below).
If the first assumption is correct, I would like to know how you calculated the loss against the input image. If the second is correct, I am interested in how you conducted that joint training. Could you please provide more detail on how the Query Transformer's weights were trained?
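For the second assumption, I imagine the trainable-parameter selection would look roughly like the sketch below. The module-name substrings ("temporal", "spatial", "image_proj") are assumptions on my part; the actual names in your repo may differ:

```python
import torch

def select_trainable_params(model: torch.nn.Module):
    """Hypothetical sketch: freeze temporal attention, fine-tune spatial
    attention and the query-transformer (image projection) weights."""
    trainable = []
    for name, param in model.named_parameters():
        if "temporal" in name:
            param.requires_grad = False  # freeze Temp Attn layers
        elif "spatial" in name or "image_proj" in name:
            param.requires_grad = True   # fine-tune Spatial Attn + Query Transformer
            trainable.append(param)
        else:
            param.requires_grad = False  # keep the rest of the network frozen
    return trainable

# optimizer = torch.optim.AdamW(select_trainable_params(model), lr=1e-5)
```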
Hi, can you see where the Query Transformer is implemented in the code? I couldn't find it.