In your paper, after extracting features from the Visual Prompt Encoder and the Image Encoder, the model computes the similarity between the encoder features and the prompt embeddings. Are the prompt embeddings here the V that is computed in the Visual Prompt Encoder? In the Visual Prompt Encoder section of the paper you wrote V = FFN(SelfAttn(q′))[-1]. Is V equal to C′ + B′? And what is its size?