deeptibhegde / CLIP-goes-3D

Official code release of "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition"
https://jeya-maria-jose.github.io/cg3d-web/

About the Visual Prompt #8

Open Coobiw opened 1 year ago

Coobiw commented 1 year ago

Hello, thanks for your great work. I have some questions about the visual prompts, especially the modifications to timm. Firstly, I found that you have commented out the code below:

[screenshot of the commented-out code]

So, does this code work now? What is the current implementation of the visual prompt?

Also, I would like to know where the visual prompt is added in the ViT. The code below shows that you concatenate the [cls] token, the visual prompt, and the image patch tokens along the sequence-length dimension, is that right?

So is self-attention applied over the learnable visual prompt and the image patch tokens only at the input layer? Or at every layer except the input layer, which is what the commented-out code does? (See the sketch after the screenshots below.)

The former:

[screenshot: prompts inserted only at the input layer]

The latter:

[screenshots: the commented-out per-layer code]
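To make "the former" concrete, here is a rough, illustrative PyTorch sketch (not the repo's code; parameter names such as `prompt_tokens` are placeholders I chose): the prompts are concatenated with the [cls] token and patch tokens once at the input, and every block then attends over that same sequence.

```python
import torch
import torch.nn as nn

class ShallowPromptViT(nn.Module):
    """'The former': prompts are concatenated with [cls] and patch tokens once,
    at the input, and all blocks then attend over the same sequence."""
    def __init__(self, embed_dim=768, num_prompts=8, depth=12, num_heads=12):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # learnable visual prompt tokens (placeholder name)
        self.prompt_tokens = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patch_tokens):                     # patch_tokens: (B, N, D)
        B = patch_tokens.shape[0]
        cls = self.cls_token.expand(B, -1, -1)
        prompts = self.prompt_tokens.expand(B, -1, -1)
        # concatenate [cls] + prompts + patches along the sequence-length dim
        x = torch.cat([cls, prompts, patch_tokens], dim=1)
        return self.blocks(x)                            # prompts enter only here
```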

Thanks in advance for your answer!

deeptibhegde commented 1 year ago

Hi, I believe you have an older version of the model files in which I had commented out some lines for testing. Please re-download the package from the link and let me know if you have any issues. The visual prompts are added to the input of every transformer encoder block of the model, i.e. VisionTransformerPromptDeep.
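For comparison with the sketch above, here is a rough sketch of the deep variant described in this answer, i.e. a fresh set of learnable prompts inserted at the input of every transformer block (again illustrative only, in the style of VPT-Deep, not the actual VisionTransformerPromptDeep code; all names are placeholders):

```python
import torch
import torch.nn as nn

class DeepPromptEncoder(nn.Module):
    """Deep prompting: a per-block set of learnable prompt tokens is
    inserted at the input of every transformer block."""
    def __init__(self, embed_dim=768, num_prompts=8, depth=12, num_heads=12):
        super().__init__()
        self.num_prompts = num_prompts
        # one prompt set per block (placeholder name)
        self.deep_prompts = nn.Parameter(torch.zeros(depth, num_prompts, embed_dim))
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
            for _ in range(depth)
        )

    def forward(self, x):                      # x: (B, 1 + N, D) = [cls] + patches
        B = x.shape[0]
        for i, blk in enumerate(self.blocks):
            prompts = self.deep_prompts[i].expand(B, -1, -1)
            if i == 0:
                # first block: insert prompts between [cls] and the patch tokens
                x = torch.cat([x[:, :1], prompts, x[:, 1:]], dim=1)
            else:
                # later blocks: drop the previous prompts' outputs and
                # insert this block's own prompts in their place
                x = torch.cat([x[:, :1], prompts, x[:, 1 + self.num_prompts:]], dim=1)
            x = blk(x)
        return x
```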

Coobiw commented 1 year ago

Thanks! I will update it! Do you mean that the visual prompts are added to every transformer block?

deeptibhegde commented 1 year ago

Yes

Coobiw commented 1 year ago

Thanks for your answer! I now understand the visual prompt operations. Additionally, if convenient, I would like to ask about the performance difference between VisionTransformerPromptDeep and VisionTransformerPrompt, i.e. whether adding the prompts only to layer_0 versus to every layer leads to a noticeable gap?