From your reply, I understand that T-rex2 first trains with text prompts only, and then trains both the text and visual prompts in successive iterations.
So when training the text and visual prompts together in those later iterations, do the text prompt encoder and the decoder use a smaller learning rate, while the visual prompt encoder uses a larger one?
If so, what are the specific values?
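To make the question concrete: the setup I have in mind is per-parameter-group learning rates, as sketched below in PyTorch. The module names and learning-rate values here are my own assumptions for illustration, not T-rex2's actual configuration.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the components discussed above;
# the real T-rex2 modules are larger, these are placeholders.
text_prompt_encoder = nn.Linear(8, 8)
decoder = nn.Linear(8, 8)
visual_prompt_encoder = nn.Linear(8, 8)

# One optimizer, two parameter groups: the already-trained text branch
# gets a smaller learning rate, the visual prompt encoder a larger one.
# Both values (1e-5, 1e-4) are assumed, not taken from the paper.
optimizer = torch.optim.AdamW(
    [
        {
            "params": list(text_prompt_encoder.parameters())
            + list(decoder.parameters()),
            "lr": 1e-5,  # smaller LR for previously trained modules (assumed)
        },
        {
            "params": visual_prompt_encoder.parameters(),
            "lr": 1e-4,  # larger LR for the visual prompt encoder (assumed)
        },
    ]
)

for group in optimizer.param_groups:
    print(group["lr"])
```

Is this roughly the scheme used, and if so, what learning rates did you set for each group?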