Closed by hengseuer 1 week ago
Hi @hengseuer Yes. The text prompt branch needs more data and a longer training time to converge, so we train the text prompt first.
Thank you for your response.
I have another question: When training text and visual prompts simultaneously, do the negative samples for the visual prompts come from the image itself, the current batch, or is there a maintained pool of negative samples?
The negative samples for visual prompts are sampled from the current mini-batch.
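To make the idea concrete, here is a minimal sketch of in-batch negative sampling. This is not the T-Rex2 implementation; the function name, label format, and `num_negatives` parameter are illustrative assumptions. The point is simply that negatives for one sample are drawn from the other samples in the same mini-batch, skipping any that share the anchor's category.

```python
import random

def sample_negatives_in_batch(batch_labels, anchor_idx, num_negatives=4, seed=0):
    """Hypothetical sketch: pick negative candidates for the sample at
    `anchor_idx` from the *other* entries of the current mini-batch,
    excluding entries whose category matches the anchor's."""
    rng = random.Random(seed)
    anchor_label = batch_labels[anchor_idx]
    candidates = [
        (i, lbl)
        for i, lbl in enumerate(batch_labels)
        if i != anchor_idx and lbl != anchor_label
    ]
    k = min(num_negatives, len(candidates))
    return rng.sample(candidates, k)

# Example: a mini-batch of 6 samples with category labels.
labels = ["cat", "dog", "cat", "car", "bus", "dog"]
negs = sample_negatives_in_batch(labels, anchor_idx=0, num_negatives=3)
# No negative shares the anchor category "cat".
assert all(lbl != "cat" for _, lbl in negs)
```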
Thanks a lot.
Are all the samples in the current mini-batch from the same dataset?
If, during the current iteration, all the samples across the GPUs are from the same dataset and we sample negative examples from within the entire batch, similar to the approach used in DINOv, would this result in better performance?
Our implementation only samples negative prompts from the mini-batch on the current GPU. Using the sampling strategy from DINOv might bring additional performance gains.
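The difference between the two strategies can be sketched as follows. This is an assumed toy setup (the function and variable names are hypothetical, and the cross-GPU gather is simulated with plain lists rather than a real `all_gather`): with a local pool, negatives come only from the current GPU's mini-batch; with a DINOv-style cross-GPU pool, batches from all GPUs are gathered, giving a larger set of negative candidates.

```python
def negative_pool(per_gpu_batches, gpu_rank, cross_gpu=False):
    """Sketch: each GPU holds its own mini-batch of category labels.
    cross_gpu=False -> negatives come only from the local batch
                       (as in the implementation described above).
    cross_gpu=True  -> batches from all GPUs are pooled, simulating a
                       DINOv-style cross-GPU negative pool."""
    if cross_gpu:
        return [lbl for batch in per_gpu_batches for lbl in batch]
    return list(per_gpu_batches[gpu_rank])

# Three GPUs, each with a mini-batch of 2 samples.
batches = [["cat", "dog"], ["car", "bus"], ["person", "bike"]]
local_pool = negative_pool(batches, gpu_rank=0)                  # 2 candidates
global_pool = negative_pool(batches, gpu_rank=0, cross_gpu=True) # 6 candidates
assert len(global_pool) > len(local_pool)
```

Whether the larger pool translates into better accuracy would still need to be verified empirically, as noted above.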
Got it. Thanks.
Hello,
I have a question about the training process of T-Rex2. Does T-Rex2 first train the text prompts and then train both the text and visual prompts in subsequent iterations?
Thank you!