IDEA-Research / T-Rex

[ECCV2024] API code for T-Rex2: Towards Generic Object Detection via Text-Visual Prompt Synergy
https://deepdataspace.com/blog/T-Rex

How can the Visual Prompt Encoder handle a large number of classes in a batch? #89

Open nijkah opened 2 weeks ago

nijkah commented 2 weeks ago

Hello, thank you for your work and for sharing the code.

After reading your paper, I have some questions regarding the implementation of the visual prompt encoder.

From previous issues, I understand that a visual prompt is generated per class for each image in a batch. This seems to imply that MSDeformAttentionLayer must run multiple times, in proportion to the number of classes per image and, in turn, across the entire batch.

For example, in a batch of two images, image 1 with classes (a, b, c) and image 2 with classes (d, e, f), we would need to execute MSDeformAttentionLayer a total of 2 (images) * 6 (prompts, one per class) = 12 times.

When the batch size or the number of classes per batch grows, this approach seems to demand a substantial amount of memory, which could make it hard to scale. Alternatively, the prompts could be processed sequentially with a for loop, but that would be slow.
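
To make the concern concrete, here is a minimal sketch of the two strategies I have in mind. A plain `nn.MultiheadAttention` stands in for the real `MSDeformAttentionLayer` (which also takes multi-scale features, reference points, etc.), and all shapes and variable names are assumptions for illustration only, not the actual T-Rex2 implementation:

```python
# Sketch of the two execution strategies described above.
# nn.MultiheadAttention is only a stand-in for MSDeformAttentionLayer;
# shapes and names (num_classes_per_image, prompt_queries, ...) are assumptions.
import torch
import torch.nn as nn

batch_size, num_tokens, dim = 2, 100, 256
num_classes_per_image = [3, 3]          # image 1: classes a, b, c; image 2: classes d, e, f

attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)   # stand-in layer
image_feats = torch.randn(batch_size, num_tokens, dim)             # flattened image features

# One prompt query per (image, class) pair.
prompt_queries = [torch.randn(n, 1, dim) for n in num_classes_per_image]

# Strategy 1: explicit for-loop, one attention call per (image, class) pair.
# Low peak memory, but many small sequential calls.
loop_out = []
for img_idx, queries in enumerate(prompt_queries):
    feats = image_feats[img_idx:img_idx + 1]                       # (1, num_tokens, dim)
    for q in queries:                                              # q: (1, dim)
        out, _ = attn(q.unsqueeze(0), feats, feats)                # (1, 1, dim)
        loop_out.append(out)
loop_out = torch.cat(loop_out, dim=0)                              # (total_prompts, 1, dim)

# Strategy 2: fold all prompts into the batch dimension, one attention call,
# repeating each image's features once per class. Fast, but memory grows with
# sum(num_classes_per_image).
repeat_counts = torch.tensor(num_classes_per_image)
feats_rep = image_feats.repeat_interleave(repeat_counts, dim=0)    # (total_prompts, num_tokens, dim)
queries_flat = torch.cat(prompt_queries, dim=0)                    # (total_prompts, 1, dim)
batched_out, _ = attn(queries_flat, feats_rep, feats_rep)          # (total_prompts, 1, dim)

print(torch.allclose(loop_out, batched_out, atol=1e-5))            # same result, different cost profile
```

Both strategies give the same output; the repeated image features in the second one are exactly where I expect the memory blow-up to come from as the number of classes grows.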

Could you clarify how you handle this in practice?

Thank you!