Hi! In your code, the function "concat_all_gather" is only used in the enqueue_and_dequeue function, and it is processed on the CPU. I have 3 questions: 1) Why are the prototypes not gathered with this function? I think the prototypes should remain the same across GPUs. 2) When do we need to use this function? 3) Why are the keys detached and transferred to the CPU before gathering (code link is here)? Thanks!
Take the vanilla cross-entropy loss as an example: when computing the objective across GPUs, there is no need for the segmentation logits to be identical on every GPU. For the same reason, the prototypes need not be identical across GPUs, and neither do the queries and their negative keys.
The queue that stores all negative keys, however, should be identical across GPUs, which is why this function is needed there.
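For reference, here is a minimal sketch of a MoCo-style `concat_all_gather` (the exact code in this repo may differ): it gathers the same-shaped tensor from every process and concatenates along the batch dimension, so every GPU ends up holding the keys produced by all GPUs.

```python
import torch
import torch.distributed as dist

@torch.no_grad()
def concat_all_gather(tensor):
    """Gather a tensor from every GPU and concatenate along dim 0.

    Note: all_gather does not propagate gradients, so the output is
    detached from the computation graph by construction.
    """
    tensors_gather = [torch.ones_like(tensor) for _ in range(dist.get_world_size())]
    dist.all_gather(tensors_gather, tensor, async_op=False)
    return torch.cat(tensors_gather, dim=0)
```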
All negative keys are expected to be detached from the computation graph, since the queue is only a memory bank of negatives and no gradients should flow back through it.
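A hedged sketch of how the detach fits into a MoCo-style ring-buffer update, assuming `self.queue` is a `(dim, K)` buffer and `self.queue_ptr` a one-element long tensor (these names are illustrative, not necessarily the ones used in this repo):

```python
@torch.no_grad()
def _dequeue_and_enqueue(self, keys):
    # detach stops gradient flow; gathering afterwards gives every
    # GPU an identical view of the new negative keys
    keys = concat_all_gather(keys.detach())
    batch_size = keys.shape[0]

    ptr = int(self.queue_ptr)
    assert self.K % batch_size == 0  # assume queue size K divides evenly

    # overwrite the oldest entries in the ring buffer with the new keys
    self.queue[:, ptr:ptr + batch_size] = keys.T
    self.queue_ptr[0] = (ptr + batch_size) % self.K
```

Because every GPU enqueues the same gathered keys at the same pointer, the queues stay synchronized without any extra communication.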