.cpu() operation as bottleneck in training

The .cpu() call used in Hungarian matching, which moves the cost tensor to the CPU for linear_sum_assignment, takes a significant amount of time compared to the entire forward pass. Is there a recommended way of using it that reduces this time? I am using Hungarian matching in one of my projects, and the .cpu() call has significantly increased the training time.
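One thing that may be worth checking before optimizing: CUDA kernels are launched asynchronously, so .cpu() has to wait for all queued GPU work to finish before it can copy anything. A timer placed around .cpu() alone can therefore end up charging part of the forward pass to the copy. The sketch below separates the three costs. It is only a rough illustration, not code from the repo: `timed_match`, `_sync`, and the 100 x 20 cost matrix are made-up names and sizes, and it falls back to the CPU when no GPU is available.

```python
import time

import torch
from scipy.optimize import linear_sum_assignment


def _sync(device: torch.device) -> None:
    # CUDA kernels run asynchronously; wait for queued work so timers are accurate.
    if device.type == "cuda":
        torch.cuda.synchronize(device)


def timed_match(cost: torch.Tensor):
    """Hypothetical helper: time pending GPU work, the copy, and the solver separately."""
    device = cost.device

    t0 = time.perf_counter()
    _sync(device)                    # drains kernels already queued (e.g. the forward pass)
    t1 = time.perf_counter()
    cost_cpu = cost.detach().cpu()   # after the sync, this measures only the device-to-host copy
    t2 = time.perf_counter()
    rows, cols = linear_sum_assignment(cost_cpu.numpy())
    t3 = time.perf_counter()

    timings = {"pending_gpu_work": t1 - t0, "copy": t2 - t1, "solver": t3 - t2}
    return rows, cols, timings


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
cost = torch.rand(100, 20, device=device)  # assumed shape: 100 queries x 20 targets
rows, cols, timings = timed_match(cost)
print(timings)
```

If most of the time lands in `pending_gpu_work`, the copy itself is probably not the bottleneck and the matcher is simply where the GPU queue gets drained. If `solver` dominates, the cost is in the assignment step on the CPU rather than in the transfer.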

facebookresearch / detr

.cpu() operation as bottleneck in training #562