The memory reduction factor of the cost matrix is sum(#target objects) / max(#target objects).
This is achieved by no longer computing and storing matching costs between predictions and targets at different positions inside the batch. More precisely, the original matrix of shape [batch_size * queries, sum(#target objects)] is shrunk to a tensor of shape [batch_size, queries, max(#target objects)].
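The shape change and the resulting memory reduction factor can be sketched as follows. This is an illustrative calculation only (the hypothetical helper `padded_cost_shapes` is not part of this change, and NumPy-style shapes stand in for the actual PyTorch tensors):

```python
def padded_cost_shapes(num_targets_per_image, queries):
    """Compare the old flattened cost matrix with the new padded per-image tensor.

    num_targets_per_image: number of target objects for each image in the batch.
    Returns (old_shape, new_shape, measured_reduction, predicted_reduction).
    """
    batch_size = len(num_targets_per_image)
    total = sum(num_targets_per_image)
    largest = max(num_targets_per_image)
    # Old layout: every prediction is matched against every target in the batch.
    old_shape = (batch_size * queries, total)
    # New layout: each image only matches against its own targets, padded to the max.
    new_shape = (batch_size, queries, largest)
    measured = (old_shape[0] * old_shape[1]) / (new_shape[0] * new_shape[1] * new_shape[2])
    predicted = total / largest  # the stated reduction factor
    return old_shape, new_shape, measured, predicted

# Example: a batch of 4 tables with varying object counts, 125 queries as in TATR.
old, new, measured, predicted = padded_cost_shapes([10, 40, 25, 5], queries=125)
print(old)        # (500, 80)
print(new)        # (4, 125, 40)
print(measured)   # 2.0
print(predicted)  # 2.0
```

The measured element-count ratio equals sum(#target objects) / max(#target objects), since batch_size * queries cancels out; the more uneven the per-image object counts, the larger the saving.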
Besides allowing much larger batch sizes, this change also yields speedups. Tested on the table structure recognition task using the Table Transformer (TATR) (125 queries, 7 classes) with PubMed data, it results in a) a small but meaningful speedup on CUDA at all batch sizes and on CPU at small batch sizes, and b) much higher speedups on CPU at larger batch sizes.
The processing time reduction, computed as (1 - new_time / old_time), is shown below for various configurations: