facebookresearch / detr

End-to-End Object Detection with Transformers
Apache License 2.0

Reduce the space complexity of the HungarianMatcher module. #606

Open aioaneid opened 9 months ago

aioaneid commented 9 months ago

The memory reduction factor of the cost matrix is sum(#target objects) / max(#target objects).

That is achieved by no longer computing and storing matching costs between predictions and targets at different positions inside the batch. More precisely, the original matrix of shape [batch_size * queries, sum(#target objects)] is shrunk to a tensor of shape [batch_size, queries, max(#target objects)].
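A minimal sketch of the padded layout described above, using numpy and scipy instead of PyTorch for brevity; the batch sizes, query count, and cost values are hypothetical, not from TATR:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
batch_size, queries = 4, 6
num_targets = [2, 5, 1, 3]  # hypothetical #target objects per image
max_t, sum_t = max(num_targets), sum(num_targets)

# Original layout: one dense matrix [batch_size * queries, sum_t], most of
# which holds cross-image prediction/target costs that are never matched.
# Padded layout: [batch_size, queries, max_t], giving the memory reduction
# factor sum_t / max_t mentioned above (here 11 / 5).
cost = np.full((batch_size, queries, max_t), np.inf)
for b, n in enumerate(num_targets):
    # Real pairwise costs only for this image's own targets.
    cost[b, :, :n] = rng.random((queries, n))

# Hungarian matching runs per image on the valid slice, so the inf
# padding never enters the assignment problem.
indices = []
for b, n in enumerate(num_targets):
    row, col = linear_sum_assignment(cost[b, :, :n])
    indices.append((row, col))

reduction_factor = sum_t / max_t  # memory shrinks by this factor
```

The per-image matching results are identical to the original implementation's, since the dropped cross-image entries were never candidates for assignment.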

Besides allowing much larger batch sizes, this change also speeds up processing. Tested on the table structure recognition task using the Table Transformer (TATR) (125 queries, 7 classes) with PubMed data, it yields: a) a small but meaningful speedup on CUDA at all batch sizes and on CPU at small batch sizes, and b) much larger speedups on CPU at larger batch sizes.

The processing time reduction, computed as (1 - new_time / old_time), is shown below for various configurations (CUDA figures were not reported beyond batch size 7):

| Batch size | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 16 | 32 | 64 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CUDA | 8.2% | 1.6% | 1.6% | 0.9% | 0.8% | 0.9% | 0.9% | – | – | – | – |
| CPU | 1.6% | 9.3% | 7.7% | 11.2% | 13.9% | 15.5% | 23.1% | 47.1% | 70.6% | 88.3% | 95.0% |