Significant errors in confidence scores after TensorRT conversion

TL;DR: After TensorRT model conversion, confidence scores deviate from the original model's output much more than bounding box values do. What property of TensorRT or the YOLO architecture causes this?

I use tkDNN for converting yolov4 and yolov4-csp models. The detections using the converted TensorRT engines look very similar visually to the original darknet detections, but on closer inspection, there are some noticeable differences.

Bounding box values before and after conversion are nearly identical, with only a few outliers (< 0.1% of values) shifting by more than 0.02. In contrast, confidence scores before and after conversion can differ significantly, by as much as 0.5! Of course, the converted network won't reproduce the original outputs exactly, but it seems surprising that bounding box values are highly consistent post-conversion while confidence values are occasionally off by a wide margin.
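For anyone who wants to reproduce the comparison, here is a minimal sketch of how these fractions could be computed. It assumes the darknet and TensorRT detections have already been matched one-to-one and are stored as `(x, y, w, h, confidence)` tuples; that layout, the function names, and the sample numbers are made up for illustration and are not part of tkDNN or darknet.

```python
from typing import Sequence, Tuple

# Assumed layout for illustration: (x, y, w, h, confidence).
Detection = Tuple[float, float, float, float, float]


def fraction_shifted(before: Sequence[float], after: Sequence[float],
                     threshold: float) -> float:
    """Fraction of paired values whose absolute difference exceeds threshold."""
    if len(before) != len(after):
        raise ValueError("before and after must be paired one-to-one")
    if not before:
        return 0.0
    shifted = sum(1 for b, a in zip(before, after) if abs(a - b) > threshold)
    return shifted / len(before)


def compare_detections(darknet: Sequence[Detection],
                       tensorrt: Sequence[Detection],
                       box_threshold: float = 0.02,
                       conf_threshold: float = 0.05) -> Tuple[float, float]:
    """Return (fraction of box values shifted, fraction of confidences shifted)."""
    # Flatten the four box coordinates of every detection into one list.
    box_before = [v for det in darknet for v in det[:4]]
    box_after = [v for det in tensorrt for v in det[:4]]
    conf_before = [det[4] for det in darknet]
    conf_after = [det[4] for det in tensorrt]
    return (fraction_shifted(box_before, box_after, box_threshold),
            fraction_shifted(conf_before, conf_after, conf_threshold))


# Made-up numbers purely to exercise the functions, not measured results.
darknet_dets = [(0.50, 0.50, 0.20, 0.30, 0.90), (0.25, 0.40, 0.10, 0.10, 0.60)]
tensorrt_dets = [(0.50, 0.51, 0.20, 0.30, 0.88), (0.25, 0.40, 0.10, 0.11, 0.35)]
box_frac, conf_frac = compare_detections(darknet_dets, tensorrt_dets)
print(f"box values shifted: {box_frac:.3f}, confidences shifted: {conf_frac:.3f}")
```

In practice the pairing step matters: if detections are matched by overlap first, a detection that disappears after conversion has no partner and would need to be counted separately.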

Here's an example of detections before and after conversion to illustrate the behavior I mean. Note that some inputs and networks fare much worse than this, e.g. with the yolov4 pretrained weights, as many as one third of confidence scores can shift by more than 0.05.

[Image: tkdnn-fp32-accuracy]

Possible explanations that I've ruled out:

Floating point precision seems unrelated. Differences in detection output between the FP32 and FP16 variants of a network are negligible.
It is not specific to one model or set of weights. This occurs with yolov4 and yolov4-csp, with both pretrained and custom-trained weights.
This phenomenon isn't unique to tkDNN, i.e., I observe very similar properties when converting a network within DeepStream.
