Large GPU memory consumption on Tensorrt8 branch

I have been using tkDNN with TensorRT 6 / CUDA 10.0 for a while and everything worked fine. I recently upgraded the GPU from a 2070 to an A4000, which required upgrading all the related drivers. The new environment is CUDA 11.5 / TensorRT 8.2.2.1 / OpenCV 4.5.5. With this environment and the same trained model (YOLOv4, network size: 530x320), GPU memory usage increased from roughly 1 GB to 2.5 GB with FP32, and from 600 MB to 1.9 GB with FP16.

Any ideas why this is happening? Thanks a lot for the great work by the way!

ceccocats/tkDNN, issue #279