WongKinYiu / yolov7

Implementation of paper - YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors
GNU General Public License v3.0

The inference results of trt and pt models are different #528

Open leayz-888 opened 2 years ago

leayz-888 commented 2 years ago

Thanks for your great work. I used the pre-trained model yolov7.pt you provided and the conversion code you provided, and successfully converted the pt model into a trt model; the inference results of the pt model and the trt model are consistent. However, when I used yolov7 to train my own dataset and converted the saved pt model to trt format, I found that the inference results of the pt model and the trt model are very different. May I ask why this is?

The training command I used is:

python -m torch.distributed.launch --nproc_per_node 2 --master_port 9527 train.py --workers 8 --device 0,1 --sync-bn --batch-size 48 --data data/dataset.yaml --img 640 640 --cfg cfg/training/yolov7.yaml --weights '' --name yolov7 --hyp data/hyp.scratch.p5.yaml

My commands for converting the pt model to an onnx model and then a trt model are:

python export.py --weights runs/train/yolov7/best.pt --grid --end2end --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640
python export.py -o home/yolov7/runs/train/yolov7/best.onnx -e yolov7-ours-nms.trt -p fp16

wstchhwp commented 2 years ago

I meet this problem too. The torch pt model scores are very high, but the onnx model scores are very low.

DMLON commented 2 years ago

Same issue here: I trained yolov7 on a custom dataset. The torch model returns scores above 0.9, but the trt-converted model returns scores around 0.4 - 0.5 for the same image. I tested trt conversion with trtexec, Linaom1214/tensorrt-python and triple-Mu/YOLO-TensorRT8, both in fp32 and fp16.

Torch model: image

TRT model: image

wstchhwp commented 2 years ago

Same issue here: I trained yolov7 on a custom dataset. The torch model returns scores above 0.9, but the trt-converted model returns scores around 0.4 - 0.5 for the same image. I tested trt conversion with trtexec, Linaom1214/tensorrt-python and triple-Mu/YOLO-TensorRT8, both in fp32 and fp16.

Torch model: image

TRT model: image

I met the same problem as you.

WongKinYiu commented 2 years ago

Train (cfg/training/) -> Reparameterization (cfg/deploy) -> Convert.

Maybe you missed the reparameterization step.

leayz-888 commented 2 years ago

Maybe you missed the reparameterization step.

Thanks for your reply. Can you tell me exactly how to do the re-parameterization?

WongKinYiu commented 2 years ago

https://github.com/WongKinYiu/yolov7#re-parameterization
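
For reference, below is a condensed sketch of what the repo's tools/reparameterization.ipynb does for cfg/deploy/yolov7.yaml; it is not the notebook verbatim. The checkpoint path, output path, and nc value are placeholders to adapt, and the detection-head index model.105 is specific to yolov7.yaml (other configs use different indices).

```python
import torch
from models.yolo import Model
from utils.torch_utils import select_device

device = select_device('cpu')  # or '0' for GPU
nc = 80  # <-- set this to the number of classes of YOUR dataset

# checkpoint trained with cfg/training/yolov7.yaml
ckpt = torch.load('runs/train/yolov7/weights/best.pt', map_location=device)
# build the deploy graph (plain Detect head instead of IDetect)
model = Model('cfg/deploy/yolov7.yaml', ch=3, nc=nc).to(device)

# copy every weight whose name and shape match between the two graphs
state_dict = ckpt['model'].float().state_dict()
intersect = {k: v for k, v in state_dict.items()
             if k in model.state_dict() and v.shape == model.state_dict()[k].shape}
model.load_state_dict(intersect, strict=False)
model.names = ckpt['model'].names
model.nc = nc

# fold the trained ImplicitA (ia) / ImplicitM (im) layers into the deploy convs;
# layer 105 is the detection head of yolov7.yaml
for head in range(3):  # three detection scales
    w_key, b_key = f'model.105.m.{head}.weight', f'model.105.m.{head}.bias'
    ia = state_dict[f'model.105.ia.{head}.implicit']  # shape (1, C_in, 1, 1)
    im = state_dict[f'model.105.im.{head}.implicit']  # shape (1, C_out, 1, 1)
    model.state_dict()[b_key].data += state_dict[w_key].mul(ia).sum(1).squeeze()
    model.state_dict()[b_key].data *= im.data.squeeze()
    model.state_dict()[w_key].data *= im.data.squeeze().view(-1, 1, 1, 1)

torch.save({'model': model.half(), 'optimizer': None,
            'training_results': None, 'epoch': -1}, 'yolov7-deploy.pt')
```

The part that matters most for custom datasets is building the deploy Model with your own nc before copying the weights.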

DMLON commented 2 years ago

Thank you for your reply! I just tested re-parameterization (the only difference between training/yolov7.yaml and deploy/yolov7.yaml is that the last line uses IDetect instead of Detect) and got the same results after reconverting the model to onnx and trt. In a few hours I'm going to try running the model in ONNX Runtime to check whether the model is failing in the ONNX conversion or the trt conversion.

leayz-888 commented 2 years ago

https://github.com/WongKinYiu/yolov7#re-parameterization

Thank you, I followed your instructions to re-parameterize the trained yolov7.pt and converted it to trt format. I tested it and found the following:

  1. The inference results of the pt model after re-parameterization are consistent with those of the pt model before re-parameterization;
  2. The re-parameterized pt model can be successfully converted to trt format, but there is still a big gap between its inference results and those of the pt model. I have tried both fp16 and fp32 with the same result. I am confused about this; can you take another look? Thank you again.

Thank you for your reply! I just tested re-parameterization (the only difference between training/yolov7.yaml and deploy/yolov7.yaml is that the last line uses IDetect instead of Detect) and got the same results after reconverting the model to onnx and trt. In a few hours I'm going to try running the model in ONNX Runtime to check whether the model is failing in the ONNX conversion or the trt conversion.

After re-parameterization, there is still a big difference between the inference results of the trt model and the pt model.

ghost commented 2 years ago

For re-parameterization it is important to change nc (the number of classes) inside the re-parameterization file to match your custom dataset; getting that wrong can cause weird issues at inference time. It's also important to use opset 11 when exporting the model; at least for me, that was the only opset version that worked.
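
If your export path does not expose an opset flag, a minimal manual ONNX export with opset 11 might look like the sketch below. This is only an illustration of torch.onnx.export's opset_version argument; the checkpoint name, image size, and output filename are placeholders.

```python
import torch

# re-parameterized deploy checkpoint (placeholder name)
ckpt = torch.load('yolov7-deploy.pt', map_location='cpu')
model = ckpt['model'].float().eval()

dummy = torch.zeros(1, 3, 640, 640)  # batch 1, 640x640 input
torch.onnx.export(
    model, dummy, 'yolov7-deploy.onnx',
    opset_version=11,            # the opset that worked in this thread
    input_names=['images'],
    output_names=['output'],
)
```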

leayz-888 commented 2 years ago

For re-parameterization it is important to change nc (the number of classes) inside the re-parameterization file to match your custom dataset; getting that wrong can cause weird issues at inference time. It's also important to use opset 11 when exporting the model; at least for me, that was the only opset version that worked.

Thanks, I have successfully re-parameterized the model, and the inference results before and after re-parameterization are the same. On my own dataset, however, after converting the re-parameterized model to trt, the trt inference results are still very different from the pt model's: not only has the confidence of the targets decreased, but many targets are detected incorrectly.

ghost commented 2 years ago

I also had a lot of weird problems converting my trained model. My not-so-efficient but maybe effective solution would be to try every conversion out; by that I mean a lot of different export commands, for example adding your batch size, simplifying the model, adding your img-size, etc. to the export process, and also trying different opset versions. After a lot of trial and error I finally found what I needed for my OpenCV DNN inference, so maybe that experimental approach could work for you too. Otherwise I can only imagine that the TensorRT conversion from .onnx to .trt messes up somewhere; to check that, I recommend checking which .onnx models are supported for the .trt engine conversion.

leayz-888 commented 2 years ago

For example adding your batch size, simplifying the model, adding your img-size, etc. to the export process.

OK, thanks for the suggestion, I'll try it. By the way, what's the conversion command you're using?

ghost commented 2 years ago

I mean, since it's not for TensorRT but for OpenCV DNN, I guess it won't help, but simply: export.py --weights [myweights].pt --img [myimagesize] --include onnx --opset 11. The https://github.com/WongKinYiu/yolov7/tree/u5 branch was used for the export; I read in a different thread that this branch is required.

DMLON commented 2 years ago

Update: I managed to run the model with ONNX Runtime and got the same issue, so the problem isn't the TRT conversion; it should be how the model is exported to ONNX.
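
For anyone wanting to reproduce this check, a minimal ONNX Runtime run looks like the sketch below; best.onnx and the random test input are placeholders, and in practice you would feed the same preprocessed image to both the torch model and the session.

```python
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession('best.onnx', providers=['CPUExecutionProvider'])
inp_name = sess.get_inputs()[0].name

img = np.random.rand(1, 3, 640, 640).astype(np.float32)  # placeholder input
ort_out = sess.run(None, {inp_name: img})

# feed the identical array to the torch model and diff the raw outputs, e.g.
#   torch_out = model(torch.from_numpy(img))[0].detach().cpu().numpy()
#   print(np.abs(ort_out[0] - torch_out).max())
```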

ghost commented 2 years ago

If you tried out my command and it's not working, I'm clueless, to be honest. Maybe somebody more experienced can help here.

leayz-888 commented 2 years ago

Update: I managed to run the model with ONNX Runtime and got the same issue, so the problem isn't the TRT conversion; it should be how the model is exported to ONNX.

I may have found the reason:

  1. When using the pt model for inference, the script detect.py is used with --rect set to True, so the image is padded to a rectangular shape to speed up inference, instead of the fixed (640, 640) used by the onnx and trt models;
  2. When using the trt model for inference, the data preprocessing step converts the image from BGR to RGB, but the image was not converted to RGB during inference with the pt model, which causes differences in the inference results. Now the inference results of my pt model and trt model are basically the same; I hope this helps you (a preprocessing sketch follows below).
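
To make the comparison fair, the trt/onnx input has to be preprocessed the same way detect.py prepares images for the pt model. A minimal sketch follows; the grey (114, 114, 114) padding and the 640x640 target size are the usual defaults, test.jpg is a placeholder, and note that a fixed-shape engine is padded to the full square rather than the rectangular shape --rect produces.

```python
import cv2
import numpy as np

def letterbox(im, new_shape=(640, 640), color=(114, 114, 114)):
    # resize keeping aspect ratio, then pad to new_shape (same idea as utils/datasets.letterbox)
    h, w = im.shape[:2]
    r = min(new_shape[0] / h, new_shape[1] / w)
    new_unpad = (int(round(w * r)), int(round(h * r)))
    if (w, h) != new_unpad:
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]
    top, bottom = dh // 2, dh - dh // 2
    left, right = dw // 2, dw - dw // 2
    return cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)

img0 = cv2.imread('test.jpg')                       # OpenCV loads images as BGR
img = letterbox(img0)                               # pad to a fixed 640x640 for the trt/onnx engine
img = img[:, :, ::-1].transpose(2, 0, 1)            # BGR -> RGB, HWC -> CHW (mirrors detect.py)
img = np.ascontiguousarray(img, dtype=np.float32) / 255.0
img = img[None]                                     # add batch dimension -> (1, 3, 640, 640)
```
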
lxzatwowone1 commented 2 years ago

Update: I managed to run the model with ONNX Runtime and got the same issue, so the problem isn't the TRT conversion; it should be how the model is exported to ONNX. I may have found the reason:

  1. When using the pt model for inference, the script detect.py is used with --rect set to True, so the image is padded to a rectangular shape to speed up inference, instead of the fixed (640, 640) used by the onnx and trt models;
  2. When using the trt model for inference, the data preprocessing step converts the image from BGR to RGB, but the image was not converted to RGB during inference with the pt model, which causes differences in the inference results. Now the inference results of my pt model and trt model are basically the same; I hope this helps you.

I have the same problem: the trt inference scores are much lower, but the boxes are the same as the pt model's. I found that your point 2 is wrong; the pt model inference also does BGR2RGB in def letterbox. Can you tell me how to make the trt scores the same as the pt scores? Thank you very much!

wstchhwp commented 2 years ago

Update: I managed to run the model with ONNX Runtime and got the same issue, so the problem isn't the TRT conversion; it should be how the model is exported to ONNX. I may have found the reason:

  1. When using the pt model for inference, the script detect.py is used with --rect set to True, so the image is padded to a rectangular shape to speed up inference, instead of the fixed (640, 640) used by the onnx and trt models;
  2. When using the trt model for inference, the data preprocessing step converts the image from BGR to RGB, but the image was not converted to RGB during inference with the pt model, which causes differences in the inference results. Now the inference results of my pt model and trt model are basically the same; I hope this helps you.

Can you tell me how to make the onnx scores the same as the pt scores? Thank you very much!

mgodbole1729 commented 2 years ago

I mean, since it's not for TensorRT but for OpenCV DNN, I guess it won't help, but simply: export.py --weights [myweights].pt --img [myimagesize] --include onnx --opset 11. The https://github.com/WongKinYiu/yolov7/tree/u5 branch was used for the export; I read in a different thread that this branch is required.

I am unable to load the onnx model with OpenCV DNN using this command. What did you use?

DMLON commented 2 years ago

Update: I managed to run the model with ONNX Runtime and got the same issue, so the problem isn't the TRT conversion; it should be how the model is exported to ONNX. I may have found the reason:

  1. When using the pt model for inference, the script detect.py is used with --rect set to True, so the image is padded to a rectangular shape to speed up inference, instead of the fixed (640, 640) used by the onnx and trt models;
  2. When using the trt model for inference, the data preprocessing step converts the image from BGR to RGB, but the image was not converted to RGB during inference with the pt model, which causes differences in the inference results. Now the inference results of my pt model and trt model are basically the same; I hope this helps you.

Could you please provide step-by-step instructions on how you did it? From what I understood from your explanation, you preprocessed the image by switching from BGR channels to RGB channels? I tried that; same issue.

sapjunior commented 2 years ago

I also encountered this problem. I think the problem may not come from re-parameterization, since the current export.py already includes that process. May I ask: are you guys working on single-class detection?

It might be related to this pull request:

https://github.com/WongKinYiu/yolov7/pull/305

which refers to https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/utils/general.py#L649

and the exported onnxruntime NMS in https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/models/experimental.py#L176

and also the TRT NMS in https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/models/experimental.py#L208

Update: Confirmed that those two lines in the NMS are responsible for the incorrect probabilities in the exported onnx and TensorRT models. After editing those two lines using the patch from DMLON, the output results are now consistent!
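
For reference, the referenced line in utils/general.py handles single-class models specially when building the final confidence. Below is a paraphrased, self-contained version of that logic (the function name is only illustrative); the exported NMS wrappers in models/experimental.py always take the obj_conf * cls_conf branch, which appears to be why single-class models come out of the export with lower scores.

```python
import torch

def final_confidence(x: torch.Tensor, nc: int) -> torch.Tensor:
    # x: raw detections of shape (num_boxes, 5 + nc), columns [xywh, obj_conf, cls_conf...]
    if nc == 1:
        x[:, 5:] = x[:, 4:5]   # single class: cls_conf is a constant ~0.5, so keep obj_conf as-is
    else:
        x[:, 5:] *= x[:, 4:5]  # conf = obj_conf * cls_conf
    return x
```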

DMLON commented 2 years ago

I also encountered this problem. I think the problem may not come from re-parameterization, since the current export.py already includes that process. May I ask: are you guys working on single-class detection?

It might be related to this pull request:

#305

which refers to

https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/utils/general.py#L649

and the exported onnxruntime NMS in

https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/models/experimental.py#L176

and also the TRT NMS in

https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/models/experimental.py#L208

Interesting find! I tested one of my models that has multiple classes and got better inferences (some still show differences, 0.92 vs 0.733). It seems my single-class model does not work because of that pull request? I'll test changing those lines later to see if we get better results.

EDIT: Holy crap, that was it. The ONNX model works like a charm for single-class models. I had to change how the end2end function works by passing the number of classes as a parameter, then integrated the same logic as specified in

https://github.com/WongKinYiu/yolov7/blob/44d8ab41780e24eba563b6794371f29db0902271/utils/general.py#L649

Thanks a lot!! Making a pull request
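
The actual change is in the pull request; as a hedged paraphrase, the score computation inside the exported NMS wrappers (ONNX_ORT / ONNX_TRT in models/experimental.py) ends up mirroring the same single-class branch once the number of classes is passed down from export.py. The function name and the n_classes parameter below are illustrative.

```python
import torch

def end2end_scores(x: torch.Tensor, n_classes: int) -> torch.Tensor:
    # x: raw detections of shape (batch, num_boxes, 5 + n_classes)
    obj_conf = x[:, :, 4:5]
    cls_conf = x[:, :, 5:]
    if n_classes == 1:
        return obj_conf             # single class: objectness alone is the score
    return cls_conf * obj_conf      # multi-class: conf = obj_conf * cls_conf
```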

leidix commented 1 year ago

The --simplify option in the export process messed up all my metrics. Exporting without --simplify fixed it for me; it took me forever to find the reason.