WongKinYiu / yolov9

Implementation of paper - YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information
GNU General Public License v3.0

Training on 4090, yolov9-c, converted model is degraded #414

Open AlexAnnecy opened 4 months ago

AlexAnnecy commented 4 months ago

Hello @WongKinYiu

I've been attempting training runs for a week on a GeForce 4090, but the problem is:

I've tried dozens of software configurations, including the container.

The same problem doesn't happen on a weaker GeForce 2070S.

Have you tried on 4090? This is very weird.

(Now trying on an A4500.)

levipereira commented 4 months ago

I have no issues with the RTX 4090, but I'm using pytorch:23.02-py3 instead of pytorch:21.11-py3.

https://github.com/levipereira/yolov9-qat

Ada Lovelace is supported from CUDA 11.8 onward, while pytorch:21.11-py3 ships with CUDA 11.5:
https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_21-11.html
https://developer.nvidia.com/blog/cuda-toolkit-11-8-new-features-revealed/#:~:text=NVIDIA%20announces%20the%20newest%20CUDA,speedup%20through%20new%20hardware%20capabilities.

Maybe you can try it.
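To check a given environment against the CUDA requirement mentioned above, something like the following sketch may help. The helper names (`parse_cuda_version`, `cuda_supports_ada`, `installed_cuda_version`) are hypothetical, and the 11.8 threshold is taken from this comment. Passing the check is necessary but not sufficient, since the PyTorch build itself must also target the GPU architecture.

```python
from typing import Optional, Tuple

# Assumption taken from the comment above: Ada Lovelace GPUs (e.g. RTX 4090)
# need CUDA 11.8 or newer.
MIN_CUDA_FOR_ADA: Tuple[int, int] = (11, 8)


def parse_cuda_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '11.5' or '11.8.0' into (major, minor)."""
    parts = version.strip().split(".")
    major = int(parts[0])
    minor = int(parts[1]) if len(parts) > 1 else 0
    return major, minor


def cuda_supports_ada(version: str) -> bool:
    """True if the given CUDA version meets the assumed Ada minimum."""
    return parse_cuda_version(version) >= MIN_CUDA_FOR_ADA


def installed_cuda_version() -> Optional[str]:
    """CUDA version PyTorch was built against, or None if unavailable."""
    try:
        import torch  # optional dependency; only needed for the live check
    except ImportError:
        return None
    return torch.version.cuda  # None for CPU-only builds


if __name__ == "__main__":
    version = installed_cuda_version()
    if version is None:
        print("No CUDA-enabled PyTorch build found")
    else:
        print(f"CUDA {version}: meets Ada minimum = {cuda_supports_ada(version)}")
```

Run inside the container in question, this prints whether the bundled CUDA version reaches 11.8. With the versions quoted above, `cuda_supports_ada("11.5")` is false and `cuda_supports_ada("11.8")` is true.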

AlexAnnecy commented 4 months ago

I'll try the Docker image you mentioned tomorrow. I tried bare metal with the RTX 4090, and there is a discrepancy after reparameterization. On an A4500 there is the same discrepancy, but smaller.

AlexAnnecy commented 4 months ago

@levipereira for me, the problem is the same with the Docker container you mentioned: when training a custom model from scratch (derived from yolo-c.yaml), the mAP of the trained model differs from the mAP after removing the auxiliary branch.

(@WongKinYiu did you ever encounter this when training your models: the trained model has different outputs than the reparameterized model?)

WongKinYiu commented 3 months ago

If you do not finish training, you have to strip your model before doing reparameterization.
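As a rough illustration of what "stripping" might involve, here is a minimal sketch that assumes a YOLOv5-style checkpoint dictionary with `model`, `ema`, and `optimizer` entries. The function name `strip_checkpoint` and the plain-dict layout are hypothetical and only show the idea. The repository's own stripping utility should be preferred for real checkpoints.

```python
from typing import Any, Dict

# Keys assumed to hold training-only state in a YOLOv5-style checkpoint.
TRAINING_ONLY_KEYS = ("optimizer", "best_fitness", "ema", "updates")


def strip_checkpoint(ckpt: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `ckpt` reduced to inference-ready state.

    Hypothetical sketch: if EMA (averaged) weights are present, they replace
    the raw weights, then the training-only entries are cleared.
    """
    stripped = dict(ckpt)  # shallow copy; the input dict is left untouched
    if stripped.get("ema") is not None:
        # An unfinished run may still carry raw weights under "model" that
        # differ from the EMA weights used during validation.
        stripped["model"] = stripped["ema"]
    for key in TRAINING_ONLY_KEYS:
        stripped[key] = None
    stripped["epoch"] = -1  # mark the checkpoint as final
    return stripped


if __name__ == "__main__":
    # Stand-in values; a real checkpoint would hold model objects instead.
    interrupted = {
        "model": "raw-weights",
        "ema": "ema-weights",
        "optimizer": {"lr": 0.01},
        "epoch": 12,
    }
    print(strip_checkpoint(interrupted)["model"])
```

Under these assumptions, reparameterizing an unstripped checkpoint from an interrupted run would convert the raw weights rather than the EMA weights that were validated, which could explain an mAP gap between the trained and the converted model.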