Megvii-BaseDetection / YOLOX

YOLOX is a high-performance anchor-free YOLO, exceeding yolov3~v5 with MegEngine, ONNX, TensorRT, ncnn, and OpenVINO supported. Documentation: https://yolox.readthedocs.io/
Apache License 2.0
9.26k stars 2.18k forks

Occasionally the 'iter_time' becomes longer and longer #664

Open xuezu29 opened 2 years ago

xuezu29 commented 2 years ago

Previous epoch: [image] The iter_time (1s) is normal.

After a while: [image] The iter_time (6s-10s) is too long.

It happens occasionally, and iter_time goes back to normal after I restart training. Any suggestions? Thanks a lot!

GOATmessi8 commented 2 years ago

Please check whether any other process is running on your GPUs.
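One way to run this check from a script is sketched below. It assumes `nvidia-smi` is on the PATH and uses its `--query-compute-apps` CSV output; the helper names `parse_compute_apps` and `list_gpu_processes` are made up for this illustration and are not part of YOLOX.

```python
import shutil
import subprocess

# nvidia-smi query that lists compute processes as CSV rows without a header.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=pid,process_name,used_memory",
    "--format=csv,noheader",
]


def parse_compute_apps(csv_text):
    """Parse 'pid, process_name, used_memory' rows into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.split(",")]
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        rows.append({
            "pid": parts[0],
            "name": ",".join(parts[1:-1]),  # tolerate commas inside the name
            "used_memory": parts[-1],
        })
    return rows


def list_gpu_processes():
    """Return the compute processes nvidia-smi reports, or None if it is not installed."""
    if shutil.which("nvidia-smi") is None:
        return None
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_compute_apps(out)


if __name__ == "__main__":
    print(list_gpu_processes())
```

If the list contains anything besides your own training processes, another job is sharing the GPU.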

xuezu29 commented 2 years ago

> Please check whether any other process is running on your GPUs.

I'm sure there is no other process running on the GPUs.

xuezu29 commented 2 years ago

@ruinmessi The problem does not appear after I restart training. When the 'iter_time' became longer, I used 'watch nvidia-smi' to check the GPU status, and 'Volatile GPU-Util' stayed at a low value (0-10%) for a long time.
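Low GPU utilization during long iterations may mean the GPU is waiting on something else, for example the data loader, although that is only one possibility. A minimal sketch for measuring how long each iteration waits for its batch is shown below. `timed_batches` and `slow_loader` are hypothetical names for this illustration, and the loader is a stand-in rather than the YOLOX data loader.

```python
import time


def timed_batches(loader):
    """Yield (batch, wait_seconds): the time spent blocked waiting for each batch."""
    iterator = iter(loader)
    while True:
        start = time.perf_counter()
        try:
            batch = next(iterator)
        except StopIteration:
            return
        yield batch, time.perf_counter() - start


def slow_loader(n_batches, delay):
    """Hypothetical loader that takes `delay` seconds to produce each batch."""
    for i in range(n_batches):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    for batch, wait in timed_batches(slow_loader(3, 0.05)):
        print(f"batch {batch}: waited {wait:.3f}s for data")
```

If the wait time grows together with iter_time, the slowdown is on the data side. If it stays small, the time is being spent elsewhere in the iteration. When timing GPU work in PyTorch the same way, call `torch.cuda.synchronize()` before reading the clock, since CUDA kernels run asynchronously.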

GOATmessi8 commented 2 years ago

Do you have many gt (ground-truth) objects in one image? Your problem may be difficult to locate. I suggest using line_profiler to measure the time spent on each line of the training loop.
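A minimal sketch of that suggestion follows, assuming the `line_profiler` package is installed (`pip install line_profiler`). `load_batch`, `forward_backward`, and `train_one_iter` are hypothetical stand-ins, not the YOLOX trainer; in practice you would wrap the real per-iteration method instead.

```python
import time

try:
    from line_profiler import LineProfiler
except ImportError:  # keep the sketch runnable when the package is missing
    LineProfiler = None


def load_batch():
    """Hypothetical stand-in for fetching one batch."""
    time.sleep(0.01)
    return list(range(1000))


def forward_backward(batch):
    """Hypothetical stand-in for the forward and backward pass."""
    return sum(x * x for x in batch)


def train_one_iter():
    """Hypothetical stand-in for one iteration of the training loop."""
    batch = load_batch()
    loss = forward_backward(batch)
    return loss


def profile_iters(n_iters=5):
    """Run n_iters iterations, recording per-line timings when line_profiler is available."""
    if LineProfiler is None:
        return [train_one_iter() for _ in range(n_iters)], None
    profiler = LineProfiler()
    profiled = profiler(train_one_iter)  # wrap the function to be timed line by line
    losses = [profiled() for _ in range(n_iters)]
    return losses, profiler


if __name__ == "__main__":
    losses, profiler = profile_iters()
    if profiler is not None:
        profiler.print_stats()  # per-line hit counts and time for train_one_iter
```

Comparing the report from a normal run against one captured while iter_time is long should show which line accounts for the extra time.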