This has never happened to me with this code. You might check which other processes are blocking your threads while the data loader is moving data between CPU and GPU memory, or whether your PyTorch version is suitable for your machine. By the way, when GPU usage drops it usually means a CPU thread (e.g., a data-loader worker) has been blocked.
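For example, something like the rough sketch below can show whether the stall is in data loading or in the GPU compute. It is not from this repo; `model`, `train_loader`, `criterion`, and `optimizer` are placeholders for whatever your training script actually defines.

```python
import time
import torch

data_time, step_time = [], []
end = time.time()
for step, (inputs, targets) in enumerate(train_loader):
    # Time spent waiting on the data loader for this batch
    data_time.append(time.time() - end)

    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Synchronize so the measured step time includes the GPU work
    torch.cuda.synchronize()
    step_time.append(time.time() - end)
    end = time.time()

    if step % 20 == 0:
        print(f"step {step}: data {data_time[-1]:.2f}s / total {step_time[-1]:.2f}s")
```

If the data-loading time tracks the 15 sec spikes, the bottleneck is on the CPU side (workers, disk I/O, or another process), not in the model itself.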
Thanks for your reply. I will check the code again. But it's weird: from epoch 2 onward the training procedure is normal, and the situation above never happens again.
I am afraid there must be other processes interfering.
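If you want to confirm that, a quick way (just a sketch, assuming `nvidia-smi` is on PATH and `psutil` is installed; neither is part of this project) is to list whatever else is using the GPU and the CPU while training runs:

```python
import subprocess
import psutil

# Other processes currently holding GPU memory
print(subprocess.run(
    ["nvidia-smi", "--query-compute-apps=pid,process_name,used_memory", "--format=csv"],
    capture_output=True, text=True).stdout)

# Processes with the highest CPU usage (potential competition with data-loader workers);
# cpu_percent is measured since the previous call, so the very first reading may be 0
for proc in sorted(psutil.process_iter(["pid", "name", "cpu_percent"]),
                   key=lambda p: p.info["cpu_percent"] or 0, reverse=True)[:10]:
    print(proc.info)
```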
Thanks for sharing your work. However, I have a problem when training the model: the per-iteration training time is unstable during epoch 1 but becomes stable after the first epoch.
After several normal iterations, the training time increases to about 15 sec/iter and then returns to normal (1.5 sec/iter). This pattern (roughly 5 normal steps -> 1 abnormal step -> 5 normal steps -> 1 abnormal step ...) repeats throughout epoch 1.
Please see the screenshot below: from step 314 to step 319 the training time is normal, but step 320 takes much longer (and at the same time GPU memory usage drops to 10%). Do you know what the problem is?