begeekmyfriend / tacotron2

Forked from NVIDIA/tacotron2 and merged with Rayhane-mamah/Tacotron-2
BSD 3-Clause "New" or "Revised" License

not stable when training the model in the first epoch #9

Closed wenbozhangjs closed 4 years ago

wenbozhangjs commented 4 years ago

Thanks for sharing your work, but I have a problem when training the model: the per-iteration training time is unstable during the first epoch, then becomes stable from the second epoch onward.

After several normal iterations, the training time jumps to about 15 sec/iter and then returns to normal (about 1.5 sec/iter). This pattern (roughly 5 normal steps -> 1 abnormal step -> 5 normal steps -> 1 abnormal step ...) repeats throughout epoch 1.

Please see the screenshot below. From step 314 to step 319 the training time is normal, but step 320 takes much longer (and at the same time GPU utilization drops to about 10%). Do you know what the problem is? [screenshot: 微信图片_20191219212241]
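To tell whether these slow steps come from data loading or from computation, one simple diagnostic is to time the `next(iter)` call separately from the rest of the training step. The sketch below uses a toy in-memory dataset as a stand-in for the real training data (an assumption; the actual project loads mel spectrograms from disk):

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training batches (assumption).
dataset = TensorDataset(torch.randn(256, 80), torch.randn(256, 80))
loader = DataLoader(dataset, batch_size=32, num_workers=0)

fetch_times, step_times = [], []
it = iter(loader)
while True:
    t0 = time.perf_counter()
    try:
        x, y = next(it)  # time spent waiting on the data loader
    except StopIteration:
        break
    t1 = time.perf_counter()
    loss = (x - y).pow(2).mean()  # stand-in for forward/backward/optimizer step
    t2 = time.perf_counter()
    fetch_times.append(t1 - t0)
    step_times.append(t2 - t1)

print(f"max fetch: {max(fetch_times):.4f}s  max step: {max(step_times):.4f}s")
```

If the spikes show up in the fetch times rather than the step times, the bottleneck is on the CPU/data-loading side, which would match the GPU utilization drop described above.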

begeekmyfriend commented 4 years ago

This has never happened to me. You might check whether other processes are blocking your threads while the data loader moves data between CPU and GPU memory, or whether your PyTorch version is suitable for your machine. By the way, a drop in GPU usage usually means the CPU threads are blocked.
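One common way to keep the GPU from being starved by CPU-side loading is to tune the standard `DataLoader` knobs. A minimal sketch, with hypothetical settings (the toy dataset and the specific values are assumptions, not this repo's configuration); `persistent_workers` in particular avoids re-spawning workers each epoch, which is one plausible reason the first epoch behaves differently from later ones:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset standing in for the real training data (assumption).
dataset = TensorDataset(torch.randn(64, 80))

# Hypothetical tuning; set num_workers to roughly your CPU core count.
loader = DataLoader(
    dataset,
    batch_size=16,
    num_workers=2,            # prefetch batches in background processes
    pin_memory=True,          # page-locked buffers speed host-to-GPU copies
    persistent_workers=True,  # avoid per-epoch worker respawn (PyTorch >= 1.7)
)

n_batches = 0
for (batch,) in loader:
    # non_blocking copies can overlap with compute when pin_memory=True
    if torch.cuda.is_available():
        batch = batch.to("cuda", non_blocking=True)
    n_batches += 1
```

Filesystem caching can also play a role: the first epoch reads every sample cold from disk, while later epochs may hit the OS page cache.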

wenbozhangjs commented 4 years ago

Thanks for your reply. I will check the code again. But it's weird: from epoch 2 onward the training procedure is normal, and the situation above never happens again.

begeekmyfriend commented 4 years ago

I am afraid there must be some other interfering processes.