TaoRuijie / Loss-Gated-Learning

ICASSP 2022: 'Self-supervised Speaker Recognition with Loss-gated Learning'
MIT License

The training speed is too slow with same configuration on 4090 #10

Closed Chloe-qiuyu closed 7 months ago

Chloe-qiuyu commented 7 months ago

Hello Ruijie, why does it take me more than four hours to train one epoch in Stage 1 with the same configuration on a 4090? The same problem also exists in Stage 2!

TaoRuijie commented 7 months ago

Your GPU utilization is likely the problem; that might be due to a dataloader issue. You can check this Chinese video: https://www.bilibili.com/video/BV1dF411g7t1/

varun-krishnaps commented 7 months ago

Hi TaoRuijie,

I face the same issue: training is very slow in both stages. Can you please suggest fixes? I'm not a Chinese speaker, so I can't understand the video you shared.

TaoRuijie commented 7 months ago

Ok sure,

  1. For speaker recognition, during each training epoch (say the batch size is 200), the model needs to load 200 utterances from the hard disk. My code also uses data augmentation, which loads at least one noisy .wav file from the hard disk per utterance. Training then has two steps: 1. Load the data and mix clean speech with noise for augmentation (CPU); 2. Feed the data into the model and update the parameters (GPU).

  2. If you check your GPU usage (`watch -n 0.1 nvidia-smi`), it might show a periodically low utilization percentage. Most of the time your server is doing step 1 on the CPU, loading data, so the GPU sits idle waiting for data instead of training -> your training is very slow.

  3. So the reason might be that your data is on an HDD instead of an SSD, which leads to an I/O loading bottleneck. You can also remove the 'babble' and 'tv' augmentation methods in the dataloader to reduce the augmentation load.
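To confirm that step 1 (CPU data loading) rather than step 2 (GPU training) is the bottleneck, you can time the two steps separately. A minimal sketch below simulates the loop with `time.sleep` standing in for real disk I/O and GPU work; the function names `load_and_augment_batch` and `train_step` are hypothetical placeholders for your own dataloader and model step:

```python
import time

# Hypothetical stand-ins for the real pipeline; replace the sleeps
# with your actual load/augment code and model forward/backward pass.
def load_and_augment_batch():
    time.sleep(0.02)  # simulate slow disk I/O + CPU augmentation
    return [0.0] * 200

def train_step(batch):
    time.sleep(0.005)  # simulate the GPU update
    return sum(batch)

load_time = train_time = 0.0
for _ in range(10):
    t0 = time.perf_counter()
    batch = load_and_augment_batch()
    t1 = time.perf_counter()
    train_step(batch)
    t2 = time.perf_counter()
    load_time += t1 - t0
    train_time += t2 - t1

frac = load_time / (load_time + train_time)
print(f"fraction of wall time spent loading data: {frac:.0%}")
# A high fraction means the GPU is idle waiting for data, which
# matches the periodic low utilization seen in `watch -n 0.1 nvidia-smi`.
```

If the loading fraction dominates, moving the data to an SSD, increasing the number of dataloader workers, or dropping the heaviest augmentation types should shrink it.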

varun-krishnaps commented 6 months ago

Thanks a lot Tao, your solution worked! An epoch takes only 20 minutes now.