WeidiXie / VGG-Speaker-Recognition

Utterance-level Aggregation For Speaker Recognition In The Wild

Training is quite slow. #33

Closed alamnasim closed 5 years ago

alamnasim commented 5 years ago

I am also facing this issue: the model trains very slowly. I run other projects on the same GPU and they run fine (the GPU is used), but VGG-Speaker-Recognition runs slowly. I tried it on two NVIDIA GTX 1060s installed in my computer, and on a P100 on Google Cloud as well.

I have tried everything to resolve this issue but have not succeeded.

Epoch 1/10
Learning rate for epoch 1 is 0.0001.
17/305810 [..............................] - ETA: 3354:00:12 - loss: 0.8716 - acc: 0.9531

Please help. Thanks.

WeidiXie commented 5 years ago

Hi,

alamnasim commented 5 years ago

Thanks for your prompt reply.

The main problem I am facing is that training does not use the GPU, while other projects can use the same GPU. Do I need to change something in the code?

moniGra commented 5 years ago

Have you changed the --gpu 2,3 option in the training command to match the GPU set-up on your machine? I have just one GPU, so for me the proper setting is --gpu 0; otherwise it cannot detect the GPU and runs on the CPU. --gpu sets the CUDA_VISIBLE_DEVICES variable - you can google that for more information on what to specify there.
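A minimal sketch of what such a --gpu flag typically does (the names below are illustrative, not copied from the repository): the value is written into CUDA_VISIBLE_DEVICES before the deep-learning framework initializes, so --gpu 0 exposes the first physical GPU, --gpu 0,1 exposes the first two, and an index that matches no physical GPU silently falls back to the CPU.

```python
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--gpu', default='', type=str,
                    help='comma-separated GPU indices, e.g. "0" or "0,1"')
# Simulate running the script as: python main.py --gpu 0
args = parser.parse_args(['--gpu', '0'])

# This must happen before TensorFlow/Keras is imported and initialized,
# otherwise the framework has already enumerated the devices.
os.environ['CUDA_VISIBLE_DEVICES'] = args.gpu
print(os.environ['CUDA_VISIBLE_DEVICES'])  # 0
```

With CUDA_VISIBLE_DEVICES set to an index that does not exist on the machine (e.g. "2,3" on a single-GPU box), CUDA sees no devices at all, which matches the "runs on CPU" symptom described above.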

alamnasim commented 5 years ago

> Have you changed the --gpu 2,3 option in the training command to match the GPU set-up on your machine? I have just one GPU, so for me the proper setting is --gpu 0; otherwise it cannot detect the GPU and runs on the CPU. --gpu sets the CUDA_VISIBLE_DEVICES variable - you can google that for more information on what to specify there.

I have two GTX 1060 GPUs, so I tried --gpu 1,2, --gpu 0, and also --gpu 1, but in none of these cases was the GPU detected. I also googled this, but the problem is not solved.

moniGra commented 5 years ago

> > Have you changed the --gpu 2,3 option in the training command to match the GPU set-up on your machine? I have just one GPU, so for me the proper setting is --gpu 0; otherwise it cannot detect the GPU and runs on the CPU. --gpu sets the CUDA_VISIBLE_DEVICES variable - you can google that for more information on what to specify there.
>
> I have two GTX 1060 GPUs, so I tried --gpu 1,2, --gpu 0, and also --gpu 1, but in none of these cases was the GPU detected. I also googled this, but the problem is not solved.

Maybe try --gpu 0,1; as far as I remember, the number to use is the GPU index, starting from 0. However, --gpu 0 and --gpu 1 should both work in your case anyway... I only tried this code in test mode, but the GPU was working. Is nvidia-smi working? If not, try restarting the machine.

mmxuan18 commented 5 years ago

Maybe the slowness is caused by the data loader: in my tests, loading a wav and computing the spectrogram with librosa is very slow compared to scipy.io.wavfile plus scipy.signal, or to using TensorFlow's STFT directly. Another point: you can first select a 3 s segment of the wav and then compute the spectrogram, rather than computing the spectrogram of the whole wav and then selecting 300 frames.
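The crop-then-transform idea from the comment above can be sketched as follows (illustrative only, with assumed parameter values, not the repository's loader): cropping the waveform to ~3 s before the STFT avoids transforming audio that is discarded anyway.

```python
import numpy as np
from scipy import signal

def spectrogram_of_crop(wav, sr=16000, seconds=3.0, n_fft=512, hop=160):
    """Crop a random ~3 s window first, then compute the magnitude STFT."""
    crop_len = int(sr * seconds)
    if len(wav) > crop_len:
        start = np.random.randint(0, len(wav) - crop_len)
        wav = wav[start:start + crop_len]
    # Only the cropped segment is transformed, not the whole utterance.
    _, _, spec = signal.stft(wav, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    return np.abs(spec)

# A synthetic 10 s waveform stands in for a file read with scipy.io.wavfile.read.
wav = np.random.randn(16000 * 10).astype(np.float32)
spec = spectrogram_of_crop(wav)
print(spec.shape[0])  # 257, i.e. n_fft // 2 + 1 frequency bins
```

For a 10 s utterance this does roughly a third of the FFT work of transforming the full waveform and then slicing 300 frames out of the result.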

xiaomeilu commented 5 years ago

Hi, I want to check whether my training speed is reasonable.

117/7492 [..............................] - ETA: 8:24:28 - loss: 11.4793 - acc: 0.7942

My config is almost the same as the training command in the README, except that 8 Titan 1080 Ti GPUs are used, multiprocess is 32, and the loss is amsoftmax (the fastest setting I have found). Finishing 128 epochs of training on VoxCeleb2 would take 42 days (3 epochs per day). That seems too long. Is it running correctly? I see that GPU-Util always stays at 0%, so the speed bottleneck seems to be the data preprocessing.

Another question: what accuracy will the model reach on the VoxCeleb2 validation set when it has converged?

P.S. Thanks for this nice work
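When GPU-Util sits at 0%, the GPU is starved waiting for batches. One generic remedy, independent of this repository's code, is to prepare the next batch on a background thread while the current one trains. A minimal sketch (all names here are illustrative):

```python
import queue
import threading

def prefetch(generator, depth=4):
    """Wrap a batch generator so up to `depth` batches are produced ahead
    of time on a background thread, hiding preprocessing latency."""
    q = queue.Queue(maxsize=depth)
    DONE = object()  # sentinel marking the end of the stream

    def worker():
        for batch in generator:
            q.put(batch)  # blocks when the buffer is full
        q.put(DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is DONE:
            return
        yield batch

# A toy generator stands in for the real spectrogram batches.
batches = list(prefetch(iter(range(5))))
print(batches)  # [0, 1, 2, 3, 4]
```

Keras generators offer a similar effect via the `workers` and `use_multiprocessing` arguments of `fit_generator`, which is presumably what the `multiprocess 32` setting above controls.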

WeidiXie commented 5 years ago

Hi,

I don't think this speed is correct... it's too slow. I would recommend editing the code slightly as @mlinxiang mentioned.

Sorry, I'm busy this week, but I can try to edit the code next week to improve the speed.

That said, I don't quite understand why it is so slow on different machines; on my machine, each epoch takes about 2-3 hours.

The final training accuracy should be around 91-92%.

Again, this is very initial work; there's a lot of room for improvement.

Best, Weidi