flashlight / wav2letter

Facebook AI Research's Automatic Speech Recognition Toolkit
https://github.com/facebookresearch/wav2letter/wiki

Stuck when running on multi-GPU #778

Open tryanbot opened 4 years ago

tryanbot commented 4 years ago

Question

Hi, could someone help me figure out why my training gets stuck when using multiple GPUs?

Additional Context

Training is stuck; no training iteration completes (I am running only 1 iteration).

Here is the .cfg:

--datadir=/home/user/data/audio/
--rundir=/home/user/data/audio/
--archdir=/home/user/dev/wav2letter/tutorials/1-librispeech_clean/
--train=lists/train-clean-100.lst
--valid=lists/dev-clean.lst
--input=flac
--arch=network.arch
--tokens=/home/user/data/audio/am/tokens.txt
--lexicon=/home/user/data/audio/am/lexicon.txt
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=1
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=14
--batchsize=32
--runname=librispeech_clean_trainlogs
--iter=1
--logtostderr=1
--minloglevel=0
--enable_distributed=True
--reportiters=1

tlikhomanenko commented 4 years ago

What appears in your log while it hangs? Your batch size seems very large; could you try 4, just as a test?

tryanbot commented 4 years ago

I am able to start training with only 1 GPU, so I don't think the batch size is the problem. I tried a batch size of 4 anyway, and it still hangs. The log with multiple GPUs is empty, as if nothing had run [screenshot], while the log with only 1 GPU is as follows: [screenshot]

It seems that multi-GPU training cannot start; it only reserves the computing resources. Please help.

tryanbot commented 4 years ago

Additional information that may (or may not) help: [screenshot]. This is the message I get when I interrupt the process (after it has hung for a while).

Dr-AyanDebnath commented 4 years ago

My suggestion, @tryanbot, would be to try increasing iter (I saw that you already tried reducing batchsize to 4).

tryanbot commented 4 years ago

I tried 1 million before. The iteration count is probably not the problem, because the script runs fine on 1 GPU (regardless of batch size and iteration count).

Dr-AyanDebnath commented 4 years ago

Are you using a command similar to this one to run on GPU?

mpirun --allow-run-as-root -n 4 /root/wav2letter/build/Train train -enable_distributed true --flagsfile /home/train.cfg --minloglevel=0 --logtostderr=1

tryanbot commented 4 years ago

@Dr-AyanDebnath, you saw my initial message, right? I explained my command there. Anyway, I tried your command [screenshot] and it is still stuck [screenshot]: it only reserves the GPUs but does not run. I think the problem may be in the dependencies. Please help.

tlikhomanenko commented 4 years ago

Could you run the flashlight tests (go to the build directory and run make test)? There are tests for the distributed functionality; this is just to make sure they work for you.
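
For anyone following along, the check suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: `FL_BUILD` is a hypothetical variable standing for your flashlight build directory, and `ctest --output-on-failure` is the standard CTest way to see the output of failing tests, not something prescribed in this thread. The script prints each command and executes it only when `RUN=1` is set.

```shell
#!/bin/sh
# Assumption: FL_BUILD is a hypothetical variable naming the flashlight build directory.
FL_BUILD="${FL_BUILD:-./flashlight/build}"

# Print each command; execute it only when RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

run_flashlight_tests() {
  run cd "$FL_BUILD" || return 1
  # Full test suite, as suggested in the comment above.
  run make test
  # Same tests through CTest, printing the output of any test that fails.
  run ctest --output-on-failure
}

# Subshell so that the cd does not affect the calling shell.
( run_flashlight_tests )
```

Running it as `RUN=1 FL_BUILD=/path/to/flashlight/build sh script.sh` would execute the commands instead of only printing them.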

cc @jacobkahn

tryanbot commented 4 years ago

Could you please send me a link to material I can read, so I can do that by myself?

Edit: sorry for not reading your comment carefully, @tlikhomanenko. This is the result: [screenshot]

tryanbot commented 4 years ago

Update: I fixed the failing test case; the result is now [screenshot]. However, multi-GPU training is still stuck, and I can't see any distributed GPU test among the test cases. Please help, @tlikhomanenko.

tlikhomanenko commented 4 years ago

Could you check whether all tests pass in the flashlight directory? The distributed test is in flashlight, not in wav2letter.

tryanbot commented 4 years ago

Okay, all tests passed in flashlight: [screenshot]

tlikhomanenko commented 4 years ago

The cause is probably MPI itself. @jacobkahn, any ideas on this?
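
One possible way to separate an MPI launch problem from a GPU communication problem is to start a program that uses neither CUDA nor NCCL under the same launcher. The sketch below is an assumption-laden illustration, not a step taken in this thread: `NPROCS` is a hypothetical variable, and `--allow-run-as-root` is copied from the mpirun command quoted earlier. It prints the command and executes it only when `RUN=1` is set.

```shell
#!/bin/sh
# Assumption: NPROCS is a hypothetical variable for the number of processes to launch.
NPROCS="${NPROCS:-2}"

# Print each command; execute it only when RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

check_mpi_launch() {
  # hostname touches neither CUDA nor NCCL, so a hang here points at the
  # MPI launcher or its environment rather than at GPU communication.
  run mpirun --allow-run-as-root -n "$NPROCS" hostname
}

check_mpi_launch
```

If this returns immediately with one hostname per process, the launcher itself is probably fine and the hang is more likely in the GPU communication layer.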

jacobkahn commented 4 years ago

@tryanbot — can you run the AllReduceTest with mpirun in the same way you'd start training? Something like

mpirun --allow-run-as-root -n 2 ./AllReduceTest train --enable_distributed true --logtostderr=1

tryanbot commented 4 years ago

This also gets stuck: [screenshot] [screenshot]. Please advise on what to try next, @jacobkahn.

jacobkahn commented 4 years ago

@tryanbot — this seems like an issue with your setup or other dependencies. Can you build and run the tests here and see what happens? https://github.com/NVIDIA/nccl-tests
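
Building and running those tests could look roughly like the sketch below. The `all_reduce_perf` invocation follows the example in the nccl-tests README; `NGPUS` is a hypothetical variable, and the plain `make` assumes CUDA and NCCL are installed in default locations. The script prints each command and executes it only when `RUN=1` is set.

```shell
#!/bin/sh
# Assumption: NGPUS is a hypothetical variable for the number of GPUs to test.
NGPUS="${NGPUS:-2}"

# Print each command; execute it only when RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

nccl_tests_smoke() {
  run git clone https://github.com/NVIDIA/nccl-tests.git || return 1
  run cd nccl-tests || return 1
  # If CUDA or NCCL live in non-default locations, the nccl-tests README
  # describes passing CUDA_HOME=... and NCCL_HOME=... to make.
  run make || return 1
  # All-reduce across NGPUS GPUs in one process, message sizes from 8 bytes
  # to 128 MB, doubling at each step.
  run ./build/all_reduce_perf -b 8 -e 128M -f 2 -g "$NGPUS"
}

# Subshell so that the cd does not affect the calling shell.
( nccl_tests_smoke )
```

A hang in `all_reduce_perf` would reproduce the problem without flashlight or wav2letter involved at all.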

tryanbot commented 4 years ago

It gets stuck here: [screenshot]. Any idea why this happens, @jacobkahn?

tryanbot commented 4 years ago

Any update on this, @jacobkahn @tlikhomanenko? Is there any insight into why this happens and how to solve it?

tlikhomanenko commented 4 years ago

OK, it seems this is not related to flashlight or wav2letter itself. Could you search for this issue on the NVIDIA side and report it to them to find out how to debug and fix it?
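
When reporting a hang like this, it usually helps to attach NCCL's own debug output. The sketch below shows one way to collect it; it is not a confirmed fix for this issue. `TEST_BIN` is a hypothetical variable pointing at an nccl-tests binary, while `NCCL_DEBUG` and `NCCL_P2P_DISABLE` are documented NCCL environment variables and `nvidia-smi topo -m` prints the GPU interconnect topology. The script prints each command and executes it only when `RUN=1` is set.

```shell
#!/bin/sh
# Assumption: TEST_BIN is a hypothetical variable pointing at an nccl-tests binary.
TEST_BIN="${TEST_BIN:-./build/all_reduce_perf}"

# Print each command; execute it only when RUN=1 is set.
run() {
  printf '+ %s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

nccl_debug_runs() {
  # Show how the GPUs are connected to each other (PCIe, NVLink, ...).
  run nvidia-smi topo -m
  # NCCL_DEBUG=INFO makes NCCL log its initialization and transport choices.
  run env NCCL_DEBUG=INFO "$TEST_BIN" -b 8 -e 128M -f 2 -g 2
  # Diagnostic only: NCCL_P2P_DISABLE=1 turns off direct GPU-to-GPU transfers.
  # If the hang disappears, the peer-to-peer path is a likely suspect.
  run env NCCL_DEBUG=INFO NCCL_P2P_DISABLE=1 "$TEST_BIN" -b 8 -e 128M -f 2 -g 2
}

nccl_debug_runs
```

The last lines NCCL logs before the hang, together with the topology matrix, are the kind of detail an NVIDIA bug report would typically need.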