tryanbot opened this issue 4 years ago
What happens in your log while it hangs? Your batch size seems very large; could you try 4, just as a test?
I am able to start training with only 1 GPU, so I think the batch size is not the problem. I tried batch_size 4 anyway, and it still hangs. The log from the multi-GPU run is empty, as if nothing ran at all, while the log with only 1 GPU is:
It seems that multi-GPU training never actually starts; it only reserves the computing resources. Please help.
Additional information that may (or may not) help: this is the message printed when I interrupt the process after it has been hanging for a while:
My suggestion, @tryanbot, would be to try increasing the iteration count (I saw that you already tried reducing the batch size to 4).
I tried 1 million iterations before. The iteration count is probably not the problem, because the script runs fine on 1 GPU regardless of batch size and iteration count.
Are you using a similar command to run on GPU?
mpirun --allow-run-as-root -n 4 /root/wav2letter/build/Train train -enable_distributed true --flagsfile /home/train.cfg --minloglevel=0 --logtostderr=1
@Dr-AyanDebnath you saw my initial message, right? I explained my command there. Anyway, I tried your command and it is still stuck; it only reserves the GPUs but never runs. I think the problem may be in the dependencies. Please help.
Could you run just the flashlight tests (go to the build directory and run make test)? There are tests for the distributed components; please run them just to be sure they are working for you.
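A minimal sketch of what that looks like, assuming flashlight was built in a directory named build (adjust the path to your checkout; the "Distributed" filter pattern is a guess and can be dropped to run everything):
cd /path/to/flashlight/build   # path is an assumption
make test                      # runs the full suite through CTest
ctest -R Distributed --output-on-failure   # optionally run only distributed-related tests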
cc @jacobkahn
Could you please send me a link to some material to read so I can do that myself?
Edit: sorry for not reading your comment carefully, @tlikhomanenko. This is the result:
Update: I fixed the failing test case, and this is the result now. However, the multi-GPU training process is still stuck. I cannot see any distributed GPU test among the test cases. Please help, @tlikhomanenko.
Could you check whether all tests pass in the flashlight directory? The distributed test is in flashlight, not in wav2letter.
Okay, all tests passed in flashlight.
The cause is probably in MPI itself. @jacobkahn, any idea on this?
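A quick way to sanity-check the MPI launcher on its own, independent of flashlight and wav2letter (assuming OpenMPI, since --allow-run-as-root is an OpenMPI flag):
mpirun --allow-run-as-root -n 4 hostname   # each of the 4 ranks should print the hostname and exit immediately
# if even this hangs, the problem is in the MPI installation rather than in the training code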
@tryanbot, can you run the AllReduceTest with mpirun in the same way you'd start training? Something like:
mpirun --allow-run-as-root -n 2 ./AllReduceTest train --enable_distributed true --logtostderr=1
That is also stuck. Please help with further steps, @jacobkahn.
@tryanbot — this seems like an issue with your setup or other dependencies. Can you build and run the tests here and see what happens? https://github.com/NVIDIA/nccl-tests
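A minimal sketch of building and running them on a single multi-GPU node, roughly following the nccl-tests README (the MPI_HOME path and the GPU count in -g are assumptions; adjust them for your machine):
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests
make MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi   # MPI_HOME path is an assumption
./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4          # -g 4 assumes 4 GPUs on this node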
It gets stuck here as well. Any idea why this happens, @jacobkahn?
Any update on this, @jacobkahn @tlikhomanenko? Is there any insight into why this happens and how to solve it?
OK, it seems this is not related to flashlight or wav2letter itself. Could you try searching for this issue on the NVIDIA side and report back how they suggest debugging/fixing it?
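Before reporting it, it may also help to rerun the hanging test with NCCL's own logging turned on; these are standard NCCL environment variables, and this is a debugging sketch rather than a known fix:
NCCL_DEBUG=INFO ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 4
# NCCL_DEBUG=INFO prints which transport (P2P, SHM, network) each rank selects before the hang
# setting NCCL_P2P_DISABLE=1 or NCCL_IB_DISABLE=1 one at a time can rule out individual transports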
Question
Hi, could someone help me figure out why my training is stuck when using multiple GPUs?
Additional Context
The run is stuck; no training iteration completes (I am using 1 iteration only).
Here is the .cfg:
--datadir=/home/user/data/audio/
--rundir=/home/user/data/audio/
--archdir=/home/user/dev/wav2letter/tutorials/1-librispeech_clean/
--train=lists/train-clean-100.lst
--valid=lists/dev-clean.lst
--input=flac
--arch=network.arch
--tokens=/home/user/data/audio/am/tokens.txt
--lexicon=/home/user/data/audio/am/lexicon.txt
--criterion=ctc
--lr=0.1
--maxgradnorm=1.0
--replabel=1
--surround=|
--onorm=target
--sqnorm=true
--mfsc=true
--filterbanks=40
--nthread=14
--batchsize=32
--runname=librispeech_clean_trainlogs
--iter=1
--logtostderr=1
--minloglevel=0
--enable_distributed=True
--reportiters=1
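A multi-GPU run with this flagsfile is launched with something along these lines (the paths below are placeholders for illustration, not the exact command I used):
mpirun --allow-run-as-root -n 4 /path/to/wav2letter/build/Train train --flagsfile=/path/to/train.cfg --minloglevel=0 --logtostderr=1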