Closed ML6634 closed 3 years ago
@ML6634: Your --valid flag is empty. It should be something like:
--valid=dev-clean:/home/w2luser/w2l/lists/dev-clean.lst,dev-other:/home/w2luser/w2l/lists/dev-other.lst
Thanks @abhinavkulkarni for the helpful comment, which took care of the issue!
(1) After that, I got:
ArrayFire Exception (Device out of memory:101)
On my computer, for the default
--train=/root/w2l/lists/train-clean-100.lst,/root/w2l/lists/train-clean-360.lst,/root/w2l/lists/train-other-500.lst
the training went through. Now I have replaced it with 20 telephony audios, each around 10 minutes long. I think this training data is smaller, so why does it cause
ArrayFire Exception (Device out of memory:101)?
(2) If I reduce batchsize to 1, the "out of memory" issue is gone. However, I got:
Falling back to using letters as targets for the unknown word: mhm
Falling back to using letters as targets for the unknown word: mhm
terminate called after throwing an instance of 'std::runtime_error'
  what():  Error: compute_ctc_loss, stat = label length >639 is not supported
*** Aborted at 1606106683 (unix time) try "date -d @1606106683" if you are using GNU date ***
PC: @ 0x7f346b24ce97 gsignal
*** SIGABRT (@0xd6) received by PID 214 (TID 0x7f34b0b1e380) from PID 214; stack trace: ***
    @ 0x7f34a8e32890 (unknown)
    @ 0x7f346b24ce97 gsignal
    @ 0x7f346b24e801 abort
    @ 0x7f346bc41957 (unknown)
    @ 0x7f346bc47ab6 (unknown)
    @ 0x7f346bc47af1 std::terminate()
    @ 0x7f346bc47d24 __cxa_throw
    @ 0x563f69b3681f w2l::(anonymous namespace)::throw_on_error()
    @ 0x563f69b37a16 w2l::ConnectionistTemporalClassificationCriterion::forward()
    @ 0x563f699d3d30 _ZZ4mainENKUlSt10shared_ptrIN2fl6ModuleEES_IN3w2l17SequenceCriterionEES_INS3_10W2lDatasetEES_INS0_19FirstOrderOptimizerEES9_ddblE3_clES2_S5_S7_S9_S9_ddbl
    @ 0x563f699674d8 main
    @ 0x7f346b22fb97 __libc_start_main
    @ 0x563f699cde4a _start
Aborted (core dumped)
Any comments about this bug? Is it because the duration of my audios, around 10 minutes, is too long? Thank you!
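For context on the compute_ctc_loss error above: the limit it reports is on the length of the target label sequence, not the audio alone. A rough back-of-envelope (the speaking-rate and word-length numbers below are my own assumptions, not measured from the thread's data) suggests why a 10-minute transcript blows far past the 639-label limit when letters are used as targets:

```python
# Back-of-envelope: letter-target length of a 10-minute transcript.
# Assumed rates (hypothetical, not from the thread's data):
WORDS_PER_MIN = 150    # typical conversational speaking rate
CHARS_PER_WORD = 6     # average English word length incl. trailing space
CTC_LABEL_LIMIT = 639  # limit reported by compute_ctc_loss above

minutes = 10
label_len = minutes * WORDS_PER_MIN * CHARS_PER_WORD
print(label_len)                    # 9000
print(label_len > CTC_LABEL_LIMIT)  # True
```

So even if the audio itself fit in GPU memory, a 10-minute utterance's transcript is roughly an order of magnitude over the supported label length, which points toward splitting the recordings into shorter segments.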
@ML6634: This means you are running low on GPU memory. You can monitor GPU memory usage by running the watch nvidia-smi command. This is not a bug, but simply a limitation of resources on your system.
I have:
ml@ml-Alienware-Aurora-Ryzen-Edition:~$ nvidia-smi
Mon Nov 23 01:54:01 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...   On  | 00000000:0B:00.0  On |                  N/A |
| 18%   31C    P8     1W / 250W |    381MiB / 11011MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      3806      G   /usr/lib/xorg/Xorg                 35MiB |
|    0   N/A  N/A      6870      G   /usr/lib/xorg/Xorg                260MiB |
|    0   N/A  N/A      7098      G   /usr/bin/gnome-shell               43MiB |
|    0   N/A  N/A      8585      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      8885      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A      9930      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A     12009      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A     22122      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A     28791      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A    161973      G   /usr/lib/firefox/firefox            3MiB |
|    0   N/A  N/A    171677      G   /usr/lib/firefox/firefox            3MiB |
+-----------------------------------------------------------------------------+
Any way for me to still run the training? Thank you!
@ML6634: Run the command watch nvidia-smi while the training/decoding loop is going on. The watch prefix re-runs the command every couple of seconds, so as your data is processed on the GPU you can see the progression of memory usage and verify that, at some point, your batch indeed exceeds the GPU memory.
Thanks @abhinavkulkarni for the help! I ran the training quite a few times. For most runs, GPU memory usage peaked at 9027 MiB; for some runs it peaked as high as 9357 MiB.
On my computer, for the default
--train=/root/w2l/lists/train-clean-100.lst,/root/w2l/lists/train-clean-360.lst,/root/w2l/lists/train-other-500.lst
the training went through even with
--batchsize=4
Now I am training it using 20 telephony audios instead, each around 10 minutes long. I think this training data is smaller, so why does it cause a GPU memory issue? Any possible way for me to take care of it? Thank you!
Now I am training it using 20 telephony audios instead
Did you mean a mini-batch size of 20? If so, you may want to round it to a power of 2, such as 2, 4, 16, or 32.
Any possible way for me to take care of it?
You can try splitting your audio into chunks of 15-45s as described here: https://github.com/facebookresearch/wav2letter/issues/797#issuecomment-686875994
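(Editor's note: beyond the VAD-based approach in the linked issue, here is a minimal sketch of naive fixed-length splitting, assuming plain mono WAV input. The split_wav helper and chunk naming are hypothetical, and this does not align transcripts to the chunks, which the CTC loss would still require.)

```python
import os
import wave

def split_wav(path, out_dir, chunk_s=30):
    """Split a mono WAV file into fixed-length chunks of chunk_s seconds."""
    os.makedirs(out_dir, exist_ok=True)
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = chunk_s * params.framerate
        i = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # no audio left
            out_path = os.path.join(out_dir, f"chunk_{i:04d}.wav")
            with wave.open(out_path, "wb") as dst:
                dst.setnchannels(params.nchannels)
                dst.setsampwidth(params.sampwidth)
                dst.setframerate(params.framerate)
                dst.writeframes(frames)
            out_paths.append(out_path)
            i += 1
    return out_paths
```

Fixed-length splitting can cut through the middle of a word, which is why the VAD-based segmentation linked above is preferable for real training data.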
Thanks @abhinavkulkarni for the help!
Did you mean a mini-batch size of 20? If so, you may want to round it to a power of 2, such as 2, 4, 16, or 32.
I just meant that my
--train
is a list of 20 audios, each around 10 minutes long. Which value do you suggest I round to a power of 2?
I plan to split the audios and transcripts into chunks of 15-45 seconds. What software or method would you recommend for splitting them? Thank you!
The problem is not the size of your training set; the problem is the batch size measured in audio duration. In LibriSpeech, nearly every clip is shorter than 36 seconds, so with batchsize=6 on one GPU, the total audio duration per batch was under 3.6 minutes. Now even a single sample (batchsize=1) will probably OOM, because one clip is 10 minutes long. Either use batchsize=1 (if a 10-minute audio fits into memory) or segment the original audio into chunks. You can use our tools for this: https://github.com/facebookresearch/wav2letter/tree/v0.2/tools#voice-activity-detection-with-ctc--an-n-gram-language-model
Thanks @tlikhomanenko for the helpful comments!
Closing, feel free to create another issue if needed.
I am running Resnet CTC training using user audios. I am using a very small number of audios for testing for now, and I hit a bug:
Any ideas about that? Thank you!