apldev3 closed this issue 5 years ago.
It seems to be an OOM issue. Are you able to finish Test.cpp for the whole dataset? Also, can you paste the full stderr output of the decoding here, rather than only the error?
I run watch nvidia-smi as I'm training so I can keep an eye on memory usage. You could try decreasing the thread count?
@lunixbochs it's funny you mention that, as that's exactly what I landed on. I decreased the nthread_decoder count from 8 to 1 and then ran watch nvidia-smi, and for some reason that keeps it alive. Someone I work with is guessing that watching nvidia-smi causes repeated small, lightweight calls to the GPU via CUDA, which keeps open some channel that otherwise closes. I could believe that, since it reaches a state under watch nvidia-smi where the process no longer appears to be doing anything other than holding a constant chunk of VRAM on the GPU (about 1 GB worth). While it's holding that amount, the decoding is still occurring in the background.
As to your request, @xuqiantong: I can look into it briefly tomorrow. I'm not sure I'll have the bandwidth to run Test.cpp on the dataset, but I'll look into getting the full stderr for you.
@xuqiantong apologies for not following up sooner, and double apologies for not having better news. I dump both stdout and stderr to the same file, and what you see above is everything that gets emitted in terms of a stack trace; here it is again. As we found in my previous comment, this doesn't occur when you drop down to 1 decoder thread, so my new guess is that something must not be thread safe. More specifically, it seems to somehow exhaust the Flashlight thread pool.
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** Aborted at 1556310352 (unix time) try "date -d @1556310352" if you are using GNU date ***
PC: @ 0x7fe1d6e2b428 gsignal
*** SIGABRT (@0x6452) received by PID 25682 (TID 0x7fe257cc9fc0) from PID 25682; stack trace: ***
@ 0x7fe23bd0b390 (unknown)
@ 0x7fe1d6e2b428 gsignal
@ 0x7fe1d6e2d02a abort
@ 0x7fe1d799084d __gnu_cxx::__verbose_terminate_handler()
@ 0x7fe1d798e6b6 (unknown)
@ 0x7fe1d798e701 std::terminate()
@ 0x7fe1d798e969 __cxa_rethrow
@ 0x47b6b6 _ZNSt6vectorISt6threadSaIS0_EE19_M_emplace_back_auxIJZN2fl10ThreadPoolC4EmRKSt8functionIFvmEEEUlvE_EEEvDpOT_
@ 0x47b96a fl::ThreadPool::ThreadPool()
@ 0x419c28 main
@ 0x7fe1d6e16830 __libc_start_main
@ 0x46bfe9 _start
@ 0x0 (unknown)
./decode_and_score.sh: line 97: 25682 Aborted (core dumped) build/Decoder --flagsfile $DECODE_CFG
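For reference, the "Resource temporarily unavailable" coming out of the fl::ThreadPool constructor frames above is the std::system_error that std::thread raises when the OS refuses to create another thread (EAGAIN). The snippet below is just a minimal standalone sketch, not wav2letter or flashlight code, that triggers the same exception by creating parked threads until the per-process limit is hit; running it under a lowered ulimit -u makes it fail almost immediately.

#include <chrono>
#include <iostream>
#include <system_error>
#include <thread>
#include <vector>

int main() {
  std::vector<std::thread> workers;
  try {
    // Keep creating parked worker threads until the OS says no.
    for (;;) {
      workers.emplace_back([] { std::this_thread::sleep_for(std::chrono::hours(1)); });
    }
  } catch (const std::system_error& e) {
    // On libstdc++ the message is "Resource temporarily unavailable",
    // the same text as in the trace above.
    std::cerr << "thread creation failed after " << workers.size()
              << " threads: " << e.what() << std::endl;
  }
  // Detach so the parked threads don't block process exit.
  for (auto& t : workers) {
    t.detach();
  }
  return 0;
}

If something like that is happening here, it would explain why dropping nthread_decoder to 1 avoids the crash: fewer threads get requested up front.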
Hi @apldev3, GPUs are only used when running the forward pass and generating emissions from the raw audio. Once we have all the emissions (the probabilities of each token for each frame), GPUs are no longer used during decoding, regardless of the decode_thread setting.
So can you please check:
1) Whether Test.cpp works on your full dataset. If it does, we know your memory is large enough to hold all the emissions. Can you also check the peak memory usage while Test.cpp is running, as well as the size of the final emission set?
2) Loading the emission set into the decoder instead of the acoustic model, and checking whether the decoder still crashes as you increase the number of decoder threads. Decoding is memory heavy, so one possible solution is to reduce the beam size and use fewer threads (see the sketch below). If you want to dig into more detail, please paste the full stderr log here instead of just the error stack trace.
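As an illustration only, a flagsfile along these lines would lower the memory pressure. The flag names here are assumptions (nthread_decoder appears earlier in this thread; beamsize is the usual wav2letter++ decoder flag), so verify them against what your build/Decoder actually accepts before using them.

# Illustrative values only -- confirm the flag names and sensible defaults for your build.
--nthread_decoder=1
--beamsize=500

Using a single decoder thread matches the workaround found above, and a smaller beam keeps fewer hypotheses alive per utterance, which is typically the dominant memory cost in beam-search decoding.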
Closing due to inactivity. Feel free to reopen if needed.
Hey guys, I'm seeing another error of a similar flavor to #234, which was resolved by pulling from master. Unfortunately, since the last time I pulled (April 3rd) there don't seem to have been any updates that would resolve the new error I'm seeing. Presently, I'm attempting to decode with a model trained with the ASG criterion on the VoxForge Russian dataset, using a KenLM language model trained on 100% of the OPUS MultiUN monolingual untokenized Russian plaintext file (that said, there is a LOT of preprocessing involved before actually throwing it over to KenLM).
When attempting to decode with this combination, I get a std::system_error of the form:
My current config file looks like:
Since I'm using the GPU backend, my best guess is that I've run out of memory on the GPU. Also, I was bumping into this issue when decoding with only 25% of my available data (as opposed to 50%), but back then I finally got it to finish. Since I can no longer seem to work around this error, that adds more evidence for a GPU OOM issue.
Has anyone else bumped into this, and does anyone have any ideas for a workaround?