apldev3 closed this issue 5 years ago.
It seems to be an OOM issue. Are you able to finish Test.cpp for the whole dataset? Also, can you paste the full stderr output of the decoding here, rather than only the error?
I run watch nvidia-smi as I'm training so I can keep an eye on memory usage. You could try decreasing the thread count?
@lunixbochs it's funny you mention that, as that's exactly what I landed on. I decreased the nthread_decoder count from 8 to 1 and then ran watch nvidia-smi, and for some reason that keeps it alive. Someone I work with is guessing that watching nvidia-smi causes repeated small, lightweight calls to the GPU via CUDA, which keeps open some channel that otherwise closes. I could believe that, since it reaches a state under watch nvidia-smi where the process no longer appears to be doing anything other than holding a constant chunk of VRAM on the GPU (about 1 GB worth). While it's holding that amount, the decoding is still occurring in the background.
As to your request, @xuqiantong: I can look into it briefly tomorrow. I'm not sure I'll have the bandwidth to run Test.cpp on the dataset, but I'll look into getting the full stderr for you.
@xuqiantong apologies for not following up sooner, and double apologies for not having better news. I dump both stdout and stderr to the same file, and what you see above is everything that gets emitted in terms of a stack trace; here it is again. As we found in my previous comment, this doesn't occur when you drop down to 1 decoder thread, so my new guess is that something must not be thread safe. More specifically, it seems to somehow exhaust the Flashlight thread pool.
terminate called after throwing an instance of 'std::system_error'
what(): Resource temporarily unavailable
*** Aborted at 1556310352 (unix time) try "date -d @1556310352" if you are using GNU date ***
PC: @ 0x7fe1d6e2b428 gsignal
*** SIGABRT (@0x6452) received by PID 25682 (TID 0x7fe257cc9fc0) from PID 25682; stack trace: ***
@ 0x7fe23bd0b390 (unknown)
@ 0x7fe1d6e2b428 gsignal
@ 0x7fe1d6e2d02a abort
@ 0x7fe1d799084d __gnu_cxx::__verbose_terminate_handler()
@ 0x7fe1d798e6b6 (unknown)
@ 0x7fe1d798e701 std::terminate()
@ 0x7fe1d798e969 __cxa_rethrow
@ 0x47b6b6 _ZNSt6vectorISt6threadSaIS0_EE19_M_emplace_back_auxIJZN2fl10ThreadPoolC4EmRKSt8functionIFvmEEEUlvE_EEEvDpOT_
@ 0x47b96a fl::ThreadPool::ThreadPool()
@ 0x419c28 main
@ 0x7fe1d6e16830 __libc_start_main
@ 0x46bfe9 _start
@ 0x0 (unknown)
./decode_and_score.sh: line 97: 25682 Aborted (core dumped) build/Decoder --flagsfile $DECODE_CFG
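For reference, the "Resource temporarily unavailable" coming out of the fl::ThreadPool constructor frames above is the std::system_error that std::thread raises when the OS refuses to create another thread (EAGAIN). The snippet below is just a minimal standalone sketch, not wav2letter or flashlight code, that triggers the same exception by creating parked threads until the per-process limit is hit; running it under a lowered ulimit -u makes it fail almost immediately.

#include <chrono>
#include <iostream>
#include <system_error>
#include <thread>
#include <vector>

int main() {
  std::vector<std::thread> workers;
  try {
    // Keep creating parked worker threads until the OS says no.
    for (;;) {
      workers.emplace_back([] { std::this_thread::sleep_for(std::chrono::hours(1)); });
    }
  } catch (const std::system_error& e) {
    // On libstdc++ the message is "Resource temporarily unavailable",
    // the same text as in the trace above.
    std::cerr << "thread creation failed after " << workers.size()
              << " threads: " << e.what() << std::endl;
  }
  // Detach so the parked threads don't block process exit.
  for (auto& t : workers) {
    t.detach();
  }
  return 0;
}

If something like that is happening here, it would explain why dropping nthread_decoder to 1 avoids the crash: fewer threads get requested up front.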
Hi @apldev3, GPUs are only used when running the forward pass and generating emissions from the raw audio. Once we have all the emissions (the probabilities of each token for each frame), GPUs are no longer used during decoding, regardless of the decode_thread setting.
So can you please check:
1) Whether Test.cpp works on your full dataset. If it does, we know your memory is large enough to hold all the emissions. Can you also check the peak memory usage while Test.cpp is running, as well as the size of the final emission set?
2) Loading the emission set into the decoder instead of the acoustic model, and checking whether the decoder still crashes as you increase the number of decoder threads. Decoding is memory heavy, so one possible solution is to reduce the beam size and use fewer threads (see the sketch below). If you want to dig into more detail, please paste the full stderr log here instead of just the error stack trace.
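As an illustration only, a flagsfile along these lines would lower the memory pressure. The flag names here are assumptions (nthread_decoder appears earlier in this thread; beamsize is the usual wav2letter++ decoder flag), so verify them against what your build/Decoder actually accepts before using them.

# Illustrative values only -- confirm the flag names and sensible defaults for your build.
--nthread_decoder=1
--beamsize=500

Using a single decoder thread matches the workaround found above, and a smaller beam keeps fewer hypotheses alive per utterance, which is typically the dominant memory cost in beam-search decoding.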
Closing due to inactivity. Feel free to reopen if needed.
Hey guys, I'm seeing another error of a similar flavor to #234, which was resolved by pulling from master. Unfortunately, since the last time I pulled (April 3rd) there don't seem to have been any updates that would resolve the new error I'm seeing. Presently, I'm attempting to decode with a model trained with the ASG criterion on the VoxForge Russian dataset, using a KenLM language model trained on 100% of the OPUS MultiUN monolingual untokenized Russian plaintext file (that said, there is a LOT of preprocessing involved before actually throwing it over to KenLM).
When attempting to decode with this combination, I get a std::system_error of the form:
My current config file looks like:
Since I'm using the GPU backend, my best guess is that I've run out of memory on the GPU. Also, I was bumping into this issue when decoding with only 25% of my available data (as opposed to 50%), but back then I finally got it to finish. Since I can no longer seem to work around this error, that adds more evidence for a GPU OOM issue.
Has anyone else bumped into this, and does anyone have any ideas for a workaround?