Open · XuanJiang023 opened this issue 4 years ago
Very interesting work! I notice the line
/usr/local/miniconda3/lib/python3.8/site-packages/chainer/backends/cuda.py:142: UserWarning: cuDNN is not enabled.
Please reinstall CuPy after you install cudnn
That warning is perhaps something to explore. What is chainer used for in your configuration? You may also try moving to a newer version of CUDA; CUDA 10.0 is fairly old and likely predates LMS altogether. :)
@jayfurmanek The warning message is about the "chainer" backend of espnet, but I use the "pytorch" backend, so it should not matter. I also upgraded CUDA to 10.2 and compiled PyTorch 1.5 against it, but it still does not work; the error message is the same.
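For reference, the CUDA and cuDNN versions that a given PyTorch build actually sees can be checked with the standard introspection calls (stock PyTorch API, nothing LMS-specific; the values in the comments are only what one would expect for this setup):

import torch

# Version of the PyTorch build itself (e.g. 1.5.0).
print("torch    :", torch.__version__)
# CUDA toolkit version the build was compiled against (e.g. 10.2).
print("CUDA     :", torch.version.cuda)
# cuDNN version linked into the build, and whether it is usable at runtime.
print("cuDNN    :", torch.backends.cudnn.version())
print("cuDNN ok :", torch.backends.cudnn.is_available())
# Name of the first visible GPU.
print("GPU      :", torch.cuda.get_device_name(0))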
Hi,
While running some experiments with espnet, an ASR framework with a PyTorch backend, GPU memory was not sufficient, so I installed LMS for PyTorch. It works well except with the transducer network, where the error seems to be related to LMS. The exception is as follows:
asr_train.py --config conf/tuning/transducer/train_transducer.yaml --preprocess-conf conf/specaug.yaml --ngpu 8 --backend pytorch --outdir exp/train_960_pytorch_train_transducer_specaug/results --tensorboard-dir tensorboard/train_960_pytorch_train_transducer_specaug --debugmode 1 --dict data/lang_char/train_960_unigram5000_units.txt --debugdir exp/train_960_pytorch_train_transducer_specaug --minibatches 0 --verbose 0 --resume --n-iter-processes 8 --train-json dump/train_960/deltafalse/data_unigram5000.json --valid-json dump/dev/deltafalse/data_unigram5000.json
Started at Mon Oct 19 01:32:06 UTC 2020
# /usr/local/miniconda3/lib/python3.8/site-packages/chainer/backends/cuda.py:142: UserWarning: cuDNN is not enabled.
Please reinstall CuPy after you install cudnn
(see https://docs-cupy.chainer.org/en/stable/install.html#install-cudnn).
  warnings.warn(
2020-10-19 01:32:07,749 (asr_train:561) WARNING: Skip DEBUG/INFO messages
2020-10-19 01:32:09,098 (asr:481) WARNING: batch size is automatically increased (16 -> 128)
terminate called after throwing an instance of 'c10::Error'
  what():  state == State::kActive INTERNAL ASSERT FAILED at ../c10/core/LargeModelSupport.h:94, please report a bug to PyTorch. (unpin at ../c10/core/LargeModelSupport.h:94)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6c (0x7f5f893c340c in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x398b173 (0x7f5f8eecf173 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #2: + 0x6101da8 (0x7f5f91645da8 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #3: + 0x5960519 (0x7f5f90ea4519 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #4: + 0x58e2b71 (0x7f5f90e26b71 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #5: + 0x5960519 (0x7f5f90ea4519 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #6: + 0x44cc4f (0x7f5fa384cc4f in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)
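For context on the LMS side: in an LMS-patched PyTorch build, swapping is typically switched on near the top of the training script, roughly as sketched below. This is only a hedged sketch; set_enabled_lms and the tuning knobs are provided by the LMS patch rather than by stock PyTorch, and their exact names may differ between builds.

import torch

# These hooks only exist in an LMS-patched PyTorch build, so guard the calls.
if hasattr(torch.cuda, "set_enabled_lms"):
    # Enable swapping of inactive GPU tensors to host memory.
    torch.cuda.set_enabled_lms(True)
    # Optional tuning knobs exposed by the patch (names assumed from the LMS fork):
    # torch.cuda.set_limit_lms(0)            # allocation limit before swapping starts
    # torch.cuda.set_size_lms(1 * 1024**2)   # minimum tensor size eligible for swapping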