IBM / pytorch-large-model-support

Large Model Support in PyTorch
Apache License 2.0

LMS does not work for transducer network #8

Open XuanJiang023 opened 4 years ago

XuanJiang023 commented 4 years ago

Hi,

While running some experiments on ESPnet, an ASR framework with a PyTorch backend, GPU memory was not sufficient, so I installed LMS for PyTorch. It works well except for the transducer network. The error appears to come from LMS; the exception is as follows:

asr_train.py --config conf/tuning/transducer/train_transducer.yaml --preprocess-conf conf/specaug.yaml --ngpu 8 --backend pytorch --outdir exp/train_960_pytorch_train_transducer_specaug/results --tensorboard-dir tensorboard/train_960_pytorch_train_transducer_specaug --debugmode 1 --dict data/lang_char/train_960_unigram5000_units.txt --debugdir exp/train_960_pytorch_train_transducer_specaug --minibatches 0 --verbose 0 --resume --n-iter-processes 8 --train-json dump/train_960/deltafalse/data_unigram5000.json --valid-json dump/dev/deltafalse/data_unigram5000.json

Started at Mon Oct 19 01:32:06 UTC 2020

# /usr/local/miniconda3/lib/python3.8/site-packages/chainer/backends/cuda.py:142: UserWarning: cuDNN is not enabled. Please reinstall CuPy after you install cudnn (see https://docs-cupy.chainer.org/en/stable/install.html#install-cudnn).
  warnings.warn(
2020-10-19 01:32:07,749 (asrtrain:561) WARNING: Skip DEBUG/INFO messages
2020-10-19 01:32:09,098 (asr:481) WARNING: batch size is automatically increased (16 -> 128)
terminate called after throwing an instance of 'c10::Error'
  what(): state == State::kActive INTERNAL ASSERT FAILED at ../c10/core/LargeModelSupport.h:94, please report a bug to PyTorch. (unpin at ../c10/core/LargeModelSupport.h:94)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator > const&) + 0x6c (0x7f5f893c340c in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libc10.so)
frame #1: + 0x398b173 (0x7f5f8eecf173 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #2: + 0x6101da8 (0x7f5f91645da8 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #3: + 0x5960519 (0x7f5f90ea4519 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #4: + 0x58e2b71 (0x7f5f90e26b71 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #5: + 0x5960519 (0x7f5f90ea4519 in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch.so)
frame #6: + 0x44cc4f (0x7f5fa384cc4f in /usr/local/miniconda3/lib/python3.8/site-packages/torch/lib/libtorch_python.so)

frame #40: __libc_start_main + 0xe7 (0x7f5fb3f54b97 in /lib/x86_64-linux-gnu/libc.so.6)
# Accounting: time=6 threads=1
# Ended (code 134) at Mon Oct 19 01:32:12 UTC 2020, elapsed time 6 seconds

Python version is 3.8.3, PyTorch version is 1.4.0, and CUDA version is 10.0.
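For readers unfamiliar with the failing assert: it enforces that `unpin` is only ever called on an allocation LMS currently considers active. A rough Python analogue of that invariant (a hypothetical toy model, not the actual C++ implementation in LargeModelSupport.h):

```python
from enum import Enum, auto

class State(Enum):
    """Hypothetical analogue of the LMS allocation states."""
    INIT = auto()
    ACTIVE = auto()
    RECLAIMED = auto()

class LmsAllocation:
    """Toy model of the pin/unpin bookkeeping the assert protects."""
    def __init__(self):
        self.state = State.INIT

    def pin(self):
        # Pinning marks the allocation as resident and in use on the GPU.
        self.state = State.ACTIVE

    def unpin(self):
        # Mirrors the failing check: unpin is only legal on an ACTIVE
        # allocation; unpinning twice, or before pinning, trips the assert.
        assert self.state == State.ACTIVE, \
            "state == State::kActive INTERNAL ASSERT FAILED (unpin)"
        self.state = State.RECLAIMED
```

Under that reading, the crash suggests the transducer path drives an allocation through an unpin it has not (or no longer) paired with a pin.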
jayfurmanek commented 4 years ago

Very interesting work! I notice the line

/usr/local/miniconda3/lib/python3.8/site-packages/chainer/backends/cuda.py:142: UserWarning: cuDNN is not enabled.
Please reinstall CuPy after you install cudnn

That is perhaps something to explore. What is Chainer doing in your configuration? You may also try moving to a newer version of CUDA; CUDA 10.0 is fairly old and likely predates LMS altogether. :)
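One quick sanity check along those lines is to confirm that the installed PyTorch build actually carries the LMS patches at all. A small hedged helper (this assumes the IBM LMS builds expose `torch.cuda.set_enabled_lms`, which stock PyTorch lacks):

```python
import importlib

def lms_available():
    """Return True only if the installed PyTorch build exposes the
    LMS control API (assumed here to be torch.cuda.set_enabled_lms);
    returns False on stock PyTorch or when torch is not installed."""
    try:
        torch = importlib.import_module("torch")
    except ImportError:
        return False
    return hasattr(torch.cuda, "set_enabled_lms")
```

If this returns False, the environment is running an unpatched PyTorch and any LMS-related behavior would be unexpected.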

XuanJiang023 commented 4 years ago

@jayfurmanek The warning message is about the "chainer" backend of ESPnet, but I use the "pytorch" backend, so it does not matter. I also upgraded CUDA to 10.2 and compiled PyTorch 1.5 against it, but it still does not work. The error message is the same.
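When version combinations are in question like this, it helps to paste the exact versions the failing process sees rather than the ones installed system-wide. A minimal sketch of such a report helper (hypothetical, pure stdlib apart from the optional torch import):

```python
import importlib
import platform

def env_report():
    """Gather the version info worth pasting into a bug report:
    Python, PyTorch, and the CUDA toolkit PyTorch was built with."""
    info = {"python": platform.python_version(),
            "torch": None, "cuda": None}
    try:
        torch = importlib.import_module("torch")
        info["torch"] = torch.__version__
        info["cuda"] = torch.version.cuda  # None for CPU-only builds
    except ImportError:
        pass  # torch absent: leave torch/cuda as None
    return info
```

Running this inside the same interpreter that launches asr_train.py would confirm whether the rebuilt PyTorch 1.5 / CUDA 10.2 pair is actually the one being exercised.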