Does this only happen in decode.2.log, and the others are fine? If so, I'd expect this to be just an accidental data-access problem. If it always happens, it would be due to a bug. Can you re-run only the decoding by adding the option --stage 5?
No, this happens in all decode.*.log files. I re-ran it (adding the option --stage 5), and it happened again.
Thanks. I'll check it.
I did not test exactly the same setup, but I did not observe the issue. The training may have some problems. Can you take a look at exp/.../train.log? Also, can you check that the model exists at exp/.../results/model.acc.best?
@chiayuli @sw005320
This is caused by torch.nn.DataParallel. We have to change the saving function when using DataParallel as follows (this is from another project of mine):
if args.n_gpus > 1:
    torch.save({"model": model.module.state_dict()}, args.expdir + "/checkpoint-final.pkl")
else:
    torch.save({"model": model.state_dict()}, args.expdir + "/checkpoint-final.pkl")
@bobchennan Could you fix it?
Thanks, I'll try it and report back to you.
Yes, that is caused by DataParallel. I will fix it soon.
Hi all, I modified the code in asr_pytorch.py as in #157, but another error (torch_load) occurs during acoustic model training. Is there any modification needed to the torch_load(path, obj) function? Many thanks.
=== commands ===
export CUDA_VISIBLE_DEVICES=0,2,3 ; nohup ./run.sh --ngpu 3 --stage 4 --backend pytorch --etype blstmp >> run.log&
=== log ===
Exception in main training loop: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
Traceback (most recent call last):
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
entry.extension(self)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
_restore_snapshot(model, snapshot, load_fn)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
load_fn(snapshot, model)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
model.load_state_dict(torch.load(path))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
Will finalize trainer extensions and updater before reraising the exception.
Traceback (most recent call last):
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 196, in
@chiayuli The new updates in #157 should fix it.
I think there are still some problems with PyTorch multi-GPU. I would suggest merging #155 so that we can test it as soon as possible.
@bobchennan There is still an error when loading a model trained with multi-GPU.
def remove_dataparallel(state_dict):
    from collections import OrderedDict
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module."):
            name = k[7:]
            new_state_dict[name] = v
    return new_state_dict
This should be (otherwise keys without the "module." prefix are silently dropped):
def remove_dataparallel(state_dict):
    from collections import OrderedDict
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith("module."):
            name = k[7:]
            new_state_dict[name] = v
        else:
            new_state_dict[k] = v
    return new_state_dict
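As a usage sketch (the checkpoint path and model variable are illustrative, not the actual ESPnet call site), the helper would sit between torch.load and load_state_dict:

import torch
import torch.nn as nn

# Illustrative stand-in for the E2E network reconstructed from model.conf.
model = nn.Linear(4, 2)

# Strip the "module." prefix added by DataParallel, then load the cleaned
# state_dict into the plain (non-parallel) model on CPU.
state_dict = torch.load("model.acc.best", map_location=lambda storage, loc: storage)
model.load_state_dict(remove_dataparallel(state_dict))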
I will make PR to fix it.
It is included in #173:
for k, v in state_dict.items():
    if k.startswith("module."):
        k = k[7:]
    new_state_dict[k] = v
but I agree it is better to make a separate pull request and merge it as soon as possible.
Oh sorry, I overlooked it. I will merge the fix PR.
Now fixed.
Hi all, I'm a beginner with ESPnet. I followed the instructions in Espnet/egs/wsj/asr1/run.sh, and the training of the language model and the acoustic model looks fine:
export CUDA_VISIBLE_DEVICES=0 ; ./run.sh --ngpu 1 --backend pytorch --etype blstmp
but I face a problem during decoding. The following message is from decode.2.log. Does anyone face the same problem? Any suggestions would be appreciated. Thanks.
2018-05-04 15:26:04,898 (asr_recog:97) INFO: python path = /mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:
2018-05-04 15:26:04,898 (asr_recog:102) INFO: set random seed = 1
2018-05-04 15:26:04,898 (asr_recog:105) INFO: backend = pytorch
2018-05-04 15:28:18,098 (asr_pytorch:314) INFO: reading a model config file from exp/train_si284_503-lm/results/model.conf
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: backend: pytorch
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: beam_size: 20
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: ctc_weight: 0.3
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: debugmode: 1
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: gpu: None
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: lm_weight: 1.0
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: maxlenratio: 0.0
2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: minlenratio: 0.0
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: model: exp/train_si284_503-lm/results/model.acc.best
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: model_conf: exp/train_si284_503-lm/results/model.conf
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: nbest: 1
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: ngpu: 0
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: penalty: 0.0
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: recog_feat: ark,s,cs:apply-cmvn --norm-vars=true data/train_si284/cmvn.ark scp:data/test_dev93/split32utt/2/feats.scp ark:- |
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: recog_label: data/test_dev93/data.json
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: result_label: exp/train_si284_503-lm/decode_test_dev93_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm1.0/data.2.json
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: rnnlm: exp/train_rnnlm_2layer_bs2048/rnnlm.model.best
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: seed: 1
2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: verbose: 1
2018-05-04 15:28:18,107 (asr_pytorch:321) INFO: reading model parameters from exp/train_si284_503-lm/results/model.acc.best
2018-05-04 15:28:18,108 (e2e_asr_attctc_th:170) INFO: subsample: 1 2 2 1 1 1 1
2018-05-04 15:28:18,108 (e2e_asr_attctc_th:175) INFO: Use label smoothing with unigram
2018-05-04 15:28:26,029 (e2e_asr_attctc_th:1927) INFO: BLSTM with every-layer projection for encoder
Traceback (most recent call last):
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/asr_recog.py", line 117, in <module>
main()
File "/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/asr_recog.py", line 111, in main
recog(args)
File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 327, in recog
model.load_state_dict(torch.load(args.model, map_location=cpu_loader))
File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
.format(name))
KeyError: 'unexpected key "module.predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'