espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

Unexpected KeyError during decoding #154

Closed chiayuli closed 6 years ago

chiayuli commented 6 years ago

Hi all, I'm a beginner with ESPnet. I followed the instructions in ESpnet/egs/wsj/asr1/run.sh, and the training of the language model and the acoustic model looks fine:

export CUDA_VISIBLE_DEVICES=0 ; ./run.sh --ngpu 1 --backend pytorch --etype blstmp

However, I ran into a problem during decoding. The following message is from decode.2.log. Has anyone faced the same problem? Any suggestions would be appreciated. Thanks.

    2018-05-04 15:26:04,898 (asr_recog:97) INFO: python path = /mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/lm/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/asr/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/nets/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/utils/:/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/:
    2018-05-04 15:26:04,898 (asr_recog:102) INFO: set random seed = 1
    2018-05-04 15:26:04,898 (asr_recog:105) INFO: backend = pytorch
    2018-05-04 15:28:18,098 (asr_pytorch:314) INFO: reading a model config file fromexp/train_si284_503-lm/results/model.conf
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: backend: pytorch
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: beam_size: 20
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: ctc_weight: 0.3
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: debugmode: 1
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: gpu: None
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: lm_weight: 1.0
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: maxlenratio: 0.0
    2018-05-04 15:28:18,106 (asr_pytorch:318) INFO: ARGS: minlenratio: 0.0
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: model: exp/train_si284_503-lm/results/model.acc.best
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: model_conf: exp/train_si284_503-lm/results/model.conf
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: nbest: 1
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: ngpu: 0
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: penalty: 0.0
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: recog_feat: ark,s,cs:apply-cmvn --norm-vars=true data/train_si284/cmvn.ark scp:data/test_dev93/split32utt/2/feats.scp ark:- |
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: recog_label: data/test_dev93/data.json
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: result_label: exp/train_si284_503-lm/decode_test_dev93_beam20_eacc.best_p0.0_len0.0-0.0_ctcw0.3_rnnlm1.0/data.2.json
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: rnnlm: exp/train_rnnlm_2layer_bs2048/rnnlm.model.best
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: seed: 1
    2018-05-04 15:28:18,107 (asr_pytorch:318) INFO: ARGS: verbose: 1
    2018-05-04 15:28:18,107 (asr_pytorch:321) INFO: reading model parameters fromexp/train_si284_503-lm/results/model.acc.best
    2018-05-04 15:28:18,108 (e2e_asr_attctc_th:170) INFO: subsample: 1 2 2 1 1 1 1
    2018-05-04 15:28:18,108 (e2e_asr_attctc_th:175) INFO: Use label smoothing with unigram
    2018-05-04 15:28:26,029 (e2e_asr_attctc_th:1927) INFO: BLSTM with every-layer projection for encoder
    Traceback (most recent call last):
      File "/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/asr_recog.py", line 117, in <module>
        main()
      File "/mount/arbeitsdaten/asr/licu/Espnet/egs/wsj/asr1/../../../src/bin/asr_recog.py", line 111, in main
        recog(args)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 327, in recog
        model.load_state_dict(torch.load(args.model, map_location=cpu_loader))
      File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
        .format(name))
    KeyError: 'unexpected key "module.predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'

sw005320 commented 6 years ago

Does this happen only in decode.2.log, while the others are fine? If so, I expect it is just an accidental data access problem. If it always happens, it is probably due to a bug. Can you re-run only the decoding by adding the option --stage 5?

chiayuli commented 6 years ago

No, this happens in all decode.*.log files. I re-ran it with the option --stage 5, and it happened again.

sw005320 commented 6 years ago

Thanks. I'll check it.

sw005320 commented 6 years ago

I did not test exactly the same setup, but I did not observe the issue. The training may have had some problems. Can you take a look at exp/.../train.log? Also, can you check that the model exists at exp/.../results/model.acc.best?

kan-bayashi commented 6 years ago

@chiayuli @sw005320 This is caused by torch.nn.DataParallel. When DataParallel is used, the saving function has to be changed as follows (this is taken from the code of another project of mine):

    if args.n_gpus > 1:
        torch.save({"model": model.module.state_dict()}, args.expdir + "/checkpoint-final.pkl")
    else:
        torch.save({"model": model.state_dict()}, args.expdir + "/checkpoint-final.pkl")

@bobchennan Could you fix it?
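To illustrate why the saving function matters, here is a minimal sketch that mimics how a wrapper such as torch.nn.DataParallel changes parameter names: the wrapped model is stored under an attribute called `module`, so every key in the saved state dict gains a `module.` prefix. Plain integers stand in for tensors, and `wrap_keys` and `diff_keys` are hypothetical helpers written for this illustration, not ESPnet or PyTorch functions.

```python
from collections import OrderedDict


def wrap_keys(state_dict, prefix="module."):
    """Mimic the key names that a DataParallel-wrapped model would save."""
    return OrderedDict((prefix + k, v) for k, v in state_dict.items())


def diff_keys(checkpoint, expected_keys):
    """Return (unexpected, missing) key lists, similar to a strict load check."""
    expected = set(expected_keys)
    unexpected = [k for k in checkpoint if k not in expected]
    missing = [k for k in expected_keys if k not in checkpoint]
    return unexpected, missing


# Placeholder values stand in for tensors; only the names matter here.
plain = OrderedDict([
    ("predictor.enc.enc1.bilstm0.weight_ih_l0", 0),
    ("predictor.enc.enc1.bilstm0.weight_hh_l0", 1),
])
saved_from_wrapper = wrap_keys(plain)

# Loading the wrapped checkpoint into an unwrapped model: no name matches.
unexpected, missing = diff_keys(saved_from_wrapper, list(plain))
print(unexpected[0])  # module.predictor.enc.enc1.bilstm0.weight_ih_l0
```

The first unexpected key has the same shape as the one in the decoding traceback, which is why saving `model.module.state_dict()` in the multi-GPU case avoids the mismatch.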

chiayuli commented 6 years ago

Thanks, I'll try it and report back.

bobchennan commented 6 years ago

Yes, that is caused by DataParallel. I will fix it soon.

chiayuli commented 6 years ago

Hi all, I modified the code in asr_pytorch.py as in #157, but a different error now occurs in torch_load while training the acoustic model. Does the torch_load(path, obj) function also need to be modified? Many thanks.

=== commands ===

    export CUDA_VISIBLE_DEVICES=0,2,3 ; nohup ./run.sh --ngpu 3 --stage 4 --backend pytorch --etype blstmp >> run.log&

=== log ===

    Exception in main training loop: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
    Traceback (most recent call last):
      File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
        entry.extension(self)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
        _restore_snapshot(model, snapshot, load_fn)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
        load_fn(snapshot, model)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
        model.load_state_dict(torch.load(path))
      File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
        .format(name))
    Will finalize trainer extensions and updater before reraising the exception.
    Traceback (most recent call last):
      File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 196, in <module>
        main()
      File "/mount/arbeitsdaten/asr/licu/Espnet/egs/chime5/asr1/../../../src/bin/asr_train.py", line 190, in main
        train(args)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 308, in train
        trainer.run()
      File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 320, in run
        six.reraise(*sys.exc_info())
      File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/chainer/training/trainer.py", line 309, in run
        entry.extension(self)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 110, in restore_snapshot
        _restore_snapshot(model, snapshot, load_fn)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_utils.py", line 116, in _restore_snapshot
        load_fn(snapshot, model)
      File "/mount/arbeitsdaten/asr/licu/Espnet/src/asr/asr_pytorch.py", line 270, in torch_load
        model.load_state_dict(torch.load(path))
      File "/mount/arbeitsdaten40/projekte/asr/licu/Espnet/tools/venv/lib/python2.7/site-packages/torch/nn/modules/module.py", line 522, in load_state_dict
        .format(name))
    KeyError: 'unexpected key "predictor.enc.enc1.bilstm0.weight_ih_l0" in state_dict'
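One plausible reading of this second traceback, stated here as an assumption and not confirmed in the thread, is the mirror image of the decoding error: the snapshot holds un-prefixed names while the model being restored with --ngpu 3 is wrapped and expects `module.`-prefixed names. The sketch below renames checkpoint keys in whichever direction the target needs. `adapt_keys` is a hypothetical helper, not the torch_load function from asr_pytorch.py, and plain integers stand in for tensors.

```python
from collections import OrderedDict


def adapt_keys(state_dict, expected_keys, prefix="module."):
    """Rename checkpoint keys so that they match the names the target expects."""
    expected = set(expected_keys)
    adapted = OrderedDict()
    for k, v in state_dict.items():
        if k in expected:
            adapted[k] = v  # name already matches
        elif prefix + k in expected:
            adapted[prefix + k] = v  # target is wrapped, checkpoint is not
        elif k.startswith(prefix) and k[len(prefix):] in expected:
            adapted[k[len(prefix):]] = v  # checkpoint is wrapped, target is not
        else:
            raise KeyError("unexpected key {!r} in state_dict".format(k))
    return adapted


# Snapshot saved without the prefix, target names taken from a wrapped model.
snapshot = OrderedDict([("predictor.enc.enc1.bilstm0.weight_ih_l0", 0)])
wrapped_names = ["module.predictor.enc.enc1.bilstm0.weight_ih_l0"]

adapted = adapt_keys(snapshot, wrapped_names)
print(list(adapted))  # ['module.predictor.enc.enc1.bilstm0.weight_ih_l0']
```

In a real training script the expected names would come from the keys of the target model's own state dict. A simpler alternative is to always load into the unwrapped model and wrap it afterwards.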

bobchennan commented 6 years ago

@chiayuli The new updates in #157 should fix it.

I still think there are some problems with PyTorch multi-GPU, though. I would suggest merging with #155 so that we can test it as soon as possible.

kan-bayashi commented 6 years ago

@bobchennan There is still an error when loading a model trained with multiple GPUs. The current function drops every key that does not start with "module.":

    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
        return new_state_dict

This should be

    def remove_dataparallel(state_dict):
        from collections import OrderedDict
        new_state_dict = OrderedDict()
        for k, v in state_dict.items():
            if k.startswith("module."):
                name = k[7:]
                new_state_dict[name] = v
            else:
                new_state_dict[k] = v
        return new_state_dict

I will make a PR to fix it.

bobchennan commented 6 years ago

It is included in #173 :

    for k, v in state_dict.items():
        if k.startswith("module."):
            k = k[7:]
        new_state_dict[k] = v

but I agree it is better to make a separate pull request and merge it as soon as possible.
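To make the difference between the two versions of the function concrete, here is a small self-contained sketch that runs both on toy state dicts. The function names `strip_prefix_buggy` and `strip_prefix_fixed` and the toy keys are made up for this illustration, and integers stand in for tensors; only the loop bodies follow the snippets quoted in this thread.

```python
from collections import OrderedDict


def strip_prefix_buggy(state_dict, prefix="module."):
    """Version without the else branch: keys lacking the prefix are dropped."""
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith(prefix):
            new_state_dict[k[len(prefix):]] = v
    return new_state_dict


def strip_prefix_fixed(state_dict, prefix="module."):
    """Corrected version: strip the prefix when present, keep every key."""
    new_state_dict = OrderedDict()
    for k, v in state_dict.items():
        if k.startswith(prefix):
            k = k[len(prefix):]
        new_state_dict[k] = v
    return new_state_dict


with_prefix = OrderedDict([("module.enc.weight", 1), ("module.dec.weight", 2)])
without_prefix = OrderedDict([("enc.weight", 1), ("dec.weight", 2)])

print(list(strip_prefix_fixed(with_prefix)))     # ['enc.weight', 'dec.weight']
print(list(strip_prefix_fixed(without_prefix)))  # ['enc.weight', 'dec.weight']
print(list(strip_prefix_buggy(without_prefix)))  # []
```

The corrected version returns the same names for both inputs, so the loading code does not need to know how the checkpoint was saved.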

kan-bayashi commented 6 years ago

Oh sorry, I overlooked it. I will merge the PR that fixes it.

kan-bayashi commented 6 years ago

Now fixed.