espnet / espnet

End-to-End Speech Processing Toolkit
https://espnet.github.io/espnet/
Apache License 2.0

espnet2 ASR recipe #1490

Closed sw005320 closed 4 years ago

sw005320 commented 4 years ago

We're thinking of converting the espnet ASR recipes to new espnet (espnet2) ASR recipes (https://github.com/espnet/espnet/tree/v.0.7.0/egs2). The following is the current assignment. I have not finished assigning some recipes, so if you'd like to volunteer for one, please let me know!

@ftshijt, @Emrys365, @sas91, @YosukeHiguchi, @simpleoier, thanks a lot for helping with this! This is a tentative assignment; please let me know if you have any requests about it. Also, if you run into problems or have comments on the new design, etc., you can use this issue.

b-flo commented 4 years ago

Hi,

If you want I can take chime5, how2, vivos or/and yesno recipes!

sw005320 commented 4 years ago

Hi,

If you want I can take chime5, how2, vivos or/and yesno recipes!

That is very helpful! Thanks!

sw005320 commented 4 years ago

Added @ruizhilijhu's assignment.

sw005320 commented 4 years ago

@ftshijt, @Emrys365, @sas91, @YosukeHiguchi, @simpleoier, @ruizhilijhu, what is your progress? You may want to refer to #1497.

kamo-naoyuki commented 4 years ago

Note that I focused only on creating a better design, and I didn't have much time to check whether it reproduces the previous performance.

Please don't trust all of the code, and it would also be very helpful to check the training procedures. @sw005320 Please add voxforge and wsj; I just put up the recipes. I mainly worked on the RNN architecture, so I haven't seen transformer results yet.

sw005320 commented 4 years ago

Note that I focused only on creating a better design, and I didn't have much time to check whether it reproduces the previous performance.

Could you let me know where the results could possibly change? I know that refactoring changes the order of initialization, so we cannot reproduce exactly the same results; I have already given up on bit-exact reproducibility. But if there are potential algorithmic changes, we need to be careful. Could you list such items, if any?

kamo-naoyuki commented 4 years ago

I think you don't need to worry about the new Trainer itself. Even if there are some bugs, I can follow up on them.

1. There are possibilities under espnet2/asr/encoder and decoder. I first tried to merge the E2E classes of RNN and Transformer into one class in espnet2. They have different interfaces, so I needed to cut code out of them and unify it under the same interface. I'm not the original committer of either, so I may have made some mistakes. Actually, there is little new model-structure code in espnet2 and almost all of it is taken from espnet1, so if you have already read the espnet1 parts, it's O.K.

2. I implemented a BatchSampler-based batching system for PyTorch's DataLoader. I intended it to behave the same as espnet1's batchify, but there could be differences because I implemented it from scratch. Note that I haven't implemented the bin or frame modes yet; actually, I don't understand what they do.
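A toy sketch of such length-sorted batching (illustrative names; not the espnet2 implementation) could look like this — any iterable of index lists can be passed to DataLoader as `batch_sampler=`:

```python
class SortedBatchSampler:
    """Minimal illustration of length-sorted batching, roughly what
    espnet1's "seq" batchify mode does: sort utterances by length so
    each mini-batch contains utterances of similar length, reducing
    padding waste. NOT the espnet2 implementation."""

    def __init__(self, lengths, batch_size):
        # lengths: per-utterance lengths, e.g. number of feature frames
        order = sorted(range(len(lengths)), key=lambda i: lengths[i])
        self.batches = [
            order[i:i + batch_size] for i in range(0, len(order), batch_size)
        ]

    def __iter__(self):
        # DataLoader's batch_sampler argument only requires an
        # iterable that yields lists of dataset indices
        return iter(self.batches)

    def __len__(self):
        return len(self.batches)
```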

3. ESPnet2 uses on-the-fly text processing. This is one of the special features of espnet2, and it is also potentially dangerous because we can't see the actual status of the tokens without a debugger. I read your tokenization scripts and unified them. I'm not sure whether I made some mistakes there. I also dropped some features from the original implementation: nchar tokenization and phn mode (as for phn, I couldn't understand what it does). The original run.sh may, and is permitted to, perform some task-dependent text processing, and I couldn't follow it all. On-the-fly text processing forces a different flow in many places, so we need to be especially careful about it.

4. There is no eps-decay scheduling for Adadelta. We can use an lr-scheduler instead.

5. The feature extractor is different: Kaldi fbank+pitch -> PyTorch STFT + librosa fbank (I believe this is no problem). The audio data is also normalized to [-1, 1] in espnet2, though Kaldi never does this.

The CMVN stats are calculated using the collect-stats mode instead of a Kaldi command. I checked that it gives the same values, so I think it's O.K. (I have read the original Kaldi code in the past.)
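The accumulation in collect-stats mode boils down to per-dimension sums, squared sums, and a frame count; a simplified sketch of the arithmetic (not the espnet2 code) is:

```python
import numpy as np

def collect_cmvn_stats(feature_iter):
    """Illustration of collect-stats style global CMVN: accumulate
    per-dimension sum, squared sum, and frame count over all
    utterances, then derive the global mean and standard deviation
    used for cepstral mean/variance normalization."""
    total = total_sq = None
    count = 0
    for feats in feature_iter:  # feats: (frames, dim) array per utterance
        if total is None:
            total = np.zeros(feats.shape[1])
            total_sq = np.zeros(feats.shape[1])
        total += feats.sum(axis=0)
        total_sq += (feats ** 2).sum(axis=0)
        count += feats.shape[0]
    mean = total / count
    var = total_sq / count - mean ** 2
    # floor the variance to avoid division by ~0 during normalization
    return mean, np.sqrt(np.maximum(var, 1e-20))
```

Because only the three accumulators are kept, this matches a one-pass Kaldi-style `compute-cmvn-stats` computation over the concatenated features.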

sw005320 commented 4 years ago

Thanks! Then all of them seem minor, and in theory we should not see significant performance changes.

AdolfVonKleist commented 4 years ago

If we are interested in sharing/submitting recipes for new corpora, should we focus on v2 from the get-go?

simpleoier commented 4 years ago

Can anyone tell me how to use external text resources to train the language model, e.g. in the Librispeech ASR recipe? It seems that only text with utterance ids can be used in espnet2. If I use text without utterance ids, the first word gets cut off here.

kamo-naoyuki commented 4 years ago

@simpleoier

Can anyone tell me how to use external text resources to train the language model, e.g. in the Librispeech ASR recipe? It seems that only text with utterance ids can be used in espnet2. If I use text without utterance ids, the first word gets cut off here.

I was waiting for someone to ask about this. You can use such text by simply giving a unique id to each line yourself. This should be done in local/data.sh, e.g. awk '{ print NR, $0 }' < text. Here is the wsj example:

https://github.com/espnet/espnet/blob/e0fd073a70bcded6a0e6a3587630410a994ccdb8/egs2/wsj/asr1/local/data.sh#L39-L42
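For illustration, the same id-prepending can be written in Python (the function name and prefix are hypothetical, not from the recipe):

```python
def add_utt_ids(lines, prefix="text_"):
    """Prepend a unique utterance id to each line of an external LM
    text so it fits the Kaldi-style "<utt-id> <text>" format that
    espnet2's data loading expects. Hypothetical helper, equivalent
    in spirit to the awk one-liner used in local/data.sh."""
    return [f"{prefix}{n:09d} {line}" for n, line in enumerate(lines)]
```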

sw005320 commented 4 years ago

If we are interested in sharing/submitting recipes for new corpora, should we focus on v2 from the get-go?

My suggestion is to go with v1 for now. The transition from v1 to v2 would not be very difficult.

simpleoier commented 4 years ago

Can I ask why the lr_scheduler noamlr only supports PyTorch versions >= 1.1.0 now? Espnet1 also supports 1.0.1, doesn't it?

kamo-naoyuki commented 4 years ago

The Noam lr scheduler is implemented using PyTorch's batch-step scheduler interface, which was introduced in 1.1.
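For reference, the Noam multiplier itself is plain arithmetic; a sketch (illustrative defaults, not espnet2's actual values) that could be wrapped in torch.optim.lr_scheduler.LambdaLR and stepped once per batch:

```python
def noam_lr(step, model_size=256, warmup=25000, factor=1.0):
    """Noam learning-rate multiplier from "Attention Is All You Need":
    linear warmup for `warmup` steps, then inverse-sqrt decay.
    With torch >= 1.1 this can be passed to
    torch.optim.lr_scheduler.LambdaLR and .step()-ed every batch."""
    step = max(step, 1)  # avoid 0 ** -0.5 on the very first call
    return factor * model_size ** -0.5 * min(step ** -0.5,
                                             step * warmup ** -1.5)
```

Per-batch stepping is exactly the part that relies on the 1.1 scheduler behaviour, which is why older PyTorch versions are not supported.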

kamo-naoyuki commented 4 years ago

Has anyone tested the transformer with espnet2? I found the results are much worse in the voxforge recipe and differ from espnet1, so something is wrong. I couldn't find the reason; I'd like someone to investigate it.

sw005320 commented 4 years ago

Has anyone tested the transformer with espnet2? I found the results are much worse in the voxforge recipe and differ from espnet1, so something is wrong. I couldn't find the reason; I'd like someone to investigate it.

@YosukeHiguchi, can you try it with your JSUT setup?

b-flo commented 4 years ago

Has anyone tested the transformer with espnet2? I found the results are much worse in the voxforge recipe and differ from espnet1, so something is wrong.

I tested the transformer for vivos and the results were bad. I just tried how2 and it's even more extreme:

2020-01-25 12:21:51,764 (reporter:183) INFO: 1epoch:train:1-357batch: loss=298.730, loss_att=339279.252, loss_ctc=576.931, acc=0.006, cer=1.089, wer=1.000, cer_ctc=4.385, lr_0=5.000
/home/fboye/espnet/espnet2/train/reporter.py:92: UserWarning: No valid stats found
  warnings.warn("No valid stats found")
2020-01-25 12:23:25,022 (reporter:183) INFO: 1epoch:train:358-714batch: loss=nan, loss_att=43192.529, loss_ctc=nan, acc=0.009, cer=0.911, wer=1.000, cer_ctc=3.269, lr_0=5.000
2020-01-25 12:24:58,882 (reporter:183) INFO: 1epoch:train:715-1071batch: loss=nan, loss_att=87852.724, loss_ctc=nan, acc=0.009, cer=0.914, wer=1.000, cer_ctc=4.070, lr_0=5.000
2020-01-25 12:26:31,414 (reporter:183) INFO: 1epoch:train:1072-1428batch: loss=nan, loss_att=39662.855, loss_ctc=nan, acc=0.010, cer=0.917, wer=1.000, cer_ctc=3.747, lr_0=5.000
2020-01-25 12:28:06,389 (reporter:183) INFO: 1epoch:train:1429-1785batch: loss=nan, loss_att=33076.978, loss_ctc=nan, acc=0.010, cer=0.912, wer=1.000, cer_ctc=4.056, lr_0=5.000
2020-01-25 12:29:36,650 (reporter:183) INFO: 1epoch:train:1786-2142batch: loss=nan, loss_att=30769.789, loss_ctc=nan, acc=0.010, cer=0.900, wer=1.000, cer_ctc=3.685, lr_0=5.000
2020-01-25 12:31:10,302 (reporter:183) INFO: 1epoch:train:2143-2499batch: loss=nan, loss_att=20457.835, loss_ctc=nan, acc=0.010, cer=0.918, wer=1.000, cer_ctc=4.100, lr_0=5.000
...
2020-01-25 12:52:11,296 (x2num:14) WARNING: NaN or Inf found in input tensor.
2020-01-25 12:52:11,297 (x2num:14) WARNING: NaN or Inf found in input tensor.
2020-01-25 12:52:11,297 (x2num:14) WARNING: NaN or Inf found in input tensor.
2020-01-25 12:52:11,297 (x2num:14) WARNING: NaN or Inf found in input tensor.
...

I couldn't find the reason; I'd like someone to investigate it.

Not sure if I can pinpoint the problem but I'll investigate.

kamo-naoyuki commented 4 years ago

I found a bug just now, I'll fix it tomorrow. Thanks.

simpleoier commented 4 years ago

I just finished training the RNN-based model on Librispeech with Adadelta. There are several points I want to mention.

  1. The model was trained without SpecAugment, so the acc and loss may be worse.
  2. Training per epoch is slower than before (roughly 70% of the previous speed).

Here is the training curve (acc and loss plots). Do you have any comments?

sw005320 commented 4 years ago

Which optimizer did you use? The epsilon-decay one (based on espnet1), or another?

simpleoier commented 4 years ago

torch.optim.Adadelta with hyper-parameters (lr=1.0, rho=0.95, eps=1e-8, weight_decay=0.0)

kamo-naoyuki commented 4 years ago

As for speed: report_cer=True and report_wer=True now, and they take quite some time (I forgot that I had enabled them for debugging).

Could you test ReduceLROnPlateau? This is one of the differences from v1, and I'd like to compare the results.
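For anyone unfamiliar with it, the behaviour under comparison can be modelled by a toy scheduler (illustration only; real code would use torch.optim.lr_scheduler.ReduceLROnPlateau and call scheduler.step(val_loss) once per epoch):

```python
class SimplePlateauScheduler:
    """Toy model of ReduceLROnPlateau in "min" mode: multiply the lr
    by `factor` once the monitored validation loss has failed to
    improve for more than `patience` consecutive epochs."""

    def __init__(self, lr=1.0, factor=0.5, patience=1):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss      # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor  # plateau detected: decay the lr
                self.bad_epochs = 0
        return self.lr
```

Unlike the epsilon-decay schedule used with Adadelta in espnet1, this reacts to the validation curve instead of decaying on a fixed trigger, which is exactly the behavioural difference worth comparing.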

Emrys365 commented 4 years ago

Hi, I just finished and tested the recipes for aishell, timit, and hkust. Here are the results.

The results seem much worse than those of the original recipes. Do you have any suggestions on the config?

Environments

  • python version: 3.7.3 (default, Mar 27 2019, 22:11:17) [GCC 7.3.0]

  • espnet version: espnet 0.6.0

  • pytorch version: pytorch 1.1.0

  • Git hash: e0fd073a70bcded6a0e6a3587630410a994ccdb8

    • Commit date: Sat Jan 11 06:09:24 2020 +0900
  • Results on AiShell

    asr_train_asr_rnn_fbank_pitch_char

    train config

    • Encoder: 3-layer VGG-BLSTMP with 1024 units
    • Decoder: 2-layer LSTM with 1024 units

      CER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_devdecode_asr_rnn_lm_valid.loss.best_asr_model_valid.acc.best 14326 205341 90.3 9.5 0.2 0.1 9.8 57.5
decode_testdecode_asr_rnn_lm_valid.loss.best_asr_model_valid.acc.best 7176 104765 89.2 10.5 0.4 0.2 11.0 60.0

WER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_devdecode_asr_rnn_lm_valid.loss.best_asr_model_valid.acc.best 14326 14326 42.5 57.5 0.0 0.0 57.5 57.5
decode_testdecode_asr_rnn_lm_valid.loss.best_asr_model_valid.acc.best 7176 7176 40.0 60.0 0.0 0.0 60.0 60.0

asr_train_asr_transformer_fbank_pitch_char

train config

  • Encoder: 12 layers, 2048 units
  • Decoder: 6 layers, 2048 units

    CER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_devdecode_asr_transformer_lm_valid.loss.best_asr_model_valid.acc.best 14326 205341 41.9 45.7 12.3 4.7 62.8 98.8
decode_testdecode_asr_transformer_lm_valid.loss.best_asr_model_valid.acc.best 7176 104765 37.0 50.6 12.4 7.6 70.6 99.2

WER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_devdecode_asr_transformer_lm_valid.loss.best_asr_model_valid.acc.best 14326 14326 1.2 98.8 0.0 0.0 98.8 98.8
decode_testdecode_asr_transformer_lm_valid.loss.best_asr_model_valid.acc.best 7176 7176 0.8 99.2 0.0 0.0 99.2 99.2
b-flo commented 4 years ago

@Emrys365 Hi,

Just looking at aishell and the original config, I would suggest:

encoder_conf:
...
    use_projection: False
...
decoder_conf:
...
    att_conf:
        adim: 1024
...
val_scheduler_criterion:
    - valid
    - acc
best_model_criterion:
-   - valid
    - acc
    - max

Edit: Projection layers are used in the original config, but from what I tested, results were better without them, so I'm not sure (btw, dropout can't be applied when use_projection=True because RNNP is a stack of 1-layer torch.nn.LSTM modules, and PyTorch's LSTM dropout only acts between layers).

Emrys365 commented 4 years ago

@Emrys365 Hi,

Just looking at aishell and the original config, I would suggest:

...

Edit: Projection layers are used in the original config, but from what I tested, results were better without them, so I'm not sure (btw, dropout can't be applied when use_projection=True because RNNP is a stack of 1-layer torch.nn.LSTM modules, and PyTorch's LSTM dropout only acts between layers).

Thank you for your suggestion! I will try this config.

kamo-naoyuki commented 4 years ago

@Emrys365 Thanks! Did you check #1533? Maybe GRU was used due to a bug, but it doesn't affect the results much in my experiments. How about trying another init method? I'm investigating the performance difference between PyTorch 1.0.1 and 1.4.0 now. xavier_uniform and xavier_normal seem better than the chainer init with v1.4.0 on voxforge. I also found that init=None, i.e. the PyTorch default, is quite a bit worse in v1.4.0.
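For reference, the bound that the Xavier/Glorot uniform init draws from is easy to state (standalone illustration; in practice one calls torch.nn.init.xavier_uniform_ on each weight matrix):

```python
import math
import random

def xavier_uniform_sample(fan_in, fan_out, gain=1.0):
    """Draw one weight from the Xavier/Glorot uniform distribution
    U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)) -- the same
    bound torch.nn.init.xavier_uniform_ uses. The bound keeps the
    activation variance roughly constant across layers."""
    a = gain * math.sqrt(6.0 / (fan_in + fan_out))
    return random.uniform(-a, a)
```

The chainer-style init, by contrast, uses LeCun-style normal scaling, which is one plausible source of the performance gap between init methods across PyTorch versions.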

We'll possibly do a second round to refine the configuration of all recipes..., so please send PRs without worrying too much about the results (of course, I'm glad if you check them).

sw005320 commented 4 years ago

We'll possibly do a second round to refine the configuration of all recipes..., so please send PRs without worrying too much about the results (of course, I'm glad if you check them).

I agree with this direction. Anyway, it is great that migration to espnet2 seems to be fine for most recipes.

Emrys365 commented 4 years ago

@Emrys365 Thanks! Did you check #1533? Maybe GRU was used due to a bug, but it doesn't affect the results much in my experiments. How about trying another init method? I'm investigating the performance difference between PyTorch 1.0.1 and 1.4.0 now. xavier_uniform and xavier_normal seem better than the chainer init with v1.4.0 on voxforge. I also found that init=None, i.e. the PyTorch default, is quite a bit worse in v1.4.0.

We'll possibly do a second round to refine the configuration of all recipes..., so please send PRs without worrying too much about the results (of course, I'm glad if you check them).

Thanks! I'll check the issue and prepare the initial PR.

sw005320 commented 4 years ago

Hi all, could you please check the new config in #1601 and redo the experiments? Please carefully check the differences between espnet1 and espnet2.

kamo-naoyuki commented 4 years ago

Please also use pytorch=1.4 and the built-in CTC, i.e. don't use warp-ctc. In my voxforge experiments, there was no big difference from it.

kamo-naoyuki commented 4 years ago

Hi all. I'll be off from github for 2-3 weeks, sorry.

ReinholdM commented 4 years ago

@Emrys365 hi! I just changed "trans_type" from char to phn to get the PER. I'd like to know which config you used to get the WER in your results. Does anything other than "trans_type" need to change?

Hi, I just finished and tested the recipes for aishell, timit, and hkust. Here are the results.

The results seem much worse than those of the original recipes. Do you have any suggestions on the config? ...

Emrys365 commented 4 years ago

@Emrys365 hi! I just changed "trans_type" from char to phn to get the PER. I'd like to know which config you used to get the WER in your results. Does anything other than "trans_type" need to change?

@ReinholdM Thanks! You can check the training config here: AiShell train config (rnn), AiShell train config (transformer).

BTW, I already put the hyperlinks ("train config", in blue) in the post you quoted. You could try that config first. I am not sure whether we need to change the parameters for the 'phn' type.

ReinholdM commented 4 years ago

@Emrys365 I want to ask about the config you used to get the WER on TIMIT. I only got around 28%–30% when I changed "trans_type" from char to phn, so I'd like to know what you changed to get your WER results on TIMIT. Thanks!

@Emrys365 hi! I just changed "trans_type" from char to phn to get the PER. I'd like to know which config you used to get the WER in your results. Does anything other than "trans_type" need to change?

@ReinholdM Thanks! You can check the training config here: AiShell train config (rnn), AiShell train config (transformer).

BTW, I already put the hyperlinks ("train config", in blue) in the post you quoted. You could try that config first. I am not sure whether we need to change the parameters for the 'phn' type.

Emrys365 commented 4 years ago

@ReinholdM Did you use the configuration here?

tonysy commented 4 years ago

@Emrys365 Hi, I would like to know how to reproduce the performance reported in the README of the espnet2 aishell recipe. I found my performance is far from what is reported there, and I used the latest version of ESPnet (tried with both PyTorch 1.1 and 1.4). Thanks a lot.

Emrys365 commented 4 years ago

@tonysy Sorry, I haven't been following the latest ESPnet for some time. I will test my configuration with the new version in the next few days.

tonysy commented 4 years ago

@Emrys365 Hi, can you share the commit id, pytorch version, and python version needed to reproduce the performance reported in the README.md? Thanks.

Emrys365 commented 4 years ago

@tonysy Sure, you can check this commit: https://github.com/espnet/espnet/pull/1549 (PyTorch 1.1.0 and Python 3.7.3)

kamo-naoyuki commented 4 years ago

@tonysy More information must be provided when asking others for help. Please show the acc/loss graphs and the WER/CER results at least.

Emrys365 commented 4 years ago

Hi @tonysy, here are the results I've got with the latest ESPnet and PyTorch v1.1.0:

RESULTS

asr_train_asr_rnn_fbank_pitch_char

CER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_dev_decode_asr_rnn_lm_train_lm_char_valid.loss.best_asr_model_valid.acc.best 14326 205341 92.5 7.3 0.2 0.1 7.6 49.8
decode_test_decode_asr_rnn_lm_train_lm_char_valid.loss.best_asr_model_valid.acc.best 7176 104765 91.4 8.4 0.3 0.2 8.8 53.6

asr_train_asr_transformer_fbank_pitch_char

CER

dataset Snt Wrd Corr Sub Del Ins Err S.Err
decode_dev_decode_asr_transformer_lm_train_lm_char_valid.loss.best_asr_model_valid.acc.best 14326 205341 81.3 16.6 2.1 0.5 19.2 72.5
decode_test_decode_asr_transformer_lm_train_lm_char_valid.loss.best_asr_model_valid.acc.best 7176 104765 79.1 18.3 2.7 0.9 21.8 74.5

The RNN results are very similar to those in espnet/espnet:master/egs2/aishell/asr1/README.md, and the Transformer results are much better now.

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] commented 4 years ago

This issue is closed. Please re-open if needed.

xcrpkuss commented 4 years ago

If we are interested in sharing/submitting recipes for new corpora, should we focus on v2 from the get-go?

My suggestion is to go with v1 for now. The transition from v1 to v2 would not be very difficult.

Hi all, could you please check the new config in #1601 and redo the experiments? Please carefully check the differences between espnet1 and espnet2.

Do the files in egs correspond to espnet1 and egs2 to espnet2? Would it be better for me to use espnet1 with the shell scripts in egs?

xcrpkuss commented 4 years ago

If we are interested in sharing/submitting recipes for new corpora, should we focus on v2 from the get-go?

My suggestion is to go with v1 for now. The transition from v1 to v2 would not be very difficult.

Are there any documents introducing these folders? Thank you!

b-flo commented 4 years ago

Hi,

Do the files in egs correspond to espnet1 and egs2 to espnet2?

Yes, egs contains the ESPnet1 recipes while egs2 contains the ESPnet2 recipes.

Would it be better for me to use espnet1 with the shell scripts in egs?

Sorry, I'm not sure I understand...

Are there any documents introducing these folders?

I don't think there is a doc introducing these folders specifically. You may find some information at https://espnet.github.io/espnet/index.html. Each folder name should be self-explanatory, though.