vivektyagiibm opened 1 year ago
So I printed args at abs.py line 569, which is setting decoder to None because it is missing from args.
My question/request: please provide good documentation for the factory methods you use in abs_task.py, so that we users can debug these issues ourselves. It would be preferable to instead provide 4-10 direct model classes and instantiate one of them from a simple if/elif block keyed on a model_name string in the .yaml file.
The current style of abstract base classes looks very cool from a coding point of view, but it quickly becomes error-prone in a research codebase that is changing constantly :)
And if you plan to keep abs_task.py, please provide good documentation in the ESPnet tutorial on the logic it uses to instantiate the different kinds of models, encoders, and decoders.
On further debugging, I found that the asr_config and inference_config variables from run.sh are getting re-initialized to empty strings in asr.sh, which then leads the factory methods in abs_task.py etc. to run into this error.
So, please revisit my request. It would be far better to simplify the code, perhaps along the lines sketched below.
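For instance, something like the following (a minimal sketch of the proposed style; all class names here are hypothetical stubs, not ESPnet's actual API):

# A handful of concrete model classes selected by a plain if/elif on a
# model_name string read from the .yaml config. Stub classes only.
class _StubModel:
    def __init__(self, **conf):
        self.conf = conf

class CTCModel(_StubModel): pass
class RNNTModel(_StubModel): pass
class LASModel(_StubModel): pass
class TransformerModel(_StubModel): pass

def build_model(model_name: str, **conf):
    if model_name == "ctc":
        return CTCModel(**conf)
    elif model_name == "rnnt":
        return RNNTModel(**conf)
    elif model_name == "las":
        return LASModel(**conf)
    elif model_name == "transformer":
        return TransformerModel(**conf)
    else:
        raise ValueError(f"unknown model_name: {model_name}")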
This would avoid the complex factory methods, abstract base classes, etc., which are good software-engineering concepts but are not necessary for an R&D codebase, where the focus should be on the model being studied, its training code, and finally its decoding code.
Sorry for these simplistic requests, but I feel your library will make more impact within the R&D community with a simplified design. Thank you for considering my requests. :)
Thanks for your suggestions. I agree that each decoder network is now complicated, and refactoring each decoder network would be a good direction (or at least better documentation, as you suggested). I added the feature request tag and will make it a future action item.
@pyf98, could you take a look at this issue? I also could not understand why the following change broke stage 11.
#use_lm=true # Use language model for ASR decoding.
use_lm=false # Use language model for ASR decoding.
Dear Prof. Shinji,
So the bug is as follows. When I used the default confs for both training and decoding (please see the confs in the first message), they set ctc_weight: 0.3. This means the model uses a weighted sum of the Transformer decoder's autoregressive cross-entropy loss over output characters (w.r.t. the ground-truth characters) and the CTC loss. However, the factory methods in abs_task.py may not be instantiating the CTC loss computation for this config, and hence the decoder object is None, which leads to the assert error in asr/espnet_model.py.
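Schematically, the joint objective being described is the interpolation below (a minimal sketch in the spirit of ESPnet's joint CTC/attention training, not its actual code):

import torch

def joint_ctc_attention_loss(
    loss_ctc: torch.Tensor, loss_att: torch.Tensor, ctc_weight: float
) -> torch.Tensor:
    # Interpolate the CTC loss and the attention decoder's cross-entropy
    # loss: ctc_weight=0.0 keeps only the attention branch, ctc_weight=1.0
    # keeps only CTC, and 0.3 mixes the two as in the config below.
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att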
cat conf/train_asr_transformer.yaml
batch_type: numel
batch_bins: 400000
accum_grad: 4
max_epoch: 40
patience: none
...
encoder: transformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
To circumvent this problem, I then set ctc_weight to 0.0 in conf/train_asr_transformer.yaml.
This then starts the training, but even after 3 epochs on the full Librispeech dataset, the recognition output is very bad. It basically outputs the same sub-string, "the world is full of strange the world is full of strange the world is full of strange", for the test wav files, with minor variations.
So it seems the encoder hidden representations are not being considered when I set ctc_weight=0.0 in train_asr_transformer.yaml.
Please let me know if I can request a 30-minute Google Meet call with one of your team members at a time that suits them; I can then share my screen of my ESPnet installation and we can go over these logs and debug together.
Kind regards, Vivek
This then starts the training, but even after 10 epochs on the full Librispeech dataset, the recognition output is very bad. It basically outputs the same sub-string, "the world is full of strange the world is full of strange the world is full of strange", for the test wav files, with minor variations.
So it seems the encoder hidden representations are not being considered when I set ctc_weight=0.0 in train_asr_transformer.yaml.
Can you paste the log file?
I'm not sure whether it is really because the encoder hidden representations are not being considered.
It would be great if you could share the log file and learning curves (they are created under the experiment directory during training).
So for the above conf, where I have set ctc_weight=0, please find the training log here.
We expect this to be a pure Transformer trained with the autoregressive decoder's predicted-character cross-entropy loss w.r.t. the ground-truth characters in the sequence. I have set token_type=char in asr.sh:
# Tokenization related
token_type=char # Tokenization type (char or bpe).
nbpe=30 # The number of BPE vocabulary.
bpemode=unigram # Mode of BPE (unigram or bpe).
train.log Please find train.log attached.
The above training had done 3 epochs on full Librispeech, and I just ran the inference by setting the stage to 12 in asr.sh; please find one of its sub-logs here. As you can see, it prints almost the same sentence for all the wav files. asr_inference.1.log
2023-09-25 10:57:50,945 (asr_inference:359) INFO: Decoding device=cpu, dtype=float32
2023-09-25 10:57:51,092 (asr_inference:397) INFO: Text tokenizer: CharTokenizer(space_symbol="
2023-09-25 11:00:49,894 (asr_inference:445) INFO: speech length: 52400
2023-09-25 11:00:50,436 (beam_search:415) INFO: decoder input length: 101
2023-09-25 11:00:50,436 (beam_search:416) INFO: max output length: 101
2023-09-25 11:00:50,436 (beam_search:417) INFO: min output length: 0
2023-09-25 11:01:22,431 (batch_beam_search:398) INFO: adding
2023-09-25 11:01:22,443 (asr_inference:445) INFO: speech length: 106000
2023-09-25 11:01:23,706 (beam_search:415) INFO: decoder input length: 206
2023-09-25 11:01:23,706 (beam_search:416) INFO: max output length: 206
2023-09-25 11:01:23,706 (beam_search:417) INFO: min output length: 0
2023-09-25 11:03:10,213 (beam_search:429) INFO: end detected at 203
2023-09-25 11:03:10,227 (beam_search:454) INFO: -36.87 * 1.0 = -36.87 for decoder
2023-09-25 11:03:10,227 (beam_search:457) INFO: total log probability: -36.87
2023-09-25 11:03:10,227 (beam_search:458) INFO: normalized log probability: -0.64
2023-09-25 11:03:10,228 (beam_search:459) INFO: total number of ended hypotheses: 24
2023-09-25 11:03:10,228 (beam_search:461) INFO: best hypo: THERE
2023-09-25 11:03:10,237 (asr_inference:445) INFO: speech length: 42880
2023-09-25 11:03:10,752 (beam_search:415) INFO: decoder input length: 83
2023-09-25 11:03:10,753 (beam_search:416) INFO: max output length: 83
2023-09-25 11:03:10,753 (beam_search:417) INFO: min output length: 0
2023-09-25 11:03:34,594 (batch_beam_search:398) INFO: adding
Can you tune the learning rate? The loss does decrease initially, so this could be an optimization issue.
Thanks, Prof. Shinji,
Thank you, and you are right that that may be the problem :)
Unfortunately, the Transformer learning-rate schedule is still a big mystery to me. Currently the default value in conf/train_asr_transformer.yaml is set to 0.002. Should I set it to 1e-5? And please point me to the code/class that implements the learning-rate schedule, from warmup to decay :)
I also see warmup_steps: 25000 in this conf, and I will increase it to something like 50,000 just to keep the learning rate low initially. Will that be okay, Prof. Shinji? Kind regards, Vivek
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
I'd recommend you simply reduce the learning rate. If it does not work, we may also consider changing the warmup_steps. You may find a lot of examples of how we change the optimization-related hyper-parameters by checking conf files in https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1/conf/tuning
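For reference, the warmuplr scheduler named in the config implements a Noam-style warmup-then-decay schedule; a minimal sketch, assuming ESPnet2's WarmupLR semantics:

def warmup_lr(step: int, base_lr: float = 0.002, warmup_steps: int = 25000) -> float:
    # step >= 1. Linear warmup up to base_lr over warmup_steps, then
    # inverse-square-root decay; the peak rate at step == warmup_steps
    # is exactly base_lr, so lowering lr scales the whole curve down,
    # while raising warmup_steps makes the ramp longer and shallower
    # (the peak is still base_lr, just reached later).
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)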
Thank you very much, Prof. Shinji. I have relaunched the training with lr=1e-5.
It is a bit surprising that all the other conf files in asr1/conf/tuning seem to have lr=1e-3.
I will update you after 12-18 hours on the training loss and intermediate decoding.
Kind regards, Vivek
> So the bug is as follows. When I used the default confs for both training and decoding (please see the confs in the first message), they set ctc_weight: 0.3. This means the model uses a weighted sum of the Transformer decoder's autoregressive cross-entropy loss over output characters (w.r.t. the ground-truth characters) and the CTC loss. However, the factory methods in abs_task.py may not be instantiating the CTC loss computation for this config, and hence the decoder object is None, which leads to the assert error in asr/espnet_model.py.
I do not understand this. Joint CTC/attention is a very standard method in ESPnet, and there is no issue with it.
> I have set token_type=char in asr.sh
This is just the default value for the token type in asr.sh; it can be overwritten from run.sh.
So Yifan, yes, the joint loss is not the problem, but the code seems to have a bug somewhere, because when I run bash run.sh with the default conf files, i.e. conf/train_asr_transformer.yaml, in the dir /home/vivek/espnet/egs2/librispeech/asr1, it throws the following error (see the bottom of this message). Therefore, to circumvent that problem, I had set ctc_weight=0 in the conf/train_asr_transformer.yaml file.
But this second training also runs into an issue, in the sense that even after 4 full epochs on the full Librispeech dataset, it outputs almost the same text for all the wav files (please see my messages above). For that, Prof. Shinji advised that I lower the learning rate, e.g. to 1e-5. So I'm running that training now too, and it seems to have the same behaviour.
So there are two issues.
Also note that the default asr.sh has token_type=bpe, which I have changed to token_type=char, as I would like to train a model with just the 26 characters, plus space and the apostrophe (').
encoder: transformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
Error log with the default conf files, where ctc_weight=0.3. You can reproduce this as follows: do a fresh install of espnet and kaldi on your machine, then run
cd espnet/egs2/librispeech/asr1
bash run.sh
and you will see these errors.
/home/vivek/espnet/tools/venv/bin/python3 /home/vivek/espnet/espnet2/bin/asr_train.py --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/en_token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_shape_file exp/asr_stats_raw_en_char/logdir/train.1.scp --valid_shape_file exp/asr_stats_raw_en_char/logdir/valid.1.scp --output_dir exp/asr_stats_raw_en_char/logdir/stats.1 --frontend_conf fs=16k --train_data_path_and_name_and_type dump/raw/train_960/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/train_960/text,text,text --valid_data_path_and_name_and_type dump/raw/dev/text,text,text
[vivek-deeplearning] 2023-09-24 15:22:49,733 (asr:490) INFO: Vocabulary size: 31
Traceback (most recent call last):
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1082, in main
    cls.main_worker(args)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1192, in main_worker
    model = cls.build_model(args=args)
  File "/home/vivek/espnet/espnet2/tasks/asr.py", line 582, in build_model
    model = model_class(
  File "/home/vivek/espnet/espnet2/asr/espnet_model.py", line 165, in __init__
    decoder is not None
AssertionError: decoder should not be None when attention is used
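For reference, the failing check is a consistency assertion of roughly the following form (a simplified sketch; the exact code in espnet2/asr/espnet_model.py may differ):

def check_decoder(decoder, ctc_weight: float) -> None:
    # When ctc_weight < 1.0 the attention branch is active during
    # training, so a decoder must have been built from the "decoder:"
    # entry of the training config; otherwise this assertion fires.
    if ctc_weight < 1.0:
        assert decoder is not None, "decoder should not be None when attention is used"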
So Yifan, here is the output of
(base) vivek@vivek-deeplearning:~/espnet/egs2/librispeech/asr1$ git diff
diff --git a/egs2/TEMPLATE/asr1/asr.sh b/egs2/TEMPLATE/asr1/asr.sh
index c25e0b1ad..cdf310f56 100755
--- a/egs2/TEMPLATE/asr1/asr.sh
+++ b/egs2/TEMPLATE/asr1/asr.sh
@@ -24,7 +24,9 @@ min() {
SECONDS=0
# General configuration
-stage=1 # Processes starts from the specified stage.
+#stage=1 # Processes starts from the specified stage.
+#stage=6 # Processes starts from the specified stage.
+stage=6 # Processes starts from the specified stage.
stop_stage=10000 # Processes is stopped at the specified stage.
skip_stages= # Spicify the stage to be skipped
skip_data_prep=false # Skip data preparation stages.
@@ -60,7 +62,7 @@ min_wav_duration=0.1 # Minimum duration in second.
max_wav_duration=20 # Maximum duration in second.
# Tokenization related
-token_type=bpe # Tokenization type (char or bpe).
+token_type=char # Tokenization type (char or bpe).
nbpe=30 # The number of BPE vocabulary.
bpemode=unigram # Mode of BPE (unigram or bpe).
oov="<unk>" # Out of vocabulary symbol.
@@ -77,7 +79,8 @@ ngram_exp=
ngram_num=3
# Language model related
-use_lm=true # Use language model for ASR decoding.
+#use_lm=true # Use language model for ASR decoding.
+use_lm=false # Use language model for ASR decoding.
lm_tag= # Suffix to the result dir for language model training.
lm_exp= # Specify the directory path for LM experiment.
# If this option is specified, lm_tag is ignored.
@@ -96,7 +99,7 @@ asr_tag= # Suffix to the result dir for asr model training.
asr_exp= # Specify the directory path for ASR experiment.
# If this option is specified, asr_tag is ignored.
and
diff --git a/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml b/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml
index 8958728c6..51f829338 100644
--- a/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml
+++ b/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml
@@ -1,7 +1,7 @@
batch_type: numel
-batch_bins: 16000000
+batch_bins: 400000
accum_grad: 4
-max_epoch: 200
+max_epoch: 40
patience: none
# The initialization method for model parameters
init: xavier_uniform
@@ -34,16 +34,16 @@ decoder_conf:
src_attention_dropout_rate: 0.0
model_conf:
- ctc_weight: 0.3
+ ctc_weight: 0.0
lsm_weight: 0.1
length_normalized_loss: false
optim: adam
optim_conf:
- lr: 0.002
+ lr: 0.00001
scheduler: warmuplr
scheduler_conf:
- warmup_steps: 25000
+ warmup_steps: 80000
specaug: specaug
specaug_conf:
Describe the bug
I am training a Librispeech Transformer ASR model using the recipe in the espnet/egs2/librispeech/asr1/ dir.
cat conf/train_asr_transformer.yaml
batch_type: numel
batch_bins: 400000
accum_grad: 4
max_epoch: 40
patience: none
# The initialization method for model parameters
init: xavier_uniform
best_model_criterion:
encoder: transformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
cat conf/decode_asr.yaml
beam_size: 10
ctc_weight: 0.3
lm_weight: 0.0
Basic environments:
OS information: Linux
commit 8a8709e6579a593eef2e98433a7e69a6ae1c828f (HEAD -> master, origin/master, origin/HEAD)
Merge: f85f4927d 8f70993f7
Author: Shinji Watanabe <sw005320@gmail.com>
Date: Sat Sep 23 07:57:22 2023 -0400
    Merge pull request #5120 from pyf98/whisper-public
    Support Whisper-style training as a new task S2T
Error logs
In asr.sh, I have set:
#use_lm=true # Use language model for ASR decoding.
use_lm=false # Use language model for ASR decoding.
All other files and configs are as in the latest remote master branch as of Sep 23, 2023 (Merge: f85f4927d 8f70993f7).
When I launch bash run.sh, the script fails at stage 11, i.e. ASR model training, with the log below. Please note that I'm using the default conf files (please see their contents above). I'm not setting the decoder to None anywhere; it seems it is being created by some embedded factory method. Can you please point me to this factory-method code and some documentation of how it works?
Sorry, but in general it would be nice to have just 4 main models (CTC, RNN-T, LAS, Transformer), each with its own model class, that a user can train directly, rather than a complex codebase of multiple models, their configs, and their hidden factory constructors, which leads to these kinds of bugs that are hard to debug for an end-user who has not created the library. Thank you for your help in debugging this bug.
2023-09-24T15:22:45 (asr.sh:1189:main) Stage 10: ASR collect stats: train_set=dump/raw/train_960, valid_set=dump/raw/dev
2023-09-24T15:22:45 (asr.sh:1240:main) Generate 'exp/asr_stats_raw_en_char/run.sh'. You can resume the process from stage 10 using this script
2023-09-24T15:22:45 (asr.sh:1244:main) ASR collect-stats started... log: 'exp/asr_stats_raw_en_char/logdir/stats.*.log'
run.pl: 32 / 32 failed, log is in exp/asr_stats_raw_en_char/logdir/stats.*.log
python3 -m espnet2.bin.asr_train --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/en_token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_shape_file exp/asr_stats_raw_en_char/logdir/train.1.scp --valid_shape_file exp/asr_stats_raw_en_char/logdir/valid.1.scp --output_dir exp/asr_stats_raw_en_char/logdir/stats.1 --frontend_conf fs=16k --train_data_path_and_name_and_type dump/raw/train_960/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/train_960/text,text,text --valid_data_path_and_name_and_type dump/raw/dev/text,text,text
Started at Sun Sep 24 15:22:45 IST 2023
# /home/vivek/espnet/tools/venv/bin/python3 /home/vivek/espnet/espnet2/bin/asr_train.py --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/en_token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_shape_file exp/asr_stats_raw_en_char/logdir/train.1.scp --valid_shape_file exp/asr_stats_raw_en_char/logdir/valid.1.scp --output_dir exp/asr_stats_raw_en_char/logdir/stats.1 --frontend_conf fs=16k --train_data_path_and_name_and_type dump/raw/train_960/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/train_960/text,text,text --valid_data_path_and_name_and_type dump/raw/dev/text,text,text
[vivek-deeplearning] 2023-09-24 15:22:49,733 (asr:490) INFO: Vocabulary size: 31
Traceback (most recent call last):
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1082, in main
    cls.main_worker(args)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1192, in main_worker
    model = cls.build_model(args=args)
  File "/home/vivek/espnet/espnet2/tasks/asr.py", line 582, in build_model
    model = model_class(
  File "/home/vivek/espnet/espnet2/asr/espnet_model.py", line 165, in __init__
    decoder is not None
AssertionError: decoder should not be None when attention is used
Accounting: time=6 threads=1