vivektyagiibm opened 1 year ago
So I printed args at abs.py line 569, which is setting decoder to None because it is missing from args.
My question/request: please provide good documentation for the factory methods you use in abs_task.py, so that we users can debug these issues ourselves. It would be preferable to instead provide 4-10 direct model classes and instantiate one of them from a simple if/elif block keyed on a model_name string in the .yaml file.
The current style of abstract base classes looks very cool from a coding point of view, but it quickly becomes error-prone in a research codebase that is changing constantly :)
And if you plan to keep abs_task.py, please provide good documentation in the ESPnet tutorial on the logic it uses to instantiate the different kinds of models, encoders, and decoders.
On further debugging, I found that the asr_config and inference_config variables from run.sh are getting re-initialized to empty strings in asr.sh, which then leads the factory methods in abs_task.py etc. to run into this error.
So, please revisit my request. It would be far better to simplify the code, perhaps along the lines sketched below.
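For instance, something like the following (a minimal sketch of the proposed style; all class names here are hypothetical stubs, not ESPnet's actual API):

# A handful of concrete model classes selected by a plain if/elif on a
# model_name string read from the .yaml config. Stub classes only.
class _StubModel:
    def __init__(self, **conf):
        self.conf = conf

class CTCModel(_StubModel): pass
class RNNTModel(_StubModel): pass
class LASModel(_StubModel): pass
class TransformerModel(_StubModel): pass

def build_model(model_name: str, **conf):
    if model_name == "ctc":
        return CTCModel(**conf)
    elif model_name == "rnnt":
        return RNNTModel(**conf)
    elif model_name == "las":
        return LASModel(**conf)
    elif model_name == "transformer":
        return TransformerModel(**conf)
    else:
        raise ValueError(f"unknown model_name: {model_name}")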
This would avoid the complex factory methods, abstract base classes, etc., which are good software-engineering concepts but are not necessary for an R&D codebase, where the focus should be on the model being studied, its training code, and finally its decoding code.
Sorry for these simplistic requests, but I feel your library will make more impact within the R&D community with a simplified design. Thank you for considering my requests. :)
Thanks for your suggestions. I agree that each decoder network is now complicated, and refactoring each decoder network would be a good direction (or at least better documentation, as you suggested). I added the feature request tag and will make it a future action item.
@pyf98, could you take a look at this issue? I also could not understand why the following change broke stage 11.
#use_lm=true # Use language model for ASR decoding.
use_lm=false # Use language model for ASR decoding.
Dear Prof. Shinji,
So the bug is as follows. When I used the default confs for both training and decoding (please see the confs in the first message), they set ctc_weight: 0.3. This means the model uses a weighted sum of the Transformer decoder's autoregressive cross-entropy loss over output characters (w.r.t. the ground-truth characters) and the CTC loss. However, the factory methods in abs_task.py may not be instantiating the CTC loss computation for this config, and hence the decoder object is None, which leads to the assert error in asr/espnet_model.py.
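Schematically, the joint objective being described is the interpolation below (a minimal sketch in the spirit of ESPnet's joint CTC/attention training, not its actual code):

import torch

def joint_ctc_attention_loss(
    loss_ctc: torch.Tensor, loss_att: torch.Tensor, ctc_weight: float
) -> torch.Tensor:
    # Interpolate the CTC loss and the attention decoder's cross-entropy
    # loss: ctc_weight=0.0 keeps only the attention branch, ctc_weight=1.0
    # keeps only CTC, and 0.3 mixes the two as in the config below.
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att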
cat conf/train_asr_transformer.yaml
batch_type: numel
batch_bins: 400000
accum_grad: 4
max_epoch: 40
patience: none
...
encoder: transformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
To circumvent this problem, I then set ctc_weight to 0.0 in conf/train_asr_transformer.yaml.
This then starts the training, but even after 3 epochs on the full Librispeech dataset, the recognition output is very bad. It basically outputs the same sub-string, "the world is full of strange the world is full of strange the world is full of strange", for the test wav files, with minor variations.
So it seems the encoder hidden representations are not being considered when I set ctc_weight=0.0 in train_asr_transformer.yaml.
Please let me know if I can request a 30-minute Google Meet call with one of your team members at a time that suits them; I can then share my screen of my ESPnet installation and we can go over these logs and debug together.
Kind regards, Vivek
This then starts the training, but even after 10 epochs on the full Librispeech dataset, the recognition output is very bad. It basically outputs the same sub-string, "the world is full of strange the world is full of strange the world is full of strange", for the test wav files, with minor variations.
So it seems the encoder hidden representations are not being considered when I set ctc_weight=0.0 in train_asr_transformer.yaml.
Can you paste the log file?
I'm not sure whether it is really because the encoder hidden representations are not being considered.
It would be great if you could share the log file and learning curves (they are created under the experiment directory during training).
So for the above conf, where I have set ctc_weight=0, please find the training log here.
We expect this to be a pure Transformer trained with the autoregressive decoder's predicted-character cross-entropy loss w.r.t. the ground-truth characters in the sequence. I have set token_type=char in asr.sh:
# Tokenization related
token_type=char # Tokenization type (char or bpe).
nbpe=30 # The number of BPE vocabulary.
bpemode=unigram # Mode of BPE (unigram or bpe).
train.log Please find train.log attached.
The above training had done 3 epochs on full Librispeech, and I just ran the inference by setting the stage to 12 in asr.sh; please find one of its sub-logs here. As you can see, it prints almost the same sentence for all the wav files. asr_inference.1.log
2023-09-25 10:57:50,945 (asr_inference:359) INFO: Decoding device=cpu, dtype=float32
2023-09-25 10:57:51,092 (asr_inference:397) INFO: Text tokenizer: CharTokenizer(space_symbol="
2023-09-25 11:00:49,894 (asr_inference:445) INFO: speech length: 52400
2023-09-25 11:00:50,436 (beam_search:415) INFO: decoder input length: 101
2023-09-25 11:00:50,436 (beam_search:416) INFO: max output length: 101
2023-09-25 11:00:50,436 (beam_search:417) INFO: min output length: 0
2023-09-25 11:01:22,431 (batch_beam_search:398) INFO: adding
2023-09-25 11:01:22,443 (asr_inference:445) INFO: speech length: 106000
2023-09-25 11:01:23,706 (beam_search:415) INFO: decoder input length: 206
2023-09-25 11:01:23,706 (beam_search:416) INFO: max output length: 206
2023-09-25 11:01:23,706 (beam_search:417) INFO: min output length: 0
2023-09-25 11:03:10,213 (beam_search:429) INFO: end detected at 203
2023-09-25 11:03:10,227 (beam_search:454) INFO: -36.87 * 1.0 = -36.87 for decoder
2023-09-25 11:03:10,227 (beam_search:457) INFO: total log probability: -36.87
2023-09-25 11:03:10,227 (beam_search:458) INFO: normalized log probability: -0.64
2023-09-25 11:03:10,228 (beam_search:459) INFO: total number of ended hypotheses: 24
2023-09-25 11:03:10,228 (beam_search:461) INFO: best hypo: THERE
2023-09-25 11:03:10,237 (asr_inference:445) INFO: speech length: 42880
2023-09-25 11:03:10,752 (beam_search:415) INFO: decoder input length: 83
2023-09-25 11:03:10,753 (beam_search:416) INFO: max output length: 83
2023-09-25 11:03:10,753 (beam_search:417) INFO: min output length: 0
2023-09-25 11:03:34,594 (batch_beam_search:398) INFO: adding
Can you tune the learning rate? The loss does decrease initially, so this could be an optimization issue.
Thanks, Prof. Shinji,
Thank you, and you are right that that may be the problem :)
Unfortunately, the Transformer learning-rate schedule is still a big mystery to me. Currently the default value in conf/train_asr_transformer.yaml is set to 0.002. Should I set it to 1e-5? And please point me to the code/class that implements the learning-rate schedule, from warmup to decay :)
I also see warmup_steps: 25000 in this conf, and I will increase it to something like 50,000 just to keep the learning rate low initially. Will that be okay, Prof. Shinji? Kind regards, Vivek
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
I'd recommend you simply reduce the learning rate. If it does not work, we may also consider changing the warmup_steps. You may find a lot of examples of how we change the optimization-related hyper-parameters by checking conf files in https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1/conf/tuning
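For reference, the warmuplr scheduler named in the config implements a Noam-style warmup-then-decay schedule; a minimal sketch, assuming ESPnet2's WarmupLR semantics:

def warmup_lr(step: int, base_lr: float = 0.002, warmup_steps: int = 25000) -> float:
    # step >= 1. Linear warmup up to base_lr over warmup_steps, then
    # inverse-square-root decay; the peak rate at step == warmup_steps
    # is exactly base_lr, so lowering lr scales the whole curve down,
    # while raising warmup_steps makes the ramp longer and shallower
    # (the peak is still base_lr, just reached later).
    return base_lr * warmup_steps ** 0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)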
Thank you very much, Prof. Shinji. I have relaunched the training with lr=1e-5.
It is a bit surprising that all the other conf files in asr1/conf/tuning seem to have lr=1e-3.
I will update you after 12-18 hours on the training loss and intermediate decoding.
Kind regards, Vivek
> So the bug is as follows. When I used the default confs for both training and decoding (please see the confs in the first message), they set ctc_weight: 0.3. This means the model uses a weighted sum of the Transformer decoder's autoregressive cross-entropy loss over output characters (w.r.t. the ground-truth characters) and the CTC loss. However, the factory methods in abs_task.py may not be instantiating the CTC loss computation for this config, and hence the decoder object is None, which leads to the assert error in asr/espnet_model.py.
I do not understand this. Joint CTC/attention is a very standard method in ESPnet, and there is no issue with it.
> I have set token_type=char in asr.sh
This is just the default value for the token type in asr.sh; it can be overwritten from run.sh.
So Yifan, yes, the joint loss is not the problem, but the code seems to have a bug somewhere, because when I run bash run.sh with the default conf files, i.e. conf/train_asr_transformer.yaml, in the dir /home/vivek/espnet/egs2/librispeech/asr1, it throws the following error (see the bottom of this message). Therefore, to circumvent that problem, I had set ctc_weight=0 in the conf/train_asr_transformer.yaml file.
But this second training also runs into an issue, in the sense that even after 4 full epochs on the full Librispeech dataset, it outputs almost the same text for all the wav files (please see my messages above). For that, Prof. Shinji advised that I lower the learning rate, e.g. to 1e-5. So I'm running that training now too, and it seems to have the same behaviour.
So there are two issues.
Also note that the default asr.sh has token_type=bpe, which I have changed to token_type=char, as I would like to train a model with just the 26 characters, plus space and the apostrophe (').
encoder: transformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
    - 0
    - 30
    num_freq_mask: 2
    apply_time_mask: true
    time_mask_width_range:
    - 0
    - 40
    num_time_mask: 2
Error log with the default conf files, where ctc_weight=0.3. You can reproduce this as follows: do a fresh install of espnet and kaldi on your machine, then run
cd espnet/egs2/librispeech/asr1
bash run.sh
and you will see these errors.
/home/vivek/espnet/tools/venv/bin/python3 /home/vivek/espnet/espnet2/bin/asr_train.py --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/en_token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_shape_file exp/asr_stats_raw_en_char/logdir/train.1.scp --valid_shape_file exp/asr_stats_raw_en_char/logdir/valid.1.scp --output_dir exp/asr_stats_raw_en_char/logdir/stats.1 --frontend_conf fs=16k --train_data_path_and_name_and_type dump/raw/train_960/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/train_960/text,text,text --valid_data_path_and_name_and_type dump/raw/dev/text,text,text
[vivek-deeplearning] 2023-09-24 15:22:49,733 (asr:490) INFO: Vocabulary size: 31
Traceback (most recent call last):
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1082, in main
    cls.main_worker(args)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1192, in main_worker
    model = cls.build_model(args=args)
  File "/home/vivek/espnet/espnet2/tasks/asr.py", line 582, in build_model
    model = model_class(
  File "/home/vivek/espnet/espnet2/asr/espnet_model.py", line 165, in __init__
    decoder is not None
AssertionError: decoder should not be None when attention is used
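For reference, the failing check is a consistency assertion of roughly the following form (a simplified sketch; the exact code in espnet2/asr/espnet_model.py may differ):

def check_decoder(decoder, ctc_weight: float) -> None:
    # When ctc_weight < 1.0 the attention branch is active during
    # training, so a decoder must have been built from the "decoder:"
    # entry of the training config; otherwise this assertion fires.
    if ctc_weight < 1.0:
        assert decoder is not None, "decoder should not be None when attention is used"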
So Yifan, here is the output of
(base) vivek@vivek-deeplearning:~/espnet/egs2/librispeech/asr1$ git diff
diff --git a/egs2/TEMPLATE/asr1/asr.sh b/egs2/TEMPLATE/asr1/asr.sh
index c25e0b1ad..cdf310f56 100755
--- a/egs2/TEMPLATE/asr1/asr.sh
+++ b/egs2/TEMPLATE/asr1/asr.sh
@@ -24,7 +24,9 @@ min() {
SECONDS=0
# General configuration
-stage=1 # Processes starts from the specified stage.
+#stage=1 # Processes starts from the specified stage.
+#stage=6 # Processes starts from the specified stage.
+stage=6 # Processes starts from the specified stage.
stop_stage=10000 # Processes is stopped at the specified stage.
skip_stages= # Spicify the stage to be skipped
skip_data_prep=false # Skip data preparation stages.
@@ -60,7 +62,7 @@ min_wav_duration=0.1 # Minimum duration in second.
max_wav_duration=20 # Maximum duration in second.
# Tokenization related
-token_type=bpe # Tokenization type (char or bpe).
+token_type=char # Tokenization type (char or bpe).
nbpe=30 # The number of BPE vocabulary.
bpemode=unigram # Mode of BPE (unigram or bpe).
oov="<unk>" # Out of vocabulary symbol.
@@ -77,7 +79,8 @@ ngram_exp=
ngram_num=3
# Language model related
-use_lm=true # Use language model for ASR decoding.
+#use_lm=true # Use language model for ASR decoding.
+use_lm=false # Use language model for ASR decoding.
lm_tag= # Suffix to the result dir for language model training.
lm_exp= # Specify the directory path for LM experiment.
# If this option is specified, lm_tag is ignored.
@@ -96,7 +99,7 @@ asr_tag= # Suffix to the result dir for asr model training.
asr_exp= # Specify the directory path for ASR experiment.
# If this option is specified, asr_tag is ignored.
and
diff --git a/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml b/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml
index 8958728c6..51f829338 100644
--- a/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml
+++ b/egs2/librispeech/asr1/conf/tuning/train_asr_transformer.yaml
@@ -1,7 +1,7 @@
batch_type: numel
-batch_bins: 16000000
+batch_bins: 400000
accum_grad: 4
-max_epoch: 200
+max_epoch: 40
patience: none
# The initialization method for model parameters
init: xavier_uniform
@@ -34,16 +34,16 @@ decoder_conf:
src_attention_dropout_rate: 0.0
model_conf:
- ctc_weight: 0.3
+ ctc_weight: 0.0
lsm_weight: 0.1
length_normalized_loss: false
optim: adam
optim_conf:
- lr: 0.002
+ lr: 0.00001
scheduler: warmuplr
scheduler_conf:
- warmup_steps: 25000
+ warmup_steps: 80000
specaug: specaug
specaug_conf:
Describe the bug
I am training a Librispeech Transformer ASR model using the recipe in the espnet/egs2/librispeech/asr1/ dir.
cat conf/train_asr_transformer.yaml
batch_type: numel
batch_bins: 400000
accum_grad: 4
max_epoch: 40
patience: none
# The initialization method for model parameters
init: xavier_uniform
best_model_criterion:
encoder: transformer
encoder_conf:
    output_size: 512
    attention_heads: 8
    linear_units: 2048
    num_blocks: 12
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d
    normalize_before: true
decoder: transformer
decoder_conf:
    attention_heads: 8
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1
    length_normalized_loss: false
optim: adam
optim_conf:
    lr: 0.002
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
specaug: specaug
specaug_conf:
    apply_time_warp: true
    time_warp_window: 5
    time_warp_mode: bicubic
    apply_freq_mask: true
    freq_mask_width_range:
cat conf/decode_asr.yaml
beam_size: 10
ctc_weight: 0.3
lm_weight: 0.0
Basic environments:
OS information: Linux
commit 8a8709e6579a593eef2e98433a7e69a6ae1c828f (HEAD -> master, origin/master, origin/HEAD)
Merge: f85f4927d 8f70993f7
Author: Shinji Watanabe <sw005320@gmail.com>
Date: Sat Sep 23 07:57:22 2023 -0400
    Merge pull request #5120 from pyf98/whisper-public
    Support Whisper-style training as a new task S2T
Error logs
In asr.sh, I have set:
#use_lm=true # Use language model for ASR decoding.
use_lm=false # Use language model for ASR decoding.
All other files and configs are as in the latest remote master branch as of Sep 23, 2023 (Merge: f85f4927d 8f70993f7).
When I launch bash run.sh, the script fails at stage 11, i.e. ASR model training, with the log below. Please note that I'm using the default conf files (please see their contents above). I'm not setting the decoder to None anywhere; it seems it is being created by some embedded factory method. Can you please point me to this factory-method code and some documentation of how it works?
Sorry, but in general it would be nice to have just 4 main models (CTC, RNN-T, LAS, Transformer), each with its own model class, that a user can train directly, rather than a complex codebase of multiple models, their configs, and their hidden factory constructors, which leads to these kinds of bugs that are hard to debug for an end-user who has not created the library. Thank you for your help in debugging this bug.
2023-09-24T15:22:45 (asr.sh:1189:main) Stage 10: ASR collect stats: train_set=dump/raw/train_960, valid_set=dump/raw/dev
2023-09-24T15:22:45 (asr.sh:1240:main) Generate 'exp/asr_stats_raw_en_char/run.sh'. You can resume the process from stage 10 using this script
2023-09-24T15:22:45 (asr.sh:1244:main) ASR collect-stats started... log: 'exp/asr_stats_raw_en_char/logdir/stats.*.log'
run.pl: 32 / 32 failed, log is in exp/asr_stats_raw_en_char/logdir/stats.*.log
python3 -m espnet2.bin.asr_train --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/en_token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_shape_file exp/asr_stats_raw_en_char/logdir/train.1.scp --valid_shape_file exp/asr_stats_raw_en_char/logdir/valid.1.scp --output_dir exp/asr_stats_raw_en_char/logdir/stats.1 --frontend_conf fs=16k --train_data_path_and_name_and_type dump/raw/train_960/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/train_960/text,text,text --valid_data_path_and_name_and_type dump/raw/dev/text,text,text
Started at Sun Sep 24 15:22:45 IST 2023
# /home/vivek/espnet/tools/venv/bin/python3 /home/vivek/espnet/espnet2/bin/asr_train.py --collect_stats true --use_preprocessor true --bpemodel none --token_type char --token_list data/en_token_list/char/tokens.txt --non_linguistic_symbols none --cleaner none --g2p none --train_shape_file exp/asr_stats_raw_en_char/logdir/train.1.scp --valid_shape_file exp/asr_stats_raw_en_char/logdir/valid.1.scp --output_dir exp/asr_stats_raw_en_char/logdir/stats.1 --frontend_conf fs=16k --train_data_path_and_name_and_type dump/raw/train_960/wav.scp,speech,sound --valid_data_path_and_name_and_type dump/raw/dev/wav.scp,speech,sound --train_data_path_and_name_and_type dump/raw/train_960/text,text,text --valid_data_path_and_name_and_type dump/raw/dev/text,text,text
[vivek-deeplearning] 2023-09-24 15:22:49,733 (asr:490) INFO: Vocabulary size: 31
Traceback (most recent call last):
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/vivek/anaconda3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 23, in <module>
    main()
  File "/home/vivek/espnet/espnet2/bin/asr_train.py", line 19, in main
    ASRTask.main(cmd=cmd)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1082, in main
    cls.main_worker(args)
  File "/home/vivek/espnet/espnet2/tasks/abs_task.py", line 1192, in main_worker
    model = cls.build_model(args=args)
  File "/home/vivek/espnet/espnet2/tasks/asr.py", line 582, in build_model
    model = model_class(
  File "/home/vivek/espnet/espnet2/asr/espnet_model.py", line 165, in __init__
    decoder is not None
AssertionError: decoder should not be None when attention is used
Accounting: time=6 threads=1