Open gandroz opened 3 years ago
@gandroz Wow, cool. If you got the same result for test-other, then you should check the transcript file to see whether it points to the test-other files. You should check the test-clean transcript file too.
Anyway, I'm thinking that maybe the authors have some tricks that we didn't see, which reduce the result to 2.7%.
One more thing: there's only a very small difference between greedy and beam search at this WER level, so to get results faster we can ignore the difference and test only with greedy search to see whether it gets down to around 2.7-3%.
I'll try to continue training for several more epochs, as training does not seem to have finished. I'll read the paper again to look for any clue on how to reduce the WER even more.
But I don't have anything special in my transcripts; test-clean and test-other are well separated.
@gandroz You should check or regenerate the transcript file; maybe when creating the test-other transcript file, you pointed to the test-clean directory.
If everything is right, then it's so weird haha :laughing:
I checked both files, my config file too and got the same results. So weird. I'll try to debug to find any mistake
I found out why I always got the same test metrics... I tested on the test-clean dataset and it saved a test.tsv file, but each time I ran another test, because that file already existed, only the metrics were computed and no inference was done. I've removed this file and launched another test with the test-other dataset so I can update the results.
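The caching behaviour described above (an existing test.tsv makes the test run skip inference and only recompute metrics) can be guarded against by clearing the old output before each run. A minimal sketch, assuming the results file path is known; `fresh_output_path` is a hypothetical helper and not part of TensorFlowASR:

```python
import os
import tempfile


def fresh_output_path(path: str) -> str:
    """Delete a stale results file so the next test run must redo inference.

    Hypothetical helper (not a TensorFlowASR function): it removes `path`
    if it exists and returns the same path for the caller to write to.
    """
    if os.path.exists(path):
        os.remove(path)
    return path


if __name__ == "__main__":
    # Simulate a leftover test.tsv from a previous test-clean run.
    with tempfile.TemporaryDirectory() as tmp:
        stale = os.path.join(tmp, "test.tsv")
        with open(stale, "w") as f:
            f.write("old results\n")
        fresh_output_path(stale)
        print(os.path.exists(stale))
```

Writing each test set to its own output file (for example one per dataset name) would avoid the collision without deleting anything.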
@gandroz Can you post your full config file you are using to generate the ~5% WER results?
Thanks!!!
@ncilfone sure !
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  output_path_prefix: /data/models/asr/conformer_sentencepiece_subword
  model_type: unigram
  target_vocab_size: 1024
  blank_at_zero: True
  beam_width: 5
  norm_score: True
  corpus_files:
    - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
    - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
    - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0.1
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 1
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/test-other/transcripts.tsv
    tfrecords_dir: null

  optimizer_config:
    warmup_steps: 10000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 50
    outdir: /data/models/asr/conformer_sentencepiece_subword
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /data/models/asr/conformer_sentencepiece_subword/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /data/models/asr/conformer_sentencepiece_subword/states
    tensorboard:
      log_dir: /data/models/asr/conformer_sentencepiece_subword/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2
I used a sentencepiece (unigram) model as vocab, currently trying with the BPE version
Thanks @gandroz!
Is that the vocab here: vocabularies/librispeech_train_4_1030.subwords
Edit: Based on the config it seems like you might generate one before training?
Also is this just single GPU training?
No, it's not that vocab. However, you can train your own with script\generate_vocab_sentencepiece.py, giving it your config file.
And I'm training on two GTX 1080Ti. It took soooo long to train, I'm looking for a way to pre-compute the fbanks as they are computed on the fly which might take some time.
Yeah just realized that you generate it based on the config options. Thanks for letting me know!
I'm assuming you are doing the featurization of the WAV files in TF as the stft etc. should be a bit faster on the GPU. DALI might be another place to look too although I've never used it...
Final question I promise... It looks like you are using
I think the best way to speed up processing is to pre-compute the fbanks, just as is done in fairseq.
For your information, featurization is done by the class tensorflow_asr\featurizers\speech_featurizers.py::TFSpeechFeaturizer.
I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?
I'm not sure I understand your question. SentencePiece is an unsupervised text tokenizer and detokenizer, so you have to train a model on the LibriSpeech transcripts. During training, the tokenized transcripts in each batch are padded to the length of the longest one in that batch.
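The per-batch padding mentioned above can be sketched as follows. This is only an illustration in plain Python, not the repository's featurizer code; the function name `pad_batch` and the choice of 0 as the padding id are assumptions made for the example.

```python
from typing import List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> List[List[int]]:
    """Right-pad every token-id sequence to the length of the longest one.

    `pad_id=0` is an assumption for this sketch; use whatever id the
    tokenizer and loss actually reserve for padding.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


# Three tokenized transcripts of different lengths in one batch.
batch = [[5, 17, 9], [42], [8, 3]]
padded = pad_batch(batch)
print(padded)  # [[5, 17, 9], [42, 0, 0], [8, 3, 0]]
```

Because the padding length depends only on the longest sequence in the batch, short utterances batched together waste less computation than padding everything to a global maximum.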
Ugh forgot that markdown will remove the notation I used... This is what I meant...
It looks like you are using <sos> and <eos> tokens in SentencePiece, but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?
Oh I see. You are right, the transcripts do not have those tokens, and they are useless as far as I understand. However, you can add them when encoding text. You can find more details in the repo, and I've just realized that there is a TensorFlow binding... I think I'll try it instead of the Python implementation I used.
Hi @gandroz ,
Have you tested on the test-other set, and what is the result?
Thanks!
@tund not yet. It took me a week to test on test-clean and I have not had time yet.
Thanks for your reply @gandroz. Since the performance of beam search is quite close to that of greedy search, I think running only greedy search will be much faster. Another question: do you use gradient accumulation for training? I saw "accumulation_steps: 4" in the config file, but I'm not sure what your exact training command is.
Indeed, I could just run greedy search for this test. In the near future perhaps... And yes, I used gradient accumulation.
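For readers unfamiliar with gradient accumulation, here is a small self-contained sketch of the idea on a toy one-parameter model. It is not the TensorFlowASR trainer; the names `mean_grad` and `accumulated_grad` and the toy data are made up for illustration.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (input x, target y)


def mean_grad(w: float, batch: List[Example]) -> float:
    """Gradient of the mean squared error of y = w * x with respect to w."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w: float, data: List[Example], micro_batch: int) -> float:
    """Average the gradients of consecutive micro-batches before one update."""
    chunks = [data[i:i + micro_batch] for i in range(0, len(data), micro_batch)]
    return sum(mean_grad(w, chunk) for chunk in chunks) / len(chunks)


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 9.0),
        (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]
w = 0.5

full = mean_grad(w, data)                          # one batch of 8 examples
accum = accumulated_grad(w, data, micro_batch=2)   # 4 accumulation steps of 2

# With equally sized micro-batches the two gradients agree.
print(abs(full - accum) < 1e-9)  # True
```

The equality only holds for equally sized micro-batches and for models without batch-dependent layers. Layers such as batch normalization compute their statistics per micro-batch, so accumulation is then an approximation of a truly large batch rather than an exact substitute.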
@gandroz any chance you can post your loss curves?
sure
The glitches at the end are due to an infinite-loop bug that was fixed afterwards (evaluation kept running endlessly after training ended). I trained the model for 40 epochs first and then continued for 10 more epochs.
How are you able to achieve such good results with your models? I've trained a conformer subword model, but it stops improving after ~20 epochs.
I've updated the Keras trainer to use EarlyStopping, which stops training after 5 epochs without improvement in validation loss.
What am I missing?
Train data: 50hrs
Eval data: 7hrs
Using TF RNN Loss
Audio lengths. Not sure :
mean 2.646981
std 2.420535
min 0.100000
25% 0.900000
50% 1.570000
75% 4.030000
max 20.000000
The test results are complete rubbish:
G_WER = 114.837982
G_CER = 88.0064
B_WER = 100
B_CER = 100
BLM_WER = 100
BLM_CER = 100
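A greedy WER above 100% is not a computation bug in itself: WER counts insertions as errors, so a hypothesis much longer than the reference can exceed 100%, while an empty hypothesis gives exactly 100%. The sketch below shows the standard word-level edit-distance definition; it is not the repository's metric code, and whether empty beam outputs explain the B_WER of 100 above is only a guess.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference word
                            curr[j - 1] + 1,      # insertion of a hypothesis word
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: edit distance over reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref, hypothesis.split()) / len(ref)


print(wer("hello world", "hello world"))  # 0.0
print(wer("hello world", ""))             # 100.0
print(wer("hello world", "a b c d e"))    # 250.0
```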
config
speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.subwords
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /tf_asr/manifests/liepa.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/liepa_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-train
    shuffle: True
    cache: False
    buffer_size: 100
    drop_remainder: True

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-eval
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-test
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    early_stopping:
      monitor: "val_val_rnnt_loss"
      mode: "min"
      patience: 5
      verbose: 1
    checkpoint:
      filepath: /tf_asr/models/checkpoints/epoch-{epoch:02d}-{val_val_rnnt_loss:.4f}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
      verbose: 1
      monitor: "val_val_rnnt_loss"
      mode: "min"
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2
@mjurkus Could you show the loss curves?
@mjurkus my training was performed on LibriSpeech, with 960h of training data. ASR needs lots of data to converge, so maybe you need more. Furthermore, maybe the LibriSpeech data is cleaner than yours? I also have some proprietary data, but it is way worse than LibriSpeech (not even the same sampling rate). Perhaps you could share the training curves?
Yeah, the amount of data is the answer... That's what I thought.
Here are a couple: Very clean, 16k data, 50hrs:
Mixed data: clean and noisy, 16k, 100hrs:
It's hard to get good labeled data for my language.
Your model does not seem to learn anything.... Try to reduce your LR, explore some data augmentation as it could help.
Using conformer with characters worked way better than using subwords. I managed to get decent results (WER ~15%), but I do not have the graphs for those.
Regarding augmentation: I figured that this config enables augmentation.
augmentation_config:
  after:
    time_masking:
      num_masks: 10
      mask_factor: 100
      p_upperbound: 0.05
    freq_masking:
      num_masks: 1
      mask_factor: 27
I've just finished training with ESPnet, with the same setup except join_dim=640. The WER is 4.9 on test_clean and 11.9 on test_other. How can I get the results in the Conformer paper? @gandroz have you received any reply from the Conformer authors?
@jinggaizi What vocabulary size did you use, 1k or 4k or english characters (around 28)?
1k
@jinggaizi no, I have no news from the author. I could try to email him again, he's smart. However, I am surprised by the WER you achieved with ESPnet. They say they had much better results (though I suspect that was not with the small model). Did you use an RNNT or a transformer as the decoder? When ESPnet announced results equal to or better than the paper's, it was with a transformer, as you can see in their sources.
Maybe you could ask the ESPnet team how they managed to achieve such good results: on which machine, with which config, etc.
@usimarit thanks for your reply. My result used RNNT as the decoder: the encoder is a small-size conformer, the decoder is 1 LSTM layer (dim=320), and the dimension of the joint network is 640. ESPnet (https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1) has no RNNT result, and I suspect its result is better because of speed augmentation.
@gandroz hi, do you have any news from the author? Do you train the model on GPU or TPU? Have you ever tried a larger batch size? I assume Google always uses a larger batch size. I have only worked on a Titan Xp with a small batch size; maybe a larger batch size can improve the transducer's result.
@jinggaizi I've run it with a batch size of 2048 (which is what I think they used in the original paper taken from this ref here http://arxiv.org/abs/2011.06110) via batch accumulation on 8 GPUs (with a joint dim of 320) for days and I can barely get below 5.9% on dev-clean.
It seems like a larger batch size doesn't help. I have no new ideas.
@ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient.
@ncilfone which GPUs did you use with the 2048 batch size? Did you improve the RNNT training along the lines of https://arxiv.org/pdf/1909.12415.pdf?
Just a follow-up on the author of the paper. I asked him for some clues on how we could achieve the same results. I asked about the dataset and whether the model was pre-trained or not, and asked for details on the hyperparameters that are not always mentioned in the paper. He was kind enough to answer, but without enough detail to help us much. Here it is:
Re: training set. We use the Librispeech 960h train set as mentioned in our paper.
Re: batch sizes. What batch-size do you use and what's the WER do you see on Librispeech Dev/Devother/Test/Testother datasets? I think this can be one reason, I can actually run an experiment with the same small batch size as yours and update you with the result. We ran our experiments on a batch size of 2048 and trained till 90-100k steps. To evaluate, we sampled 5 ckpts and picked the best one based on the dev/devother performance. Let me know what settings do you use and I can train and report back to you with the results.
So maybe a major difference comes from the batch size, which is... HUGE! I really don't know how they manage to train the large (or even the small) model with so much data. Maybe an avenue could be to split the model over multiple GPUs instead of replicating the model on each GPU. We could surely increase the batch size by doing so.
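To put the gap in numbers, the sketch below compares the effective batch size of the config posted earlier in the thread (batch_size: 2, accumulation_steps: 4, two GPUs) with the 2048 quoted by the author. It assumes batch_size is applied per GPU, which the thread does not confirm; both helper functions are hypothetical.

```python
import math


def effective_batch_size(per_replica_batch: int, accumulation_steps: int, replicas: int) -> int:
    """Examples contributing to one optimizer update."""
    return per_replica_batch * accumulation_steps * replicas


def accumulation_steps_needed(target: int, per_replica_batch: int, replicas: int) -> int:
    """Accumulation steps required to reach `target` examples per update."""
    return math.ceil(target / (per_replica_batch * replicas))


# Assumption: batch_size in the config is per GPU, with two GPUs.
current = effective_batch_size(per_replica_batch=2, accumulation_steps=4, replicas=2)
needed = accumulation_steps_needed(target=2048, per_replica_batch=2, replicas=2)

print(current)  # 16
print(needed)   # 512
```

Under that assumption each update sees 16 utterances, 128 times fewer than 2048, and matching it through accumulation alone would take 512 forward/backward passes per update.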
Thanks @gandroz, they have their HUGE TPUs, that's why they're able to get SOTA results. I'll try to implement gradient accumulation in keras builtin function and test on colab TPUs, hope it will get nearer to their result.
Hi @usimarit , I see high bias issue - rnnt_loss in 240s and does not go down further in keras conformer trainer (both keras and non-keras version). I tried learning rate of - 0.5/ sqrt(dmodel), 0.05/sqrt(dmodel), 0.005/sqrt(dmodel) with 960 hours librispeech. There is not much difference in the loss curve. Please let me know if I need to modify anything in the config file to train a model that matches the WER performance of reference latest.h5 (WER of 6.5 in my testing). Thanks
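One possible reason the three learning-rate values above behave alike is the warmup schedule. The sketch below implements the standard transformer ("Noam") schedule with an optional cap. Treating the values 0.5/sqrt(dmodel) etc. as such a cap (`max_lr`) is an assumption about how the trainer uses them and should be checked against the trainer source; dmodel=144 and warmup_steps=10000 are taken from the config posted earlier in the thread.

```python
from typing import Optional


def transformer_lr(step: int, d_model: int, warmup_steps: int,
                   max_lr: Optional[float] = None) -> float:
    """Noam schedule: linear warmup, then inverse-square-root decay.

    `max_lr` is an optional cap; whether the trainer applies the configured
    learning rate this way is an assumption of this sketch.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    lr = d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
    return lr if max_lr is None else min(lr, max_lr)


d_model, warmup = 144, 10000
peak = transformer_lr(warmup, d_model, warmup)
print(peak)  # about 8.3e-4, reached at the end of warmup

for factor in (0.5, 0.05, 0.005):
    cap = factor / d_model ** 0.5
    print(factor, cap, transformer_lr(warmup, d_model, warmup, max_lr=cap))
```

Under these assumptions the uncapped schedule peaks at about 8.3e-4, so caps of 0.5/12 and 0.05/12 never bind and only 0.005/12 (about 4.2e-4) changes the curve, which would be consistent with seeing little difference between the settings.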
@MadhuAtBerkeley I trained with that config on Google Drive, except that I used use_tf: False (the config on Drive is not updated to the latest version, but it still has the same meaning).
@usimarit Thanks! I confirm that use_tf:False does help and now I see loss curve going below 100.
Why does setting use_tf to False help the training, given that the tf version and the numpy version perform a similar method?
> I've just ended the training with espnet, except join_dim=640, the result of wer is test_clean:4.9, test_other:11.9, How can i get the results in the Conformer paper. @gandroz have you received any reply from conformer's authors?
Hi, could you please post your config in espnet?
> Why set use_tf to False help the training as both tf version and numpy version perform similar method?
The only difference is that the numpy version uses nlpaug, which randomly chooses between time masking and freq masking for augmentation, whereas the tf version applies both time and freq masking.
The tf version works fine for me on TPUs.
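The difference described above (apply both masks versus randomly pick one) can be sketched in plain Python. This is not the nlpaug or TensorFlowASR implementation: how mask_factor and p_upperbound bound the mask width, and masking with zeros, are assumptions of this sketch. The parameter values come from the configs posted in this thread.

```python
import random
from typing import List

Spectrogram = List[List[float]]  # indexed as [time][freq]


def time_mask(spec: Spectrogram, num_masks: int, mask_factor: int,
              p_upperbound: float, rng: random.Random) -> Spectrogram:
    """Zero out up to `num_masks` random spans of consecutive frames."""
    out = [row[:] for row in spec]
    n_frames = len(out)
    # Assumed width rule: at most mask_factor frames and p_upperbound of the length.
    max_width = min(mask_factor, int(p_upperbound * n_frames))
    for _ in range(num_masks):
        width = rng.randint(0, max_width)
        start = rng.randint(0, n_frames - width)
        for t in range(start, start + width):
            out[t] = [0.0] * len(out[t])
    return out


def freq_mask(spec: Spectrogram, num_masks: int, mask_factor: int,
              rng: random.Random) -> Spectrogram:
    """Zero out up to `num_masks` random bands of consecutive feature bins."""
    out = [row[:] for row in spec]
    n_bins = len(out[0])
    max_width = min(mask_factor, n_bins)
    for _ in range(num_masks):
        width = rng.randint(0, max_width)
        start = rng.randint(0, n_bins - width)
        for row in out:
            for f in range(start, start + width):
                row[f] = 0.0
    return out


def augment_both(spec: Spectrogram, rng: random.Random) -> Spectrogram:
    """Apply time masking and then frequency masking (the 'tf version' behaviour)."""
    return freq_mask(time_mask(spec, 10, 100, 0.05, rng), 1, 27, rng)


def augment_one(spec: Spectrogram, rng: random.Random) -> Spectrogram:
    """Randomly apply only one of the two masks (the 'numpy version' behaviour)."""
    if rng.random() < 0.5:
        return time_mask(spec, 10, 100, 0.05, rng)
    return freq_mask(spec, 1, 27, rng)


spec = [[1.0] * 80 for _ in range(200)]  # 200 frames, 80 feature bins
masked = augment_both(spec, random.Random(0))
```

Applying both masks to every utterance is a stronger regularizer than applying only one, which may matter on small datasets, though the thread does not establish that this is why use_tf: False helped.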
> > I've just ended the training with espnet, except join_dim=640, the result of wer is test_clean:4.9, test_other:11.9, How can i get the results in the Conformer paper. @gandroz have you received any reply from conformer's authors?
>
> Hi, could you please post your config in espnet?
batch-size: 6
maxlen-in: 800
maxlen-out: 150
criterion: loss
early-stop-criterion: "validation/main/loss"
sortagrad: 0
opt: noam
epochs: 50
patience: 0
accum-grad: 4
grad-clip: 5.0
etype: transformer
enc-block-arch:
transformer-lr: 10
transformer-warmup-steps: 25000
transformer-enc-positional-encoding-type: rel_pos
transformer-enc-self-attn-type: rel_self_attn
rnnt-mode: 'rnnt' # switch to 'rnnt-att' to use transducer with attention
model-module: "espnet.nets.pytorch_backend.e2e_asr_transducer:E2E"
@gandroz hi, have you had any response from the author about running some experiments with a small batch size? Have you tried any other methods to improve the result?
@jinggaizi No, no news from the author; I'll let you know as soon as I have any. I cannot work on the project for the moment, so there is nothing new from me either.
> no it's not that vocab. However, you can train yours with script\generate_vocab_sentencepiece.py giving your config file. And I'm training on two GTX 1080Ti. It took soooo long to train, I'm looking for a way to pre-compute the fbanks as they are computed on the fly which might take some time.
Hey, thanks for the updated config! Any rough estimates of how long it took to train (I'm guessing a few days at least)? Also, any luck with pre-computing fbanks?
> @ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient.
Hello, is gradient accumulation not supported in the latest version (v1.0.0)?
Hi, I've just finished training a conformer using the sentencepiece featurizer on LibriSpeech for 50 epochs. Here are the results if you want to update your readme:
Test results:
G_WER = 5.22291565
G_CER = 1.9693377
B_WER = 5.19438553
B_CER = 1.95449066
BLM_WER = 100
BLM_CER = 100
The strange part is that I got the same metrics on the test-other dataset, hmmm...