TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

WER for conformer update #124

Open gandroz opened 3 years ago

gandroz commented 3 years ago

Hi, I've just finished training a conformer with the sentencepiece featurizer on LibriSpeech for 50 epochs. Here are the results if you want to update your readme:

dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv

Test results:

G_WER = 5.22291565
G_CER = 1.9693377
B_WER = 5.19438553
B_CER = 1.95449066
BLM_WER = 100
BLM_CER = 100

The strange part is that I got the same metrics on the test-other dataset, hmmm...
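For readers unfamiliar with the metrics above, here is a minimal sketch of how a corpus-level WER and CER can be computed from an edit distance. It assumes the `G_` and `B_` prefixes stand for greedy and beam-search decoding; the function names are illustrative and are not taken from the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="word"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    split = (lambda s: s.split()) if unit == "word" else list
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * edits / total


refs = ["the cat sat on the mat"]
hyps = ["the cat sat on mat"]
print(error_rate(refs, hyps, "word"))  # one deleted word out of six
print(error_rate(refs, hyps, "char"))  # four deleted characters out of 22
```

Because the denominator is the reference length, a hypothesis with many insertions can push the rate above 100%.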

nglehuy commented 3 years ago

@gandroz Wow cool, if you got the same result for test-other then you should check the transcript file to see if it points to test-other files. And you should check the test-clean transcripts file too. Anyway, I'm thinking that maybe the authors have some tricks that reduce the result to 2.7% that we didn't see.

nglehuy commented 3 years ago

And one more thing is that there's a very small difference between greedy and beam search at this kind of WER percent, so we can ignore the difference and test only on greedy to see if it reduces to near 2.7-3%, for getting faster results

gandroz commented 3 years ago

I'll try to continue training for several more epochs, since training does not seem to have finished. I'll read the paper again to look for any clue on how to reduce WER even more. But I don't have anything special in my transcripts; both test-clean and test-other are well segregated.

nglehuy commented 3 years ago

@gandroz You should check or generate the transcript file again; maybe when creating the test-other transcript file, you pointed to the test-clean directory. If everything is right, then it's so weird haha :laughing:

gandroz commented 3 years ago

I checked both files, my config file too and got the same results. So weird. I'll try to debug to find any mistake

On Sat, Jan 23, 2021 at 13:03, Nguyễn Lê Huy wrote:

@gandroz https://github.com/gandroz You should check or generate the transcript file again, may be when creating test-other transcript file, you point to the test-clean directory. If everything is right, then it's so weird haha 😆


gandroz commented 3 years ago

I found why I always got the same test metrics.... I tested on the test-clean dataset and it saved a test.tsv file, but each time I performed another test, as there was already an existing file, only the metrics were computed and no inference was done. I've cleaned this file and have launched another test with the test-other dataset to continue the update.
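A minimal sketch of one way to avoid this kind of stale-results problem: write one results file per test set and only skip inference when a file for that same set already exists. The helper names and the file layout are hypothetical, not the repository's actual test script.

```python
import os
import tempfile


def results_path(outdir, dataset_name):
    """Hypothetical helper: one results file per test set, e.g. test-other.tsv."""
    return os.path.join(outdir, f"{dataset_name}.tsv")


def run_test(outdir, dataset_name, infer_fn, overwrite=False):
    """Run inference unless a results file for this dataset already exists."""
    path = results_path(outdir, dataset_name)
    if os.path.exists(path) and not overwrite:
        return path, False          # cached predictions reused, inference skipped
    with open(path, "w", encoding="utf-8") as f:
        for line in infer_fn():
            f.write(line + "\n")
    return path, True               # inference actually ran


with tempfile.TemporaryDirectory() as tmp:
    fake_infer = lambda: ["utt1\thello world\thello world"]
    print(run_test(tmp, "test-clean", fake_infer)[1])  # True: first run
    print(run_test(tmp, "test-clean", fake_infer)[1])  # False: cached file reused
    print(run_test(tmp, "test-other", fake_infer)[1])  # True: different set, new file
```

Passing `overwrite=True` forces a fresh inference pass when the checkpoint has changed.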

ncilfone commented 3 years ago

@gandroz Can you post your full config file you are using to generate the ~5% WER results?

Thanks!!!

gandroz commented 3 years ago

@ncilfone sure !

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  output_path_prefix: /data/models/asr/conformer_sentencepiece_subword
  model_type: unigram
  target_vocab_size: 1024
  blank_at_zero: True
  beam_width: 5
  norm_score: True
  corpus_files:
    - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
    - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
    - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0.1
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 1
  prediction_layer_norm: True
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  augmentations:
    after:
      time_masking:
        num_masks: 10
        mask_factor: 100
        p_upperbound: 0.05
      freq_masking:
        num_masks: 1
        mask_factor: 27

  dataset_config:
    train_paths:
      - /data/datasets/LibriSpeech/train-clean-100/transcripts.tsv
      - /data/datasets/LibriSpeech/train-clean-360/transcripts.tsv
      - /data/datasets/LibriSpeech/train-other-500/transcripts.tsv
    eval_paths:
      - /data/datasets/LibriSpeech/dev-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/dev-other/transcripts.tsv
    test_paths:
      - /data/datasets/LibriSpeech/test-clean/transcripts.tsv
      - /data/datasets/LibriSpeech/test-other/transcripts.tsv
    tfrecords_dir: null

  optimizer_config:
    warmup_steps: 10000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 50
    outdir: /data/models/asr/conformer_sentencepiece_subword
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    checkpoint:
      filepath: /data/models/asr/conformer_sentencepiece_subword/checkpoints/{epoch:02d}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
    states_dir: /data/models/asr/conformer_sentencepiece_subword/states
    tensorboard:
      log_dir: /data/models/asr/conformer_sentencepiece_subword/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

I used a sentencepiece (unigram) model as vocab, currently trying with the BPE version
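As a rough illustration of what the `speech_config` values imply, the sketch below converts `frame_ms` and `stride_ms` into sample counts and estimates sequence lengths. It assumes no padding of the waveform and a 4x time reduction in the encoder subsampling; both are assumptions, and the real featurizer may count frames differently.

```python
sample_rate = 16000   # Hz, from speech_config
frame_ms = 25
stride_ms = 10

frame_length = sample_rate * frame_ms // 1000   # samples per analysis window
frame_step = sample_rate * stride_ms // 1000    # samples between window starts


def num_frames(num_samples):
    """Frames for a clip, assuming no padding (last partial window dropped)."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // frame_step


def subsampled(frames, reduction=4):
    """Encoder time steps, assuming a 4x time reduction (ceil division)."""
    return -(-frames // reduction)


frames_10s = num_frames(10 * sample_rate)
print(frame_length, frame_step)            # 400 160
print(frames_10s, subsampled(frames_10s))  # 998 250
```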

ncilfone commented 3 years ago

Thanks @gandroz!

Is that the vocab here: vocabularies/librispeech_train_4_1030.subwords

Edit: Based on the config it seems like you might generate one before training?

Also is this just single GPU training?

gandroz commented 3 years ago

no it's not that vocab. However, you can train yours with script\generate_vocab_sentencepiece.py giving your config file. And I'm training on two GTX 1080Ti. It took soooo long to train, I'm looking for a way to pre-compute the fbanks as they are computed on the fly which might take some time.

ncilfone commented 3 years ago

Yeah just realized that you generate it based on the config options. Thanks for letting me know!

I'm assuming you are doing the featurization of the WAV files in TF as the stft etc. should be a bit faster on the GPU. DALI might be another place to look too although I've never used it...

ncilfone commented 3 years ago

Final question I promise... It looks like you are using and tokens in SentencePiece but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

gandroz commented 3 years ago

I think the best way to accelerate processing is to pre-process the fbanks, just as it is done in fairseq. For your information, featurization is done by the class tensorflow_asr\featurizers\speech_featurizers.py::TFSpeechFeaturizer.
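A minimal sketch of the pre-computation idea, assuming features can simply be pickled to disk and keyed by audio path. `cached_features` and `fake_extract` are hypothetical names; a real setup would plug in the actual featurizer and would also need to invalidate the cache when the speech config changes.

```python
import hashlib
import os
import pickle
import tempfile


def cached_features(audio_path, extract_fn, cache_dir):
    """Return features for audio_path, computing them only on the first call."""
    key = hashlib.sha1(audio_path.encode("utf-8")).hexdigest()
    cache_file = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(cache_file):
        with open(cache_file, "rb") as f:
            return pickle.load(f)
    features = extract_fn(audio_path)
    with open(cache_file, "wb") as f:
        pickle.dump(features, f)
    return features


calls = []


def fake_extract(path):
    """Stand-in for a real fbank extractor; records each invocation."""
    calls.append(path)
    return [[0.0] * 80 for _ in range(3)]   # 3 frames x 80 bins


with tempfile.TemporaryDirectory() as cache:
    cached_features("a.wav", fake_extract, cache)
    cached_features("a.wav", fake_extract, cache)
    print(len(calls))   # 1: the second call was served from disk
```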

I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

I'm not sure I understand your question. Sentencepiece is an unsupervised text tokenizer and detokenizer, so you have to train a model on the transcripts from LibriSpeech. During training, the tokenized transcripts in each batch are padded to the length of the longest sentence in that batch.
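The per-batch padding described here can be sketched as follows. The pad id of 0 is an assumption chosen for illustration, and `pad_batch` is not the repository's function.

```python
def pad_batch(sequences, pad_id=0):
    """Pad each token-id list with pad_id up to the longest list in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    lengths = [len(seq) for seq in sequences]   # true lengths, needed by the loss
    return padded, lengths


batch = [[12, 7, 301], [45], [9, 88]]
padded, lengths = pad_batch(batch)
print(padded)   # [[12, 7, 301], [45, 0, 0], [9, 88, 0]]
print(lengths)  # [3, 1, 2]
```

Keeping the true lengths alongside the padded ids lets the loss ignore the padded positions.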

ncilfone commented 3 years ago

Ugh forgot that markdown will remove the notation I used... This is what I meant...

It looks like you are using <sos> and <eos> tokens in SentencePiece but I'm guessing the text featurizer for the LibriSpeech transcripts doesn't have those? Or do you pad them onto each one?

gandroz commented 3 years ago

Oh I see. You are right, the transcripts do not have those tokens, and they are useless as far as I understand it. However, you can add them when encoding some text. You can find more details in the repo, and I've just realized that there is a tensorflow binding.... I think I'll try it instead of the python implementation I used.

tund commented 3 years ago

Hi @gandroz , Have you tested on test-other set, and what is the result? Thanks!

gandroz commented 3 years ago

@tund not yet, it took me a week to test on test-clean and I did not have time yet

tund commented 3 years ago

Thanks for your reply @gandroz . Since the performance using beam-search is quite close to the greedy-search, I think only running greedy-search will be much faster. Another question: do you use Gradient Accumulation for training? I saw: "accumulation_steps: 4" in the config file, but not sure what your training command exactly is.

gandroz commented 3 years ago

Indeed, I could just perform greedy search for this test. In the near future perhaps... And yes, I used gradient accumulation.
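For reference, a small worked calculation of the effective batch size implied by the config posted earlier. It assumes `batch_size` is counted per GPU; if the trainer treats it as a global batch size, the GPU factor should be dropped.

```python
batch_size = 2            # running_config.batch_size
accumulation_steps = 4    # running_config.accumulation_steps
num_gpus = 2              # two GTX 1080Ti, as mentioned earlier in the thread

# Assumption: batch_size is per replica; drop num_gpus if it is already global.
effective_batch = batch_size * accumulation_steps * num_gpus
print(effective_batch)  # 16
```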

ncilfone commented 3 years ago

@gandroz any chance you can post your loss curves?

gandroz commented 3 years ago

sure

[two images: loss curves]

The glitches at the end are due to an infinite-loop bug that was fixed afterwards (evaluation occurred endlessly after training ended). I trained the model for 40 epochs first and continued for 10 more epochs.

mjurkus commented 3 years ago

How are you able to achieve such good results with your models? I've trained a conformer subword model, but it stops improving after ~20 epochs.

I've updated the Keras trainer to use EarlyStopping, which stops the training process after 5 epochs without improvement in validation loss.
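To make the patience rule concrete, here is a simplified sketch of early stopping on a list of validation losses. It is not the Keras callback itself, and the loss values are made up for illustration.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best, wait = loss, 0       # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Made-up validation losses, for illustration only.
losses = [300, 250, 240, 241, 242, 240.5, 243, 244, 239]
print(early_stopping_epoch(losses, patience=5))  # 8
print(early_stopping_epoch(losses, patience=6))  # None
```

In this toy run the loss would have improved again at epoch 9, so a patience of 5 stops one epoch too early; a noisy validation curve may need a larger patience.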

What am I missing?

Train data: 50hrs
Eval data: 7hrs
Using TF RNN Loss

Audio lengths. Not sure :

mean       2.646981
std        2.420535
min        0.100000
25%        0.900000
50%        1.570000
75%        4.030000
max       20.000000

The test results are complete rubbish:

G_WER = 114.837982
G_CER = 88.0064
B_WER = 100
B_CER = 100
BLM_WER = 100
BLM_CER = 100

config

speech_config:
  sample_rate: 16000
  frame_ms: 25
  stride_ms: 10
  num_feature_bins: 80
  feature_type: log_mel_spectrogram
  preemphasis: 0.97
  normalize_signal: True
  normalize_feature: True
  normalize_per_feature: False

decoder_config:
  vocabulary: vocabularies/lithuanian.subwords
  target_vocab_size: 4096
  max_subword_length: 4
  blank_at_zero: True
  beam_width: 0
  norm_score: True
  corpus_files:
    - /tf_asr/manifests/liepa.tsv

model_config:
  name: conformer
  encoder_subsampling:
    type: conv2d
    filters: 144
    kernel_size: 3
    strides: 2
  encoder_positional_encoding: sinusoid_concat
  encoder_dmodel: 144
  encoder_num_blocks: 16
  encoder_head_size: 36
  encoder_num_heads: 4
  encoder_mha_type: relmha
  encoder_kernel_size: 32
  encoder_fc_factor: 0.5
  encoder_dropout: 0.1
  prediction_embed_dim: 320
  prediction_embed_dropout: 0
  prediction_num_rnns: 1
  prediction_rnn_units: 320
  prediction_rnn_type: lstm
  prediction_rnn_implementation: 2
  prediction_layer_norm: False
  prediction_projection_units: 0
  joint_dim: 320
  joint_activation: tanh

learning_config:
  train_dataset_config:
    use_tf: True
    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27
    data_paths:
      - /tf_asr/manifests/liepa_train.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-train
    shuffle: True
    cache: False
    buffer_size: 100
    drop_remainder: True

  eval_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_eval.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-eval
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  test_dataset_config:
    use_tf: True
    data_paths:
      - /tf_asr/manifests/liepa_test.tsv
    tfrecords_dir: /tf_asr/tfrecords/tfrecords-test
    shuffle: False
    cache: False
    buffer_size: 100
    drop_remainder: True

  optimizer_config:
    warmup_steps: 40000
    beta1: 0.9
    beta2: 0.98
    epsilon: 1e-9

  running_config:
    batch_size: 2
    accumulation_steps: 4
    num_epochs: 20
    outdir: /tf_asr/models
    log_interval_steps: 300
    eval_interval_steps: 500
    save_interval_steps: 1000
    early_stopping:
      monitor: "val_val_rnnt_loss"
      mode: "min"
      patience: 5
      verbose: 1
    checkpoint:
      filepath: /tf_asr/models/checkpoints/epoch-{epoch:02d}-{val_val_rnnt_loss:.4f}.h5
      save_best_only: True
      save_weights_only: False
      save_freq: epoch
      verbose: 1
      monitor: "val_val_rnnt_loss"
      mode: "min"
    states_dir: /tf_asr/models/states
    tensorboard:
      log_dir: /tf_asr/models/tensorboard
      histogram_freq: 1
      write_graph: True
      write_images: True
      update_freq: 'epoch'
      profile_batch: 2

nglehuy commented 3 years ago

@mjurkus Could you show the loss curves?

gandroz commented 3 years ago

@mjurkus my training was performed on the LibriSpeech data, 960h of training data. ASR needs lots of data to converge, so maybe you need more. Furthermore, maybe the LibriSpeech data is cleaner than yours? I also have some proprietary data, but it is way worse than LibriSpeech (not even the same sampling rate). But perhaps you could share the training curves?

mjurkus commented 3 years ago

Yeah, the amount of data is the answer... That's what I thought.

Here are a couple:

Very clean, 16k data, 50hrs: [image: train_rnnt_loss, val_val_rnnt_loss]

Mixed data (clean and noisy), 16k, 100hrs: [image: train_rnnt_loss, val_val_rnnt_loss]

It's hard to get good labeled data for my language.

gandroz commented 3 years ago

Your model does not seem to learn anything.... Try to reduce your LR, explore some data augmentation as it could help.

mjurkus commented 3 years ago

Using conformer with characters worked way better than using subwords. I managed to get decent results (WER ~15%), but I do not have the graphs for those.

Regarding augmentation, I figured that this config enables it.

    augmentation_config:
      after:
        time_masking:
          num_masks: 10
          mask_factor: 100
          p_upperbound: 0.05
        freq_masking:
          num_masks: 1
          mask_factor: 27

jinggaizi commented 3 years ago

I've just finished training with espnet (same setup, except join_dim=640); the resulting WER is test_clean: 4.9, test_other: 11.9. How can I get the results in the Conformer paper? @gandroz have you received any reply from the Conformer authors?

nglehuy commented 3 years ago

I've just finished training with espnet (same setup, except join_dim=640); the resulting WER is test_clean: 4.9, test_other: 11.9. How can I get the results in the Conformer paper? @gandroz have you received any reply from the Conformer authors?

@jinggaizi What vocabulary size did you use: 1k, 4k, or English characters (around 28)?

jinggaizi commented 3 years ago

1k

gandroz commented 3 years ago

@jinggaizi no, I have no news from the author. I could try to email him again, he's smart. However, I am surprised by the WER you achieved with ESPNET. They say they had much better results (however I suspect it was not with the small model, but anyway). Did you use the RNNT or a transformer as the decoder? When ESPNET announced they had the same or better results than the paper, it was with a transformer, as you can see in their sources.

Maybe you could ask ESPNET how they managed to achieve such good results.... on which machine, with which config, etc.

jinggaizi commented 3 years ago

@usimarit thanks for your reply. My result used RNNT as the decoder: the encoder is the small-size conformer, the decoder is 1 LSTM layer (dim=320), and the dimension of the joint network is 640. espnet (https://github.com/espnet/espnet/tree/master/egs2/librispeech/asr1) has no RNNT result, and I suspect its results are better because of speed augmentation.

jinggaizi commented 3 years ago

@gandroz hi, do you have any news from the author? Did you train the model on GPU or TPU? Have you ever tried a larger batch size? I assume Google always uses a larger batch size. I have only worked on a Titan Xp with a small batch size; maybe a larger batch size can improve the transducer's result.

ncilfone commented 3 years ago

@jinggaizi I've run it with a batch size of 2048 (which is what I think they used in the original paper taken from this ref here http://arxiv.org/abs/2011.06110) via batch accumulation on 8 GPUs (with a joint dim of 320) for days and I can barely get below 5.9% on dev-clean.

jinggaizi commented 3 years ago

It seems like a larger batch size doesn't help; I have no new ideas.

On Tue, Feb 23, 2021 at 10:02 PM, Nicholas Cilfone wrote:

@jinggaizi I've run it with a batch size of 2048 (which is what I think they used in the original paper taken from this ref here http://arxiv.org/abs/2011.06110) via batch accumulation on 8 GPUs (with a joint dim of 320) for days and I can barely get below 5.9% on dev-clean.

nglehuy commented 3 years ago

@ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient.
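As a toy illustration of what "mimic" means here, the sketch below shows that averaging gradients over equally sized micro-batches reproduces the gradient of one large batch for a simple model with a mean loss. The model and data are made up. The equivalence is only about the gradient: it says nothing about speed, and layers whose output depends on batch statistics (such as batch normalization) still see only the small micro-batch.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean squared error for the model y = w * x."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]
w = 0.5

# One large batch of 8 examples.
full_grad = grad_mse(w, xs, ys)

# Four micro-batches of 2, gradients averaged before a single weight update.
micro, steps = 2, 4
accumulated = 0.0
for k in range(steps):
    sl = slice(k * micro, (k + 1) * micro)
    accumulated += grad_mse(w, xs[sl], ys[sl]) / steps

print(abs(full_grad - accumulated) < 1e-9)  # True
```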

jinggaizi commented 3 years ago

@ncilfone which GPUs did you use with the 2048 batch size? Did you improve the RNNT training as described in https://arxiv.org/pdf/1909.12415.pdf?

gandroz commented 3 years ago

Just a follow-up with the author of the paper. I asked him for some clues on how we can achieve the same results. I asked about the dataset and whether the model was pre-trained or not, and asked for details on the hyperparameters that are not always mentioned in the paper. He was kind enough to answer me, but without enough detail to help us much. Here it is:

Re: training set. We use the Librispeech 960h train set as mentioned in our paper.

Re: batch sizes. What batch-size do you use and what's the WER do you see on Librispeech Dev/Devother/Test/Testother datasets? I think this can be one reason, I can actually run an experiment with the same small batch size as yours and update you with the result. We ran our experiments on a batch size of 2048 and trained till 90-100k steps. To evaluate, we sampled 5 ckpts and picked the best one based on the dev/devother performance. Let me know what settings do you use and I can train and report back to you with the results.

So maybe a major difference comes from the batch size, which is.....HUGE! I really don't know how they manage to train the large (or even the small) model with so much data. Maybe an avenue could be to split the model over multiple GPUs instead of replicating the model on multiple GPUs. We could surely increase the batch size by doing so.
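The checkpoint selection the author describes (sample 5 checkpoints, keep the best on dev/devother) could be sketched as below. The checkpoint names and WER values are invented placeholders, not real results, and averaging the two dev sets is an assumption about how "best" was decided.

```python
# Hypothetical dev WERs (%) for five sampled checkpoints; NOT real results.
dev_wer = {
    "ckpt-090k": {"dev-clean": 5.6, "dev-other": 13.4},
    "ckpt-092k": {"dev-clean": 5.4, "dev-other": 13.0},
    "ckpt-095k": {"dev-clean": 5.3, "dev-other": 13.3},
    "ckpt-097k": {"dev-clean": 5.5, "dev-other": 12.8},
    "ckpt-100k": {"dev-clean": 5.4, "dev-other": 13.1},
}


def pick_best(results):
    """Pick the checkpoint with the lowest mean WER across the dev sets."""
    def mean_wer(name):
        scores = results[name].values()
        return sum(scores) / len(scores)
    return min(results, key=mean_wer)


print(pick_best(dev_wer))  # ckpt-097k
```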

nglehuy commented 3 years ago

Thanks @gandroz, they have their HUGE TPUs, that's why they're able to get SOTA results. I'll try to implement gradient accumulation in keras builtin function and test on colab TPUs, hope it will get nearer to their result.

MadhuAtBerkeley commented 3 years ago

Hi @usimarit, I see a high-bias issue: rnnt_loss stays in the 240s and does not go down further in the conformer trainer (both the keras and non-keras versions). I tried learning rates of 0.5/sqrt(dmodel), 0.05/sqrt(dmodel), and 0.005/sqrt(dmodel) with the 960 hours of LibriSpeech. There is not much difference in the loss curve. Please let me know if I need to modify anything in the config file to train a model that matches the WER performance of the reference latest.h5 (WER of 6.5 in my testing). Thanks
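For context on these learning-rate values, here is a sketch of the Transformer ("Noam") schedule with the `warmup_steps` and `encoder_dmodel` values from the configs in this thread. It assumes the trainer follows that schedule; the `scale` argument is a hypothetical multiplier added for illustration.

```python
def transformer_lr(step, d_model=144, warmup_steps=10000, scale=1.0):
    """Transformer ("Noam") schedule: linear warmup, then inverse-sqrt decay."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


# The peak is reached at step == warmup_steps and equals scale / sqrt(d_model * warmup_steps).
peak_10k = transformer_lr(10000, warmup_steps=10000)
peak_40k = transformer_lr(40000, warmup_steps=40000)
print(round(peak_10k, 6))  # 0.000833
print(round(peak_40k, 6))  # 0.000417
```

Under this formula, raising `warmup_steps` from 10000 to 40000 halves the peak learning rate and delays it by 30000 steps.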

nglehuy commented 3 years ago

@MadhuAtBerkeley I trained with that config on google drive, except that I used use_tf: False (the config on drive is not updated to latest version but it still has the same meaning)

MadhuAtBerkeley commented 3 years ago

@usimarit Thanks! I confirm that use_tf:False does help and now I see loss curve going below 100.

thanatl commented 3 years ago

Why does setting use_tf to False help the training, given that the tf version and the numpy version perform a similar method?

BuaaAlban commented 3 years ago

I've just finished training with espnet (same setup, except join_dim=640); the resulting WER is test_clean: 4.9, test_other: 11.9. How can I get the results in the Conformer paper? @gandroz have you received any reply from the Conformer authors?

Hi, could you please post your config in espnet?

nglehuy commented 3 years ago

Why does setting use_tf to False help the training, given that the tf version and the numpy version perform a similar method?

The only difference is the numpy version uses nlpaug which randomly chooses time masking and freq masking to do augmentation where the tf version applies both time and freq masking. The tf version works fine for me on TPUs.
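The difference described here can be sketched in plain Python. The parameter semantics (mask width drawn up to `mask_factor`, time masks capped at `p_upperbound` of the sequence length, masked cells set to zero) are assumptions for illustration; the function names are not the repository's.

```python
import random


def time_mask(spec, num_masks=10, mask_factor=100, p_upperbound=0.05, rng=random):
    """Zero out up to num_masks random spans of frames (rows)."""
    T = len(spec)
    for _ in range(num_masks):
        width = min(rng.randint(0, mask_factor), int(T * p_upperbound))
        start = rng.randint(0, T - width)
        for t in range(start, start + width):
            spec[t] = [0.0] * len(spec[t])
    return spec


def freq_mask(spec, num_masks=1, mask_factor=27, rng=random):
    """Zero out up to num_masks random bands of feature bins (columns)."""
    F = len(spec[0])
    for _ in range(num_masks):
        width = rng.randint(0, min(mask_factor, F))
        start = rng.randint(0, F - width)
        for row in spec:
            for f in range(start, start + width):
                row[f] = 0.0
    return spec


def augment_both(spec, rng=random):
    """Apply time masking, then frequency masking (the tf behaviour described above)."""
    return freq_mask(time_mask(spec, rng=rng), rng=rng)


def augment_one(spec, rng=random):
    """Apply only one of the two, chosen at random (the nlpaug behaviour described above)."""
    return rng.choice([time_mask, freq_mask])(spec, rng=rng)


rng = random.Random(0)
spec = [[1.0] * 80 for _ in range(200)]   # 200 frames x 80 bins of ones
masked = augment_both([row[:] for row in spec], rng=rng)
print(sum(v == 0.0 for row in masked for v in row))  # number of masked cells
```

On average, applying both masks removes more of each spectrogram than applying one at random, which is a plausible (but unverified) reason the two settings train differently.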

jinggaizi commented 3 years ago

I've just finished training with espnet (same setup, except join_dim=640); the resulting WER is test_clean: 4.9, test_other: 11.9. How can I get the results in the Conformer paper? @gandroz have you received any reply from the Conformer authors?

Hi, could you please post your config in espnet?

batch-size: 6
maxlen-in: 800
maxlen-out: 150

criterion: loss
early-stop-criterion: "validation/main/loss"
sortagrad: 0
opt: noam
epochs: 50
patience: 0
accum-grad: 4
grad-clip: 5.0

etype: transformer
enc-block-arch:

transformer-lr: 10
transformer-warmup-steps: 25000

transformer-enc-positional-encoding-type: rel_pos
transformer-enc-self-attn-type: rel_self_attn

rnnt-mode: 'rnnt' # switch to 'rnnt-att' to use transducer with attention
model-module: "espnet.nets.pytorch_backend.e2e_asr_transducer:E2E"

jinggaizi commented 3 years ago

@gandroz hi, is there any response from the author about running some experiments with a small batch size? Have you tried other methods to improve the result?

gandroz commented 3 years ago

@jinggaizi No, no news from the author; I'll let you know as soon as I have some. I cannot work on the project for the moment, so no news from me either.

AgaDob commented 3 years ago

no it's not that vocab. However, you can train yours with script\generate_vocab_sentencepiece.py giving your config file. And I'm training on two GTX 1080Ti. It took soooo long to train, I'm looking for a way to pre-compute the fbanks as they are computed on the fly which might take some time.

Hey, thanks for the updated config! Any rough estimates of how long it took to train (I'm guessing a few days at least)? Also, any luck with pre-computing fbanks?

changji95 commented 3 years ago

@ncilfone batch accumulation is just to mimic the large batch size, I believe they use actual large batch size, which is way more efficient.

Hello, is gradient accumulation not supported in the latest version (v1.0.0)?