TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost State-of-the-art Automatic Speech Recognition in Tensorflow 2. Supported languages that can use characters or subwords
https://huylenguyen.com/asr
Apache License 2.0

State of the Art for conformer and beam decoding #106

Open abhinavg4 opened 3 years ago

abhinavg4 commented 3 years ago

Hi, thanks for developing this great toolkit. I have 2 questions about the Conformer model:

  1. For the conformer model in examples/conformer, I think almost all the parameters are similar to Conformer (S) from https://arxiv.org/pdf/2005.08100.pdf. However, the performance gap between the paper and the model in examples/conformer seems to be quite big (2.7 vs 6.44 on test-clean). What do you think might be the reason for this?

One reason I can see is that 2.7 is obtained with beam search whereas 6.44 is without, but I don't think beam search alone can account for that difference. Can you give me some pointers on how I can reduce this gap? Also, did you try decoding examples/conformer with beam search?

  2. I was trying to decode examples/conformer with beam search (test_subword_conformer.py) using the pre-trained model provided via Drive. For this I just modified the beam-width parameter in config.yml, but the decoding is taking a very long time (about 30 min per batch, with ~650 total batches in test-clean) on an NVIDIA P40 with 24GB memory.

Is this the expected behaviour, or do I need to do something more than changing beam-width from 0 to 4/8? What was the decoding time for you?

Thanks, Abhinav

nglehuy commented 3 years ago

Hi @abhigarg-iitk

  1. I tried using beam search but the WER was still around 6%. The only reason I can think of is that the model is not fully converged, since it was trained for only about 25 epochs. You can see in the transducer loss image in the example that the gap between the val loss and train loss was still big, and it seems the losses could have decreased further with longer training. Unfortunately, at this time I don't have the resources to continue training the model.
  2. I tested on CPU, and greedy and beam search took only around 60s in total per batch with batch_size=1. Could you have mistaken the total time for all batches for the time of each batch? I think on GPU you should use a larger batch size to decode instead of 1.
abhinavg4 commented 3 years ago
  1. Thanks, I will look into that.

  2. Actually I tested on GPU as well as CPU, and yes, you are right: for both it takes about 60s per batch with batch_size=1. However, using a batch size greater than 1 does not help, as the decoding still loops through each element of the batch one by one. I agree with your comment in #58, but I think we can sort the test data by length to minimize the padding (a rough sketch is at the end of this comment). Also, I think we could have an option to do beam search decoding over the whole batch instead of iterating over it one sample at a time. Are you planning to add an option for batched beam decoding?

    2.1. My initial impression was that parallel_iterations in the while_loop might process each element of a batch in parallel, but in practice I didn't observe that; the beam search was iterating over the samples one by one.

    2.2. Please have a look at #110 too.

Thanks
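
For reference, here is a rough sketch of the sort-by-length idea mentioned in point 2, assuming the test features are already loaded as [T, F] arrays; `make_sorted_test_dataset` is a hypothetical helper, not something that exists in this repo:

```python
import tensorflow as tf

# Hypothetical helper (not part of this repo): sort test utterances by
# duration before batching so that similar-length utterances share a batch
# and padding (i.e. wasted decoding steps) stays minimal.
def make_sorted_test_dataset(features, batch_size=8):
    """`features` is assumed to be a list of [T, F] float32 feature arrays."""
    features = sorted(features, key=lambda f: f.shape[0])  # shortest first
    ds = tf.data.Dataset.from_generator(
        lambda: iter(features),
        output_signature=tf.TensorSpec(shape=[None, None], dtype=tf.float32),
    )
    # Pad each batch only up to the longest utterance in that batch.
    return ds.padded_batch(batch_size, padded_shapes=[None, None])
```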

nglehuy commented 3 years ago

@abhigarg-iitk I'm also planning to use the batch dimension directly in the decoding; I'll find a way to do that ASAP.

gandroz commented 3 years ago

Hi @abhigarg-iitk I agree with you, I did not succeed in reaching the same WER either, even after more than 50 epochs (maybe it is still not fully converged). However, the difference between the beam search and greedy search results is not that big, which tells me it is one of two things:

- at such WER, the model is so efficient that a beam search does not improve the predictions
- or there is an issue in the beam search implementation

Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried with a vocab of size 8000 but the training does not fit in memory.

abhinavg4 commented 3 years ago

Hi @gandroz ,

In my opinion, regarding your two statements:

> - at such WER, the model is so efficient that a beam search does not improve the predictions
> - or there is an issue in the beam search implementation

I think the first one might be true: unless we have shallow fusion with an LM, the beam search might not be that effective (for reference, see Table 3 in this work). #123 has a good point though; maybe we can have a look at some of the standard beam search implementations in ESPnet.

> Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried with a vocab of size 8000 but the training does not fit in memory.

Although the Conformer paper doesn't explicitly mention the vocab size, the ContextNet paper mentions using a 1k word-piece model, and I assume Conformer might be using the same vocab. Moreover, maybe we can infer the vocab size from the number of parameters mentioned in the Conformer paper.
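
As a rough sanity check on that idea, the vocab-dependent part of the parameter count can be estimated and compared against the total reported for Conformer (S). The joint dimension below is an assumption (the paper does not state it), so this is only a back-of-the-envelope sketch:

```python
# Back-of-the-envelope only: in an RNN-T, the parameters that scale with the
# vocab size V are roughly the prediction-net embedding (V x emb_dim) and the
# joint-net output projection (joint_dim x V plus a V-sized bias).
emb_dim = 320    # prediction-net dim reported for Conformer (S)
joint_dim = 640  # assumed joint projection size (not stated in the paper)

for vocab_size in (1000, 4096, 8000):
    vocab_params = vocab_size * emb_dim + joint_dim * vocab_size + vocab_size
    print(f"V={vocab_size:>5}: ~{vocab_params / 1e6:.2f}M vocab-dependent parameters")
```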

gandroz commented 3 years ago

Hi @abhigarg-iitk

I contacted the first author of the paper and here is his answer:

> Regarding the tokenizer, we use an internal word-piece model with a vocabulary of 1K. Regarding training recipes, we 'only' train on the LibriSpeech 970 hours train dataset.
>
> ESPnet reproduced our results and integrated Conformer into their toolkit and posted strong results on LibriSpeech without language model fusion: 1.9/4.9/2.1/4.9 (dev/dev-other/test/test-other). Note, their model used a Transformer decoder (compared to our RNN-T decoder, which should as well help improve over these results).
>
> On the other hand, we also have open-sourced our implementation of the Conformer layer in the encoder, which might be helpful to refer to. Hope this helps!

abhinavg4 commented 3 years ago

Hi @gandroz ,

Thanks for this answer. I had looked earlier into the Lingvo implementation of Conformer, and one strange contrast was the use of a feed-forward (FF) layer in the convolution module instead of the pointwise conv used in the original paper. Also, the class name says "Lightweight conv layer", which is also mentioned in the paper.

In fact, I also tried replacing the pointwise conv with FF layers, but the results were somewhat worse, although I didn't check my implementation thoroughly.

Even ESPnet seems to use pointwise conv and not FF (link).
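
For context, here is a minimal sketch of the convolution module as described in the Conformer paper (pointwise conv → GLU → depthwise conv → BatchNorm → swish → pointwise conv, with a residual). The dimensions follow the Conformer (S) table, but the code is only illustrative and not the exact implementation of any of the toolkits mentioned:

```python
import tensorflow as tf

def conformer_conv_module(x, dim=144, kernel_size=32, dropout=0.1, training=False):
    """Conformer convolution module sketch: LayerNorm -> pointwise conv (2*dim)
    -> GLU -> depthwise conv -> BatchNorm -> swish -> pointwise conv -> dropout,
    plus a residual connection."""
    residual = x                                                  # [B, T, dim]
    x = tf.keras.layers.LayerNormalization()(x)
    x = tf.keras.layers.Conv1D(2 * dim, kernel_size=1)(x)         # pointwise
    a, b = tf.split(x, 2, axis=-1)
    x = a * tf.nn.sigmoid(b)                                      # GLU
    x = tf.keras.layers.Conv1D(dim, kernel_size, padding="same",
                               groups=dim)(x)                     # depthwise
    x = tf.keras.layers.BatchNormalization()(x, training=training)
    x = tf.nn.swish(x)
    x = tf.keras.layers.Conv1D(dim, kernel_size=1)(x)             # pointwise
    x = tf.keras.layers.Dropout(dropout)(x, training=training)
    return x + residual
```

Note that a pointwise Conv1D with kernel_size=1 is mathematically the same as a Dense layer applied per frame, so a plain FF-vs-pointwise swap should be equivalent in isolation; any difference likely comes from the surrounding GLU/depthwise structure rather than from the pointwise layers themselves.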

nglehuy commented 3 years ago

@abhigarg-iitk @gandroz Before changing the beam search, I had already adapted the beam search code from ESPnet here and tested it; the WER was lower but not much different from greedy.

gandroz commented 3 years ago

I made a quick review of the model code and did not find any great difference from the ESPnet implementation. Maybe in the decoder... the paper refers to a single-layer LSTM, whereas the Transformer decoder in ESPnet seems to add MHA layers.

gandroz commented 3 years ago

Also in ESPNet:

> Our Conformer model consists of a Conformer encoder proposed in [7] and a Transformer decoder.
>
> In ASR tasks, the Conformer model predicts a target sequence Y of characters or byte-pair-encoding (BPE) tokens from an input sequence X of 80-dimensional log-mel filterbank features with/without 3-dimensional pitch features. X is first sub-sampled in a convolutional layer by a factor of 4, as in [4], and then fed into the encoder and decoder to compute the cross-entropy (CE) loss. The encoder output is also used to compute a connectionist temporal classification (CTC) loss [17] for joint CTC-attention training and decoding [18]. During inference, a token-level or word-level language model (LM) [19] is combined via shallow fusion.

So, definitely, the approach from ESPnet is not purely the one described in the Conformer paper.

nglehuy commented 3 years ago

There are 2 things I'm not sure about in the paper: variational noise and the structure of the prediction and joint networks. I don't know whether they have dense layers right after the encoder and prediction net, or only a dense layer after adding the 2 inputs, and whether there is a layernorm or projection in the prediction net. The ContextNet paper says the structure is from this paper, which says the vocabulary is 4096 word-pieces.
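
To make the ambiguity concrete, here is a minimal sketch of the two joint-network structures being discussed; the dimensions are illustrative assumptions, not values confirmed by the paper:

```python
import tensorflow as tf

# Variant A: dense projections right after the encoder and prediction net,
# add them, then a single vocab-sized output layer.
def joint_project_then_add(enc_out, pred_out, joint_dim=640, vocab_size=1000):
    enc = tf.keras.layers.Dense(joint_dim)(enc_out)     # [B, T, joint_dim]
    pred = tf.keras.layers.Dense(joint_dim)(pred_out)   # [B, U, joint_dim]
    joint = tf.nn.tanh(enc[:, :, None, :] + pred[:, None, :, :])  # [B, T, U, joint_dim]
    return tf.keras.layers.Dense(vocab_size)(joint)     # [B, T, U, vocab]

# Variant B: no per-branch dense layers; add the raw outputs (their dims must
# already match) and apply a single dense layer only after the addition.
def joint_add_then_dense(enc_out, pred_out, vocab_size=1000):
    joint = tf.nn.tanh(enc_out[:, :, None, :] + pred_out[:, None, :, :])
    return tf.keras.layers.Dense(vocab_size)(joint)
```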

gandroz commented 3 years ago

@usimarit I'm waiting for an answer about the joint network and the choices made by the Conformer team. I'll let you know when I have further details.

pourfard commented 3 years ago

Hi, after 10 epochs the transducer_loss is about 3.5 for my training data (300 hours), but the test results are not promising. What is the transducer_loss after 20 or 30 epochs for your training data (LibriSpeech)? Does it get under 1.0 after 20-30 epochs? Should I still wait for 20 more epochs? Every 5 epochs takes 1 day on my 1080 Ti.

Here is my training log. @gandroz @usimarit @abhigarg-iitk

gandroz commented 3 years ago

@pourfard I could not say... I'm using the whole training dataset (960h), and after 50 epochs the loss was 22.16 on the training dataset and 6.3 on the dev one. And yes, it takes very long to train...

(image attached)

tund commented 3 years ago

> @usimarit I'm waiting for an answer about the joint network and the choices made by the Conformer team. I'll let you know when I have further details.

Hi @gandroz: have you heard anything back from the Conformer authors?

gandroz commented 3 years ago

@tund Not yet, I'll try to poke him again tomorrow.

tund commented 3 years ago

Thanks @gandroz