abhinavg4 opened this issue 3 years ago
Hi @abhigarg-iitk, about `batch_size=1`: did you mistake the time of each batch for the time of all batches? I think on GPU you should use a larger batch size to decode instead of 1.

Thanks, I will look into that.
Actually I tested on GPU as well as CPU, and yes, you are right: for both it takes about 60s per batch with `batch_size=1`. However, using a batch size greater than 1 is not helping, as it still loops through each element of the batch one by one. I agree with your comment in #58, but I think we can sort the test data based on length to minimize the padding (a minimal sketch follows below). Also, I think we can have an option to do beam search decoding over the whole batch instead of iterating over it one at a time. Are you planning to add an option for batched beam decoding?
2.1 - My initial impression was that `parallel_iterations` in the `tf.while_loop` might process each element in a batch in parallel, but in practice I didn't observe that: the beam search was still iterating over each sample one by one.
2.2 - Please have a look at #110 too.
Thanks
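For illustration, here is a minimal sketch of the length-sorting idea mentioned above, assuming the test utterances are already loaded as NumPy feature arrays with their reference transcripts; `features`, `transcripts`, and `make_sorted_test_dataset` are hypothetical names, not part of TensorFlowASR:

```python
import tensorflow as tf

def make_sorted_test_dataset(features, transcripts, batch_size=8):
    """Sort utterances by frame count so each batch pads to a similar length.

    features:    list of [time, num_mels] float32 NumPy arrays (placeholder)
    transcripts: list of reference strings, same order (placeholder)
    """
    order = sorted(range(len(features)), key=lambda i: features[i].shape[0])
    features = [features[i] for i in order]
    transcripts = [transcripts[i] for i in order]

    def gen():
        for feat, text in zip(features, transcripts):
            yield feat, text

    num_mels = features[0].shape[1]
    ds = tf.data.Dataset.from_generator(
        gen,
        output_signature=(
            tf.TensorSpec(shape=[None, num_mels], dtype=tf.float32),
            tf.TensorSpec(shape=[], dtype=tf.string),
        ),
    )
    # padded_batch only pads up to the longest utterance in each
    # (already similar-length) batch, so little compute is wasted on padding.
    return ds.padded_batch(batch_size, padded_shapes=([None, num_mels], []))
```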
@abhigarg-iitk I'm also planning to use the batch dimension directly in the decoding; I'll find a way to do that ASAP.
Hi @abhigarg-iitk, I agree with you: I did not succeed in reaching the same WER either, even after more than 50 epochs (maybe still not fully converged). However, the difference between the beam search and greedy search results is not that large, telling me one of two things:
- at such WER, the model is so efficient that a beam search does not improve the predictions
- or there is an issue in the beam search implementation

Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried with a vocab of size 8000, but the training does not fit in memory.
Hi @gandroz ,
In my opinion
- at such WER, the model is so efficient that a beam search does not improve the predictions
- or there is an issue in the beam search implementation
I think the first statement might be true. Unless we have shallow fusion with an LM, the beam search might not be that effective; for reference, see Table 3 in this work. Although #123 has a good point, maybe we can have a look at some of the standard beam search implementations of ESPNet.
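As a rough illustration of what shallow fusion adds to beam search, here is a minimal sketch of the score combination at one expansion step, assuming per-step log-probabilities from the ASR model and an external LM are available; all names here are hypothetical and this is not TensorFlowASR API:

```python
import tensorflow as tf

def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine ASR and LM scores for one beam-search expansion step.

    asr_log_probs: [beam, vocab] log-probabilities from the ASR model.
    lm_log_probs:  [beam, vocab] log-probabilities from an external LM.
    lm_weight:     interpolation weight (lambda) for the LM.
    """
    # Shallow fusion: score(y) = log P_asr(y | x) + lambda * log P_lm(y)
    return asr_log_probs + lm_weight * lm_log_probs

# Dummy usage with random tensors and a 1k vocabulary
asr = tf.nn.log_softmax(tf.random.normal([4, 1000]), axis=-1)
lm = tf.nn.log_softmax(tf.random.normal([4, 1000]), axis=-1)
scores = shallow_fusion_scores(asr, lm)
topk = tf.math.top_k(scores, k=4)  # candidates to extend each hypothesis
```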
Another improvement could come from the vocabulary. The current example uses a vocab size of 1000, but maybe a bigger vocab could help. I tried with a vocab of size 8000, but the training does not fit in memory.
Although the Conformer paper doesn't explicitly mention the vocab size, the ContextNet paper mentions using a 1k word-piece model, and I assume Conformer might be using the same vocab. Moreover, maybe we can infer the vocab size from the number of parameters mentioned in the Conformer paper.
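In case someone wants to experiment with the vocabulary size, here is a minimal SentencePiece sketch, assuming the LibriSpeech transcripts have been dumped to a plain-text file; the file path and model prefix are placeholders:

```python
import sentencepiece as spm

# Train a 1k word-piece (unigram) model; bump vocab_size to 4096 or 8000 to experiment.
spm.SentencePieceTrainer.train(
    input="transcripts.txt",        # one transcript per line (placeholder path)
    model_prefix="librispeech_1k",  # writes librispeech_1k.model / .vocab
    vocab_size=1000,
    model_type="unigram",           # "bpe" is the other common choice
    character_coverage=1.0,
)

sp = spm.SentencePieceProcessor(model_file="librispeech_1k.model")
print(sp.encode("hello world", out_type=str))
```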
Hi @abhigarg-iitk
I contacted the first author of the paper and here is his answer:
Regarding the tokenizer, we use an internal word-piece model with a vocabulary of 1K. Regarding training recipes, we 'only' train on the Librispeech 970 hours Train-dataset.
ESPNet reproduced our results and integrated Conformer into their toolkit and posted strong results on Librispeech without Language model fusion: 1.9/4.9/2.1/4.9 (dev/devother/test/testother). Note, their model used a Transformer decoder (compared to our RNN-T decoder which should as well help improve over these results).
On the other hand, we also have open-sourced our implementation of the Conformer Layer in the encoder which might be helpful to refer to. Hope this helps!
Hi @gandroz ,
Thanks for this answer. I had looked earlier into the Lingvo implementation of Conformer, and one strange contrast was the use of a feed-forward (FF) layer in the convolution module instead of the pointwise conv used in the original paper. Also, the class name says "Lightweight conv layer", which is also mentioned in the paper.
In fact, I also tried replacing the pointwise conv with FF layers, but the results were somewhat worse, although I didn't check my implementation thoroughly.
Even ESPNet seems to use pointwise conv and not FF (link).
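For reference, here is a rough Keras sketch of the convolution module as the Conformer paper describes it (pointwise conv, GLU, depthwise conv, batch norm, swish, pointwise conv, residual); the sizes are illustrative and this is neither the TensorFlowASR nor the Lingvo code:

```python
import tensorflow as tf

def conformer_conv_module(dmodel=144, kernel_size=32, name="conv_module"):
    """Sketch of the paper's convolution module (not the repo's exact code)."""
    inputs = tf.keras.Input(shape=[None, dmodel])
    x = tf.keras.layers.LayerNormalization()(inputs)
    # Pointwise conv expanding to 2*dmodel channels, followed by GLU gating.
    x = tf.keras.layers.Conv1D(filters=2 * dmodel, kernel_size=1)(x)
    a, b = tf.split(x, 2, axis=-1)
    x = a * tf.sigmoid(b)  # GLU
    # Depthwise conv over time (done via Conv2D with a dummy spatial axis).
    x = tf.keras.layers.DepthwiseConv2D((kernel_size, 1), padding="same")(
        tf.expand_dims(x, axis=2))
    x = tf.squeeze(x, axis=2)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.activations.swish(x)
    # Second pointwise conv back to dmodel, plus residual connection.
    x = tf.keras.layers.Conv1D(filters=dmodel, kernel_size=1)(x)
    outputs = inputs + x
    return tf.keras.Model(inputs, outputs, name=name)
```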
@abhigarg-iitk @gandroz Before changing the beam search, I already inherited the beam search code from ESPNet here and tested it; the WER was lower but not much different from greedy.
I made a quick review of the model code and did not find any great difference from the ESPNet implementation. Maybe in the decoder... the paper refers to a single-layer LSTM, whereas the Transformer decoder in ESPNet seems to add MHA layers.
Also in ESPNet:
Our Conformer model consists of a Conformer encoder proposed in [7] and a Transformer decoder
In ASR tasks, the Conformer model predicts a target sequence Y of characters or byte-pair-encoding (BPE) tokens from an input sequence X of 80 dimensional log-mel filterbank features with/without 3-dimensional pitch features. X is first sub-sampled in a convolutional layer by a factor of 4, as in [4], and then fed into the encoder and decoder to compute the cross-entropy (CE) loss. The encoder output is also used to compute a connectionist temporal classification (CTC) loss [17] for joint CTC-attention training and decoding [18]. During inference, token-level or word-level language model (LM) [19] is combined via shallow fusion.
So definitively, the approach from ESPNet is not purely the one described in the Conformer paper.
There are 2 things I'm not sure about in the paper: variational noise and the structure of the prediction and joint networks. I don't know if they have dense layers right after the encoder and prediction net, or only a dense layer after adding the 2 inputs, or layernorm or a projection in the prediction net. The ContextNet paper says the structure is from this paper, which says the vocabulary is 4096 word-pieces.
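To make the open question above concrete, here is one plausible joint-network layout (separate projections after the encoder and prediction net, added, tanh, then a vocab-sized dense layer); this is only one of the variants being discussed, not confirmed to be what the Conformer authors did:

```python
import tensorflow as tf

class JointNetwork(tf.keras.layers.Layer):
    """One plausible RNN-T joint net: project both inputs, add, tanh, vocab dense."""

    def __init__(self, joint_dim=640, vocab_size=1000, **kwargs):
        super().__init__(**kwargs)
        self.enc_proj = tf.keras.layers.Dense(joint_dim)   # projection after the encoder
        self.pred_proj = tf.keras.layers.Dense(joint_dim)  # projection after the prediction net
        self.vocab_dense = tf.keras.layers.Dense(vocab_size)

    def call(self, enc_out, pred_out):
        # enc_out: [B, T, D_enc], pred_out: [B, U, D_pred]
        enc = tf.expand_dims(self.enc_proj(enc_out), axis=2)     # [B, T, 1, joint]
        pred = tf.expand_dims(self.pred_proj(pred_out), axis=1)  # [B, 1, U, joint]
        joint = tf.nn.tanh(enc + pred)                           # [B, T, U, joint]
        return self.vocab_dense(joint)                           # [B, T, U, vocab]
```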
@usimarit I'm waiting for an answer about the joint network and the choices made by the conformer team. I'll let you know when I have further details
Hi, after 10 epochs the transducer_loss is about 3.5 for my training data (300 hours), but the test results are not promising. What is the transducer_loss after 20 or 30 epochs for your training data (LibriSpeech)? Does it get under 1.0 after 20-30 epochs? Should I still wait for 20 more epochs? Every 5 epochs takes 1 day on my 1080 Ti.
Here is my training log. @gandroz @usimarit @abhigarg-iitk
@pourfard I could not say... I'm using the whole training dataset (960h), and after 50 epochs the losses were 22.16 on the training set and 6.3 on the dev sets. And yes, it takes a very long time to train...
Hi @gandroz, have you got something back from the Conformer's authors?
@tund not yet, I'll try to poke him again tomorrow
Thanks @gandroz
Hi, thanks for developing this great toolkit. I had 2 questions about the Conformer model:

1. For `examples/conformer`, I think almost all the parameters are similar to Conformer(S) of https://arxiv.org/pdf/2005.08100.pdf. However, the performance gap between the paper and the Conformer model in `examples/conformer` seems to be quite big (2.7 vs 6.44 for test-clean). What do you think might be the reason for this? One reason I can see is that 2.7 is obtained with beam search whereas 6.44 is not, but I don't think just beam search can bring that difference. Can you give me some pointers on how I can reduce this gap? Also, did you try decoding with beam search for `examples/conformer`?

2. I tried decoding `examples/conformer` with beam search using `test_subword_conformer.py` with the pre-trained model provided via drive. For this I just modified the `beam-width` parameter in config.yml. But the decoding is taking a very long time (about 30 min per batch, total number of batches in test-clean ~650) on an Nvidia P40 with 24GB memory. Is this the expected behaviour, or do I need to do something more than changing `beam-width` from 0 to 4/8? What was the decoding time for you?

Thanks, Abhinav