Closed stefan-falk closed 3 years ago
Hi @stefan-falk, you are free to ask any question :smile: Here are my answers.
So, if I get this right, what this does is, as we stream-decode, initialize each LSTM with its previous state from the last time-step. Is that correct?
=> Yes, that's correct. But we don't forward-pass the last state manually, because each LSTM layer already does that for us. We have to keep track of the last state of EACH LSTM layer because the `call` function of PredictionNetwork doesn't know the states of the previous batch (batch = 1), since the LSTM layers are stateless (each time-step is a batch). The reason why we don't use stateful LSTMs (which would be much easier because the LSTM layers save the last states themselves) is that TFLite doesn't support stateful LSTM yet.
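As an illustration (a minimal NumPy sketch, not the repo's code), threading `(h, c)` across calls by hand reproduces exactly what a single pass over the full sequence computes. This is what keeping `p_memory_states`/`p_carry_states` per layer buys you when each call only sees one chunk:

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM time-step; gates ordered i, f, g, o (as in Keras)."""
    z = x @ W + h @ U + b
    n = h.shape[-1]
    i = 1 / (1 + np.exp(-z[:, :n]))
    f = 1 / (1 + np.exp(-z[:, n:2 * n]))
    g = np.tanh(z[:, 2 * n:3 * n])
    o = 1 / (1 + np.exp(-z[:, 3 * n:]))
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d, n, T = 3, 4, 5                       # input dim, units, time-steps
W = rng.normal(size=(d, 4 * n))
U = rng.normal(size=(n, 4 * n))
b = np.zeros(4 * n)
xs = rng.normal(size=(T, 1, d))         # a 5-frame "utterance", batch = 1

# Offline: the whole utterance in one pass (states flow internally).
h, c = np.zeros((1, n)), np.zeros((1, n))
for x in xs:
    h, c = lstm_step(x, h, c, W, U, b)
full_h = h

# Streaming: chunk 1 (3 frames), save the states, then chunk 2 (2 frames).
h, c = np.zeros((1, n)), np.zeros((1, n))
for x in xs[:3]:
    h, c = lstm_step(x, h, c, W, U, b)
saved = (h, c)                  # what p_memory_states / p_carry_states hold
h, c = saved
for x in xs[3:]:
    h, c = lstm_step(x, h, c, W, U, b)

assert np.allclose(h, full_h)   # identical to the offline pass
```

A stateful layer would keep `saved` for you; since TFLite can't do that, the states are returned to the caller and fed back in on the next chunk.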
To get it right, I looked at your EncoderNetwork code and it seems like you pass the last state of the PREVIOUS LSTM layer as the initial state of the NEXT LSTM layer. That doesn't seem right: in a recurrent layer, the state of the current time-step depends on the state of the previous time-step, but the state of the first time-step of the NEXT LSTM layer and the state of the last time-step of the PREVIOUS LSTM layer are independent of each other :thinking:
Q: Why are we not storing the states of the EncoderNetwork like we do for the PredictionNetwork?
=> Because the EncoderNetwork doesn't have to have RNN layers. If you read the newest SOTA paper (Conformer - https://arxiv.org/abs/2005.08100), you will see they use convolution, feed-forward and self-attention to replace recurrent layers in the EncoderNetwork.
=> Yes, it would make sense if you use recurrent layers in the EncoderNetwork and keep track of the last states. So you can write your own Transducer and override the `recognize` method to store those states like the PredictionNetwork does :laughing:
=> But I think it doesn't bring much effect. If you stream chunks of audio (250ms each), assume that you say "hello" but the first chunk only records "he"; when the next chunk "llo" comes in, the PredictionNetwork has already predicted "he" and knows the last state is "e" => which, sort of, means that it knows the last time-step of the previous features is the character "e" => you don't have to store the EncoderNetwork states since the PredictionNetwork does that instead :smile: However, we would need to compare the results of storing EncoderNetwork states with the results of NOT storing them to see whether it actually increases accuracy.
=> Yes, it would really make sense if you want the EncoderNetwork to know what the previous audio features were, but hey, that's the "customization" :laughing: I only provide general solutions so that people can build their custom models on top of them.
The second part concerns the input of the prediction network. I can see that you're prepending the ids with the blank (0) symbol. So `[1, 2, 3]` will be changed to `[0, 1, 2, 3]`. Now, we're then also using `Dataset.padded_batch` in order to align examples, and here we're also using the same blank symbol. This means the sample could end up looking something like this: `[0, 1, 2, 3, 0, 0]` - is this correct?
=> I'm not sure. The `warprnnt_tensorflow` aka warp-transducer requires the `acts` to have the shape `[B, T, U+1, V]`, where the `+1` means prepending the blank (0) symbol, so I prepend blank to the input of the prediction network. I think padding with the blank symbol makes sense because what symbol represents "no audio" better than blank? (of course not `<space>` :laughing:)
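A small sketch of what that input preparation amounts to (hypothetical label ids, plain Python rather than the repo's pipeline):

```python
blank = 0
labels = [[1, 2, 3], [4, 5]]   # hypothetical label ids for two utterances

# Prediction-network input: prepend blank (the "+1" in [B, T, U+1, V]) ...
pred_inputs = [[blank] + y for y in labels]

# ... then pad every example to the longest one, also with blank
# (like Dataset.padded_batch with padding value 0).
max_len = max(len(y) for y in pred_inputs)
padded = [y + [blank] * (max_len - len(y)) for y in pred_inputs]

print(padded)  # [[0, 1, 2, 3], [0, 4, 5, 0]]
```

So the leading 0 and the trailing padding 0s really are the same symbol, which is exactly the situation the question describes.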
=> About the one-hot encoding, I haven't read the paper carefully so I don't know, but I think warp-transducer does remove the one-hot blank so that it could be a vector of zeros. If you want to know badly, I suggest you read the code of warp-transducer :rofl:
To get it right, I looked at your EncoderNetwork code and it seems like you pass the last state of the PREVIOUS LSTM layer as the initial state of the NEXT LSTM layer. That doesn't seem right: in a recurrent layer, the state of the current time-step depends on the state of the previous time-step, but the state of the first time-step of the NEXT LSTM layer and the state of the last time-step of the PREVIOUS LSTM layer are independent of each other 🤔
I think you're correct. When stacking LSTMs, each layer should have its own internal state based on the output sequence of the previous LSTM. What I was doing is passing the state of each LSTM forward to the next layer. I think I had a classic encoder-decoder model like the one below in mind when I wrote that.
```python
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)
```
=> Because the EncoderNetwork doesn't have to have RNN layers. If you read the newest SOTA paper (Conformer - https://arxiv.org/abs/2005.08100), you will see they use convolution, feed-forward and self-attention to replace recurrent layers in the EncoderNetwork.
Ah, I see - I assumed it was mainly so that the encoder can be swapped for some other architecture, but in the case of a transducer that uses RNNs I was just surprised to see that only the PredictionNetwork uses memory. :)
=> But I think it doesn't bring much effect because if you stream chunks of audio (250ms each), ....
I guess your argument is right. :)
I think padding with the blank symbol makes sense because what symbol represents "no audio" better than blank?
Yes, I agree on that. The paper states that they've done it that way :)
=> About the one-hot encoding, I haven't read the paper carefully so I don't know, but I think warp-transducer does remove the one-hot blank so that it could be a vector of zeros. If you want to know badly, I suggest you read the code of warp-transducer 🤣
Indeed, now that you mention it... There is the `blank_label` argument for `warprnnt_tensorflow.rnnt_loss`:
```python
'''Computes the RNNT loss between a sequence of activations and a
ground truth labeling.

Args:
    ...
    blank_label: int, the label value/index that the RNNT
        calculation should use as the blank label
    ...
'''
```
@usimarit thanks a lot for your answer and sharing this repository! :)
I'm going to close the issue 😃 👍
No problem, I'm glad that you like this repo :laughing:
@usimarit Hi again! 😄
So.. I've been running a few experiments on my own implementation which is largely inspired by rnnt-speech-recognition and TiramisuASR.
However, it seems that there's either something wrong with the model or with the implementation of `recognize()`.
This keeps me up at night as I am just not able to get something meaningful from the model e.g.
--
Predicted (reco): der undneunzig des undllung des undsiebzig des undsiebzig des undfünfzig �������artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel
Target: hast du eine schöne jacke für circa hundert dollar
--
Predicted (reco): der undneunzig des undllung des undsiebzig des undsiebzig des undfünfzig �������artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel
Target: ich habe etwas käse mitgebracht
--
...
My implementation is slightly different: I am using the time-reduction idea from the Google paper and also layer normalization as suggested there. To make sure this is not the problem, I started different experiments where I disabled these additional layers, but they all look very much the same:
I have checked multiple times, but I do not really see a significant difference between your implementation and my own. Or I don't know what I am missing here.
Hence, I'd like to know if your loss looks similar when you train your model. It's not very stable - I assume that's due to the very small batch size (4-6 samples).
Here is the code I am using in case you want to take a look:
Thank you for any insight! 😄
Hi @stefan-falk
In my experience, for the model to converge, the mean value of a loss function like `ctc_loss` or `rnnt_loss` over about 200 batches must be smaller than 30. The value of your loss is high, so I guess the model has not converged yet. I have trained a transducer for Vietnamese and the loss reduced to about 11 for val_loss and 7 for train_loss. I haven't tested yet :laughing: but I don't think `recognize()` is the issue because everything seems very logical.
This is my current log of conformer:
I don't think the time reduction or layer norm or your implementation is the problem.
Did you load the trained weights? In the usage code you gave me, it doesn't seem like you load the trained weights :laughing:
So my thought is either the model hasn't converged or the weights weren't loaded.
I just tested and it seems like there is a problem with `recognize()`. I'm trying to find it :sob:
The `recognize` is inspired by https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/transducer/rnn_decoders.py but I can't find any issue :sob:
@usimarit Oh, alright so maybe it might not be the model?
Okay, I guess in that case I'll take a closer look as soon as I have the time. Unfortunately I'm busy over the weekend, but I will try to find out what the issue is and let you know if I find something!
Just on the side: do you know an implementation that can decode the entire model output? I just need to test whether the model is able to predict anything at all. Algorithm 1 in https://arxiv.org/pdf/1211.3711.pdf should do the job. If not, I'll implement it after the weekend. 😄
BR;
@stefan-falk Algorithm 1 is Beam Search, which I implemented in `recognize_beam()`, but unfortunately it runs really slowly. We should find a faster algorithm for Transducer beam search.
@stefan-falk I just found out that using `mask_zero=True` in the `Embedding` like you did makes sense and it works (well, I think; I need to test more), and the final step is to merge the repeated characters (which I haven't implemented yet).
Hi @stefan-falk
Good news, I found the issue. It is `recognize()`.
The issue was that I had it wrong: at each time-step, ONLY a NON-BLANK predicted character is accepted (if it's blank, keep the previous hyp). After I fixed it, the model predicts quite well; here is my example in Vietnamese:
The code is updated on master (I dropped `tf.py_function` and changed to `tf.while_loop` for converting to TFLite).
If you find the `tf.while_loop` difficult to follow, here is the code from the espnet repo; you will see the `pred != self.blank` condition:
```python
def recognize(self, h, recog_args):
    """Greedy search implementation.

    Args:
        h (torch.Tensor): encoder hidden state sequences (Tmax, Henc)
        recog_args (Namespace): argument Namespace containing options

    Returns:
        hyp (list of dicts): 1-best decoding results
    """
    z_list, c_list = self.zero_state(h.unsqueeze(0))
    ey = to_device(self, torch.zeros((1, self.embed_dim)))

    hyp = {"score": 0.0, "yseq": [self.blank]}

    y, (z_list, c_list) = self.rnn_forward(ey, (z_list, c_list))

    for hi in h:
        ytu = F.log_softmax(self.joint(hi, y[0]), dim=0)
        logp, pred = torch.max(ytu, dim=0)

        if pred != self.blank:
            hyp["yseq"].append(int(pred))
            hyp["score"] += float(logp)

            eys = to_device(
                self, torch.full((1, 1), hyp["yseq"][-1], dtype=torch.long)
            )
            ey = self.embed(eys)

            y, (z_list, c_list) = self.rnn_forward(ey[0], (z_list, c_list))

    return [hyp]
```
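To make the non-blank rule concrete, here is a toy NumPy version of the same greedy loop. The `[T, V]` score matrix is hypothetical and stands in for `joint(enc_t, pred_state)`; a real decoder would also advance the prediction network on each emission, as the espnet code does.

```python
import numpy as np

def greedy_decode(joint_logits, blank=0):
    """Simplified greedy loop over encoder frames (at most one symbol
    per frame). joint_logits: hypothetical precomputed [T, V] scores."""
    yseq = []
    for t in range(joint_logits.shape[0]):
        pred = int(np.argmax(joint_logits[t]))
        if pred != blank:      # the fix: a blank keeps the previous hyp
            yseq.append(pred)  # (and would advance the prediction net)
    return yseq

logits = np.array([
    [0.1, 2.0, 0.0],   # frame 0 -> symbol 1
    [3.0, 0.0, 0.0],   # frame 1 -> blank: emit nothing
    [0.0, 0.1, 2.5],   # frame 2 -> symbol 2
])
print(greedy_decode(logits))  # [1, 2]
```

Without the `pred != blank` check, every blank frame would re-emit or corrupt the hypothesis, which is exactly the bug described above.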
@usimarit Great work!
I wasn't 100% sure about whether or not `mask_zero` makes sense, but since my model didn't work, everything came under suspicion of breaking things! 😆
Note: I think setting it to `True` also means that the `blank` label has to have the value 0, so one should guarantee this to avoid a bug, I guess?
It does seem, however, that there's still a problem with my own implementation. 😞 I've been using the `tf.while_loop` code and it does work as such, but my model tends to predict garbage. Also, the predicted text always contains similar words.
-
Predicted (reco): der mitte des vierzehnkeit jahrhunderte das entspfrei von den empfindete
Target: hast du eine schöne jacke für circa hundert dollar
-
Predicted (reco): der mitte des siebte jahrhunderts sante die bürgerzone renreich
Target: ich habe etwas käse mitgebracht
What kind of batch size are you using? My batch size is rather small (2-4 samples), and the loss does not seem to drop that low after just 200 batches (which means 200 to 800 examples in my case).
Are you plotting the average loss over time? In my plot you see the average loss for each batch. I am not averaging over time, hence it's not that smooth.
My loss looks like this after training over the weekend:
However, I don't really think that's the problem.
I think I'll have to continue to test before I get my cake 🍰
@stefan-falk I'm using batch size 4; the Conformer was trained on Google Colab for 22 hours on a dataset containing audio clips of 1-10 seconds each.
In my experience, for the model to converge, the mean value of loss function like ctc_loss and rnnt_loss of about 200 batches must be smaller than 30.
=> What I mean is I only plot the average loss over about 200 batches (before computing batch 201, the `tf.keras.metrics.Mean` is reset to 0), and the value of 30 is not for the "first" 200 batches but the loss value after something like 15 epochs (the loss values depend on the data). In general, the final loss value should be < 12 for `rnnt_loss` :smile: in my case. You should create a "validation" dataset because it makes it easier to see whether the model has converged. When the model converges, the "validation loss" and "train loss" curves will meet, and after that the validation loss will be higher due to overfitting. And if you use only training data, the loss logs have little meaning (in my opinion, the purpose of logging losses is to see whether the model has converged, and that requires validation data).
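A plain-Python sketch of that windowed averaging (the class name and the toy loss values are made up for illustration):

```python
class WindowedMean:
    """Running mean that is reset every `window` batches, mimicking
    resetting tf.keras.metrics.Mean before batch 201, 401, ..."""

    def __init__(self, window=200):
        self.window = window
        self.total = 0.0
        self.count = 0
        self.points = []            # one plotted value per window

    def update(self, batch_loss):
        self.total += batch_loss
        self.count += 1
        if self.count == self.window:
            self.points.append(self.total / self.count)
            self.total, self.count = 0.0, 0   # reset for the next window

m = WindowedMean(window=200)
for step in range(400):
    m.update(30.0 if step < 200 else 10.0)    # toy loss values
print(m.points)  # [30.0, 10.0]
```

Plotting one point per 200-batch window is why the curve looks much smoother than a per-batch plot of the same run.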
Note: I think setting it True means also that the blank label has to have the value 0 so should guarantee this I guess to avoid a bug?
=> I've tested the `mask` option and the no-`mask` option in the Embedding layer. These are the results using `recognize()`:

| | WER (%) | CER (%) |
|---|---|---|
| No mask | 39.8549805 | 21.5941677 |
| Masked | 35.3230591 | 18.7023373 |

=> It seems like the `mask` option brings better results. However, I took the model trained with the mask option, then disabled the mask and loaded the weights of that trained model
=> The trained masked model had learned to ignore the prepended blank index, whereas the new "no mask" model reusing those weights does not ignore it.
=> Therefore, I think if you use the `mask` option, then the blank must always be 0. And if you want to use another blank index, don't use the `mask` option and let the model learn the prepended blank.
I think you should check these:
- `.lower()`
Your implementation still seems fine to me. And I have no idea how you got those results :disappointed:
The best suggestion I can give is to spend some time checking the whole pipeline again: data, audio preprocessing, text preprocessing, models, recognize, and postprocessing from indices to characters.
=> What I mean is I only plot the average loss over about 200 batches (before computing batch 201, the tf.keras.metrics.Mean is reset to 0), and the value of 30 is not for the "first" 200 batches but the loss value after something like 15 epochs (the loss values depend on the data). In general, the final loss value should be < 12 for rnnt_loss 😄 in my case.
Yes, I thought you were probably plotting some average :) Thank you for those numbers - they give me some orientation as to where I should land 🤣
You should create a "validation" dataset because it makes it easier to see whether the model has converged. When the model converges, the "validation loss" and "train loss" curves will meet, and after that the validation loss will be higher due to overfitting.
I have a validation dataset, but I disabled it during development. I'll start another training run and see what the validation loss looks like. Last time I checked it was "fine" in the sense that it converged like the training loss.
=> I've tested the mask option and the no-mask option in the Embedding layer. These are the results using recognize():
Alright, `mask_zero` it is! It really seems to improve the model.
=> Therefore, I think if you use mask option, then the blank must always be 0
Agree. :)
I think you should check these: ..
I am afraid you are right. I'll have to go through all these things and check whether they work. I am convinced that they should, but that's obviously not the case.
I just have one question regarding
Embedding_size is usually greater than vocab_size (i.e. 256)
So... the entire time I was (re)using a dataset which already has a vocabulary - that vocab size is around 5000 subwords. Do you think this could be the problem? As far as I know it should work with a larger vocabulary (subwords instead of characters) as well, but maybe that's the problem?
Your implementation still seems fine to me. And I have no idea how you got those results 😞
Thank you for taking a look! 👍
Small update:
This is another training run I started overnight. I have increased the model size, which seems to be helpful:
The loss is not dropping as fast as yours but to me it looks like the model converges as it should.
However, the issue is still there when I evaluate the model in a separate script. So the issue might be in that script, but that just does not make any sense because in that script I am just loading the data and sending it to the model 🤷♂️ what am I doing wrong 🤣
@stefan-falk I used a learning rate schedule, that's why the loss decreased so fast :laughing:
I think the model has converged for now, so the issue lies either in the `recognize()` or in the test data not matching between audio and labels. I'm pretty sure you do the test data right. Maybe you can reuse my new `recognize()` to see if it solves the problem, because the `recognize_beam()` I implemented using for-loops and dicts like that still produces repeated characters.
@usimarit I think I have a candidate for the issue.
So... I've been re-using a dataset the whole time, and with it comes some preprocessing logic that is responsible for computing the MFCC features from the audio. I ported that code from TF1, but I didn't think about one last step inside that routine: applying convolutions to the input.
I think what happens is that I am applying this preprocessing logic during training, but the convolutions I mentioned are not part of the model and hence never get loaded and/or applied at evaluation time:
```python
# apply_convolutions()
mel_fbanks.set_shape([None, None, num_mel_bins, num_channels])
mel_fbanks = tf.pad(mel_fbanks, [[0, 0], [0, 8], [0, 0], [0, 0]])
for _ in range(2):
    mel_fbanks = tf.compat.v1.layers.conv2d(mel_fbanks, 128, (3, 3), (2, 2), use_bias=False)
    mel_fbanks = layer_norm(mel_fbanks)
    mel_fbanks = tf.nn.relu(mel_fbanks)
mel_fbanks_shape = mel_fbanks.get_shape().as_list()
# Apply a convolution that will remove all frequencies and at the same time
# project the output into desired hidden_size
mel_fbanks = tf.pad(mel_fbanks, [[0, 0], [0, 2], [0, 0], [0, 0]])
mel_fbanks = tf.compat.v1.layers.conv2d(mel_fbanks, hidden_size, (3, mel_fbanks_shape[2]))
assert mel_fbanks.get_shape().as_list()[2] == 1
mel_fbanks = layer_norm(mel_fbanks)
mel_fbanks = tf.nn.relu(mel_fbanks)
```
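One way to see why dropping these layers at evaluation time is fatal: besides carrying trained weights, the two strided convolutions also change the number of time frames. A rough sketch of the length arithmetic for VALID convolutions, with a hypothetical input frame count:

```python
def conv_out_len(t, kernel, stride):
    # Output length of a VALID convolution along the time axis.
    return (t - kernel) // stride + 1

t = 100                    # hypothetical number of input frames
t += 8                     # the first tf.pad adds 8 frames in time
for _ in range(2):         # two (3, 3) convs with stride (2, 2)
    t = conv_out_len(t, 3, 2)
t += 2                     # the second tf.pad adds 2 frames
t = conv_out_len(t, 3, 1)  # final (3, freq) conv, stride 1 in time
print(t)  # 26
```

So the encoder was trained on sequences roughly 4x shorter (and in a learned feature space) than the raw filterbanks it would see if this step is skipped.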
So, I think those layers are just "not there" when I try to evaluate the model. 🤦
I'll try to move this part inside the model as an additional layer of the `EncoderNetwork`, retrain the whole thing, and hope that this is indeed the issue.
@usimarit As expected: this was indeed causing the issue ^^ The model finally works and produces something useful. Just wanted to let you know and say thanks again for your support! 😄
@stefan-falk No problem :laughing: I'm gonna close the issue here :+1:
I am trying to understand how the streaming decode works. There are a few things I am not sure I completely understand, so I hope it's okay if I ask here.
The first part concerns the memory of the prediction network. In `TransducerPrediction` I see that there are two arguments, `p_memory_states` and `p_carry_states`. These arguments are used in `Transducer.perform_greedy` to initialize the states of the LSTM stack during prediction/recognition. So, if I get this right, what this does is, as we stream-decode, initialize each LSTM with its previous state from the last time-step. Is that correct?
And, we have to keep track of each individual layer (instead of passing the last state forward) because during streaming decode we're essentially looking at only one time slice every time we run the model:
I think I understand this part so far but:
Q: Why are we not storing the states of the EncoderNetwork like we do for the PredictionNetwork?
If we're streaming, where `features` are the spectrogram features, wouldn't it make sense to also keep the internal LSTM state(s) of the encoder? My own implementation of the model is slightly different: the encoder network is a stack of LSTMs, whereas in your example you're only using one LSTM. But in both cases we have internal states which we're not carrying along for `Transducer.recognize`, and I am not sure I understand why this is the case.

EncoderNetwork Code (click to expand)
```python
class EncoderNetwork(network.Network):
    def __init__(
        self,
        num_layers: int,
        lstm_units: int,
        time_reduction_index: int = None,
        time_reduction_factor: int = 2,
        dropout: float = 0,
        *args, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.reduction_index = time_reduction_index
        self.reduction_factor = time_reduction_factor
        self.lstm_stack = list()
        for i in range(num_layers):
            lstm = layers.LSTM(
                units=lstm_units,
                return_sequences=True,
                return_state=True,
                dropout=dropout
            )
            norm = layers.LayerNormalization()
            self.lstm_stack.append((lstm, norm))
        if self.reduction_index:
            self.time_reduction = TimeReduction(self.reduction_factor)

    def call(self, inputs, training=None, mask=None):
        x = inputs
        states = None
        for i, (lstm, norm) in enumerate(self.lstm_stack):
            x, state_h, state_c = lstm(x, initial_state=states)
            x = norm(x)
            states = state_h, state_c
            if self.reduction_index and i == self.reduction_index:
                x = self.time_reduction(x)
        return x
```

Shouldn't we keep those states as well? What if I stream the first 2 seconds of an audio, and then the next 2 seconds, and so on? Shouldn't we keep track of the state of the EncoderNetwork as well in that case?
The second part concerns the input of the prediction network. I can see that you're prepending the ids with the `blank` (0) symbol. So `[1, 2, 3]` will be changed to `[0, 1, 2, 3]`. Now, we're then also using `Dataset.padded_batch` in order to align examples, and here we're also using the same `blank` symbol. This means the sample could end up looking something like this: `[0, 1, 2, 3, 0, 0]` - is this correct? One-hot encoded this would take the form:
I am asking this because in https://arxiv.org/pdf/1211.3711.pdf the blank symbol is actually a vector containing all zeros:
and I was wondering whether this could make a difference?
Thank you for shedding any light on this :)