TensorSpeech / TensorFlowASR

:zap: TensorFlowASR: Almost state-of-the-art Automatic Speech Recognition in TensorFlow 2. Supports languages that use characters or subwords.
https://huylenguyen.com/asr
Apache License 2.0

Question: Transducer.recognize for streaming-decode #13

Closed · stefan-falk closed this issue 3 years ago

stefan-falk commented 4 years ago

I am trying to understand how streaming-decode works. There are a few things I'm not sure I completely understand, so I hope it's okay to ask here.

The first part concerns the memory of the prediction network. In TransducerPrediction I see that there are two arguments, p_memory_states and p_carry_states:

outputs = self.embed(inputs, training=training)
outputs = self.do(outputs, training=training)

n_memory_states = []
n_carry_states = []

for i, lstm in enumerate(self.lstms):
    initial_state = [p_memory_states[i], p_carry_states[i]] if has_memories else None

    outputs, new_memory_state, new_carry_state = lstm(outputs, training=training, initial_state=initial_state)

    n_memory_states.append(tf.expand_dims(new_memory_state, 0))
    n_carry_states.append(tf.expand_dims(new_carry_state, 0))

return outputs, tf.concat(n_memory_states, axis=0), tf.concat(n_carry_states, axis=0)

These arguments are used in Transducer.perform_greedy to initialize the states of the LSTM stack during prediction/recognition.

So, if I get this right, what this does is initialize each LSTM with its previous state from the last time-step as we stream-decode. Is that correct?

And, we have to keep track of each individual layer's state (instead of just forward-passing the last state) because during streaming-decode we're essentially looking at only one time slice each time we run the model:

hi = tf.reshape(enc[i], [1, 1, -1])  # <-- Take the i-th slice of the encoder output
y, n_memory_states, n_carry_states = self.predict_network(
    inputs=tf.reshape(new_hyps[0]["yseq"][-1], [1, 1]),  # <-- Take the previously predicted symbol
    p_memory_states=new_hyps[0]["p_memory_states"],
    p_carry_states=new_hyps[0]["p_carry_states"],
    has_memories=new_hyps[0]["has_memories"],
    training=False
)

I think I understand this part so far but:

Q: Why are we not storing the states of the EncoderNetwork like we do for the PredictionNetwork?

If we're streaming, where features are the spectrogram features, wouldn't it make sense to keep the encoder's internal LSTM state(s) as well?

My own implementation of the model is slightly different: my encoder network is a stack of LSTMs, whereas your example uses only one LSTM. But in both cases there are internal states that we're not carrying along in Transducer.recognize, and I'm not sure I understand why that is.

EncoderNetwork code:

```python
class EncoderNetwork(network.Network):
    def __init__(
        self,
        num_layers: int,
        lstm_units: int,
        time_reduction_index: int = None,
        time_reduction_factor: int = 2,
        dropout: float = 0,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.reduction_index = time_reduction_index
        self.reduction_factor = time_reduction_factor
        self.lstm_stack = list()
        for i in range(num_layers):
            lstm = layers.LSTM(
                units=lstm_units,
                return_sequences=True,
                return_state=True,
                dropout=dropout
            )
            norm = layers.LayerNormalization()
            self.lstm_stack.append((lstm, norm))
        if self.reduction_index:
            self.time_reduction = TimeReduction(self.reduction_factor)

    def call(self, inputs, training=None, mask=None):
        x = inputs
        states = None
        for i, (lstm, norm) in enumerate(self.lstm_stack):
            x, state_h, state_c = lstm(x, initial_state=states)
            x = norm(x)
            states = state_h, state_c
            if self.reduction_index and i == self.reduction_index:
                x = self.time_reduction(x)
        return x
```

Shouldn't we keep those states as well? What if I stream the first 2 seconds of an audio and then the next 2 seconds and so on. Shouldn't we keep track of the state for the EncoderNetwork as well in that case?
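For illustration, here is roughly what I have in mind - a minimal sketch of mine (not repo code) of carrying a single encoder LSTM's state across chunks:

```python
import tensorflow as tf

# Hypothetical: three chunks of [1, T_chunk, F] spectrogram features.
feature_chunks = [tf.random.normal([1, 10, 80]) for _ in range(3)]

encoder_lstm = tf.keras.layers.LSTM(320, return_sequences=True, return_state=True)

enc_states = None
for chunk in feature_chunks:
    enc_out, h, c = encoder_lstm(chunk, initial_state=enc_states)
    enc_states = [h, c]  # the next chunk continues where this one ended
```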


The second part concerns the input of the prediction network. I can see that you're prepending the ids with the blank (0) symbol, so [1, 2, 3] becomes [0, 1, 2, 3]. We're then also using Dataset.padded_batch to align examples, and there we're using the same blank symbol for padding. This means a sample could end up looking like [0, 1, 2, 3, 0, 0] - is this correct? One-hot encoded, this would take the form:

[1, 0, 0, 0],
[0, 1, 0, 0],
[0, 0, 1, 0],
[0, 0, 0, 1],
[1, 0, 0, 0],
[1, 0, 0, 0],

I am asking this because in https://arxiv.org/pdf/1211.3711.pdf the blank symbol is actually a vector containing all zeros:

[0, 0, 0],
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[0, 0, 0],
[0, 0, 0],

and I was wondering whether this could make a difference?
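To make the padding part concrete, here is a minimal sketch of the prepend-and-pad pipeline I mean (the names are made up, not from the repo):

```python
import tensorflow as tf

blank = 0
labels = tf.data.Dataset.from_tensor_slices(tf.ragged.constant([[1, 2, 3], [1, 2]]))
# Prepend the blank symbol, then pad the batch with that same blank.
labels = labels.map(lambda y: tf.concat([[blank], y], axis=0))
labels = labels.padded_batch(2, padded_shapes=[None], padding_values=blank)
for batch in labels:
    print(batch.numpy())  # [[0 1 2 3] [0 1 2 0]]
```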


Thank you for shedding any light on this :)

nglehuy commented 4 years ago

Hi @stefan-falk, you are free to ask any question :smile: Here are my answers.

So, if I get this right, what this does is initialize each LSTM with its previous state from the last time-step as we stream-decode. Is that correct?

=> Yes, that's correct. But we don't forward-pass the last state manually, because each LSTM layer already does that for us. We have to keep track of the last state of EACH LSTM layer because the call function of PredictionNetwork doesn't know the states from the previous batch (batch = 1), since the LSTM layers are stateless (each time-step is a batch). The reason we don't use stateful LSTMs (which would be much easier, since the LSTM layers would save their last states themselves) is that TFLite doesn't support stateful LSTMs yet.
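As a quick illustration (my own sketch, not repo code): feeding one time-step at a time while manually carrying the states reproduces the full-sequence output of a stateless LSTM:

```python
import numpy as np
import tensorflow as tf

lstm = tf.keras.layers.LSTM(4, return_sequences=True, return_state=True)
x = np.random.rand(1, 3, 2).astype('float32')

full, _, _ = lstm(x)  # whole sequence in one call

state, steps = None, []
for t in range(3):  # one time-step per call, carrying [h, c] manually
    out, h, c = lstm(x[:, t:t + 1, :], initial_state=state)
    state = [h, c]
    steps.append(out)

np.testing.assert_allclose(full.numpy(), np.concatenate(steps, axis=1), atol=1e-5)
```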

To get it right: I looked at your EncoderNetwork code, and it seems like you pass the last state of the PREVIOUS LSTM layer as the initial state of the NEXT LSTM layer. That doesn't seem right: in a recurrent layer, the state at the current time-step depends on the state at the previous time-step, but the initial state of the NEXT LSTM layer and the last state of the PREVIOUS LSTM layer are independent of each other :thinking:

Q: Why are we not storing the states of the EncoderNetwork like we do for the PredictionNetwork?

=> Because the EncoderNetwork doesn't have to have RNN layers. If you read the newest SOTA paper (Conformer - https://arxiv.org/abs/2005.08100), you will see they use convolution, feed-forward and self-attention layers to replace the recurrent layers in the EncoderNetwork.

=> Yes, it would make sense if you use recurrent layers in the EncoderNetwork and keep track of their last states. You can write your own Transducer and override the recognize method to store those states like the PredictionNetwork does :laughing:

=> But I think it doesn't bring much benefit. If you stream chunks of audio (250ms each), assume you say "hello" but the first chunk only records "he"; when the next chunk "llo" comes in, the PredictionNetwork has already predicted "he" and knows that the last state is "e" => which, sort of, means it knows the last time-step of the previous features was the character "e" => so you don't have to store the EncoderNetwork states, since the PredictionNetwork does that instead :smile: However, we would need to compare the results of storing the EncoderNetwork states against NOT storing them to see whether it has any effect on accuracy.

=> Yes, it would really, really make sense if you want the EncoderNetwork to know the previous audio features, but hey, that's the "customization" :laughing: I only provide general solutions so that people can build their custom models on top of them.

The second part concerns the input of the prediction network. I can see that you're prepending the ids with the blank (0) symbol, so [1, 2, 3] becomes [0, 1, 2, 3]. We're then also using Dataset.padded_batch to align examples, and there we're using the same blank symbol for padding. This means a sample could end up looking like [0, 1, 2, 3, 0, 0] - is this correct?

=> I'm not sure. The warprnnt_tensorflow aka warp-transducer requires the acts to have the shape [B, T, U+1, V], where the +1 comes from prepending the blank (0) symbol, so I prepend blank to the input of the prediction network. I think padding with blank symbols makes sense, because what symbol represents "no audio" better than blank? (certainly not \<space> :laughing:)

=> About the one-hot encoding, I haven't read the paper carefully, so I don't know, but I think warp-transducer does remove the one-hot blank so that it becomes a vector of zeros. If you want to know badly, I suggest you read the warp-transducer code :rofl:

stefan-falk commented 4 years ago

To get it right: I looked at your EncoderNetwork code, and it seems like you pass the last state of the PREVIOUS LSTM layer as the initial state of the NEXT LSTM layer. That doesn't seem right: in a recurrent layer, the state at the current time-step depends on the state at the previous time-step, but the initial state of the NEXT LSTM layer and the last state of the PREVIOUS LSTM layer are independent of each other 🤔

I think you're correct. When stacking LSTMs, each layer should have its own internal state based on the output sequence of the previous LSTM. What I am doing is forwarding the state of each LSTM to the next layer. I think I was thinking of a classic encoder-decoder model like the one below when I wrote that.

encoder_inputs = Input(shape=(None,))
x = Embedding(num_encoder_tokens, latent_dim)(encoder_inputs)
x, state_h, state_c = LSTM(latent_dim, return_state=True)(x)
encoder_states = [state_h, state_c]

decoder_inputs = Input(shape=(None,))
x = Embedding(num_decoder_tokens, latent_dim)(decoder_inputs)
x = LSTM(latent_dim, return_sequences=True)(x, initial_state=encoder_states)
decoder_outputs = Dense(num_decoder_tokens, activation='softmax')(x)

=> Because the EncoderNetwork doesn't have to have RNN layers. If you read the newest SOTA paper (Conformer - https://arxiv.org/abs/2005.08100), you will see they use convolution, feed-forward and self-attention layers to replace the recurrent layers in the EncoderNetwork.

Ah, I see - I assumed it was mainly because we could just replace the encoder with some other architecture, but in the case of a transducer that uses RNNs I was just surprised to see that only the PredictionNetwork uses memory. :)

=> But I think it doesn't bring much benefit. If you stream chunks of audio (250ms each), ...

I guess your argument is right. :)

I think padding with blank symbols makes sense, because what symbol represents "no audio" better than blank?

Yes, I agree on that. The paper states that they've done it that way :)

=> About the one-hot encoding, I haven't read the paper carefully, so I don't know, but I think warp-transducer does remove the one-hot blank so that it becomes a vector of zeros. If you want to know badly, I suggest you read the warp-transducer code 🤣

Indeed, now that you mention it .. There is the blank_label argument for warprnnt_tensorflow.rnnt_loss:

'''Computes the RNNT loss between a sequence of activations and a
ground truth labeling.
Args:
    ...
    blank_label: int, the label value/index that the RNNT
                 calculation should use as the blank label
    ...
'''
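For reference, a minimal sketch of mine for calling it (shapes per the warp-transducer docs; the values are made up):

```python
import tensorflow as tf
from warprnnt_tensorflow import rnnt_loss

B, T, U, V = 1, 4, 3, 5  # batch, time, label length, vocab size (incl. blank)
acts = tf.random.normal([B, T, U + 1, V])  # +1 for the prepended blank
labels = tf.constant([[1, 2, 3]], dtype=tf.int32)
input_lengths = tf.constant([T], dtype=tf.int32)
label_lengths = tf.constant([U], dtype=tf.int32)

# blank_label picks which index the RNNT calculation treats as blank.
loss = rnnt_loss(acts, labels, input_lengths, label_lengths, blank_label=0)
```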

@usimarit thanks a lot for your answer and sharing this repository! :)

stefan-falk commented 4 years ago

I'm going to close the issue 😃 👍

nglehuy commented 4 years ago

No problem, I'm glad that you like this repo :laughing:

stefan-falk commented 4 years ago

@usimarit Hi again! 😄

So.. I've been running a few experiments on my own implementation which is largely inspired by rnnt-speech-recognition and TiramisuASR.

However, it seems that there's either something wrong with the model or the implementation of recognize().

This keeps me up at night, as I am just not able to get anything meaningful from the model, e.g.:

--
Predicted (reco):  der undneunzig des undllung des undsiebzig des undsiebzig des undfünfzig �������artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel 
Target:            hast du eine schöne jacke für circa hundert dollar 
--
Predicted (reco):  der undneunzig des undllung des undsiebzig des undsiebzig des undfünfzig �������artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel artikel 
Target:            ich habe etwas käse mitgebracht
--
... 

My implementation is slightly different: I am using the time-reduction idea from the Google paper and also layer normalization, as the paper suggests. To make sure this is not the problem, I started different experiments with these additional layers disabled, but they all look very much the same:

[image: training-loss plots]

I have checked multiple times, but I do not see a significant difference between your implementation and my own. Or I don't know what I am missing here.

Hence, I'd like to know whether your loss looks similar when you train your model. Mine is not very stable - I assume that's due to the very small batch size (4-6 samples).

Here is the code I am using in case you want to take a look:

Transducer code:

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.python.keras.engine import network
from warprnnt_tensorflow import rnnt_loss


class TimeReduction(tf.keras.layers.Layer):
    def __init__(self, reduction_factor, batch_size=None, **kwargs):
        super(TimeReduction, self).__init__(**kwargs)
        self.reduction_factor = reduction_factor
        self.batch_size = batch_size

    def call(self, inputs, **kwargs):
        input_shape = tf.shape(inputs)
        batch_size = self.batch_size
        if batch_size is None:
            batch_size = input_shape[0]
        max_time = input_shape[1]
        num_units = inputs.get_shape().as_list()[-1]
        outputs = inputs
        paddings = [[0, 0], [0, tf.math.floormod(max_time, self.reduction_factor)], [0, 0]]
        outputs = tf.pad(outputs, paddings)
        return tf.reshape(outputs, (batch_size, -1, num_units * self.reduction_factor))


class PredictionNetwork(network.Network):
    def __init__(
        self,
        vocab_size: int,
        embedding_size: int,
        num_layers: int,
        lstm_units: int,
        dropout: float,
        enable_layer_norm: bool,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.enable_layer_norm = enable_layer_norm
        self.embedding = layers.Embedding(vocab_size, embedding_size, mask_zero=True)  # TODO mask_zero?
        self.lstm_stack = list()
        for _ in range(num_layers):
            lstm = layers.LSTM(
                units=lstm_units,
                return_sequences=True,
                return_state=True,
                dropout=dropout
            )
            # LayerNormalization acc. to https://arxiv.org/pdf/1811.06621.pdf
            norm = layers.LayerNormalization() if self.enable_layer_norm else None
            self.lstm_stack.append((lstm, norm))

    def call(self, inputs, training=None, p_memory_states=None, p_carry_states=None, **kwargs):
        has_memory = p_memory_states is not None and p_carry_states is not None
        x = self.embedding(inputs, training=training)
        n_memory_states = []
        n_carry_states = []
        for i, (lstm, norm) in enumerate(self.lstm_stack):
            initial_state = [p_memory_states[i], p_carry_states[i]] if has_memory else None
            x, state_h, state_c = lstm(x, training=training, initial_state=initial_state)
            n_memory_states.append(state_h)
            n_carry_states.append(state_c)
            if self.enable_layer_norm:
                x = norm(x, training=training)
        return x, n_memory_states, n_carry_states


class EncoderNetwork(network.Network):
    def __init__(
        self,
        num_layers: int,
        lstm_units: int,
        time_reduction_index: [int, None],
        time_reduction_factor: int,
        dropout: float,
        enable_layer_norm: bool,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        assert time_reduction_index is None or time_reduction_index < num_layers, \
            'Error (%d < %d): time_reduction_index must be less than num_layers' % (time_reduction_index, num_layers)
        self.reduction_index = time_reduction_index
        self.reduction_factor = time_reduction_factor
        self.enable_layer_norm = enable_layer_norm
        self.lstm_stack = list()
        for i in range(num_layers):
            lstm = layers.LSTM(
                units=lstm_units,
                return_sequences=True,
                return_state=False,
                dropout=dropout
            )
            # LayerNormalization acc. to https://arxiv.org/pdf/1811.06621.pdf
            norm = layers.LayerNormalization() if self.enable_layer_norm else None
            self.lstm_stack.append((lstm, norm))
        if self.reduction_index:
            self.time_reduction = TimeReduction(self.reduction_factor)

    def call(self, inputs, training=None, mask=None):
        x = inputs
        for i, (lstm, norm) in enumerate(self.lstm_stack):
            x = lstm(x, training=training)
            if self.enable_layer_norm:
                x = norm(x, training=training)
            if self.reduction_index and i == self.reduction_index:
                x = self.time_reduction(x, training=training)
        return x


class JoinNetwork(network.Network):
    def __init__(self, vocab_size: int, join_size: int, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.encoder_projection = layers.Dense(join_size)
        self.predict_projection = layers.Dense(join_size)
        # self.join_dense = layers.Dense(join_size, activation='relu')
        self.output_dense = layers.Dense(vocab_size)

    def call(self, inputs, training=None, mask=None):
        encoder_outputs, predict_outputs = inputs
        # [B, T, E] => [B, T, J]
        encoder_outputs = self.encoder_projection(encoder_outputs, training=training)
        # [B, U, P] => [B, U, J]
        predict_outputs = self.predict_projection(predict_outputs, training=training)
        x = (
            # [B, T, J] => [B, T, 1, J]
            tf.expand_dims(encoder_outputs, axis=2) +
            # [B, U, J] => [B, 1, U, J]
            tf.expand_dims(predict_outputs, axis=1)
        )
        # x = self.join_dense(x, training=training)
        x = tf.nn.tanh(x)
        # [B, T, U, J] => [B, T, U, V]
        x = self.output_dense(x, training=training)
        return x


class Transducer(keras.Model):
    def __init__(
        self,
        vocab_size: int,
        embedding_size: int,
        encoder_num_layers: int,
        encoder_hidden_size: int,
        predict_num_layers: int,
        predict_lstm_units: int,
        join_size: int,
        encoder_time_reduction_index,
        encoder_time_reduction_factor,
        dropout,
        enable_layer_norm: bool,
        *args,
        **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.vocab_size = vocab_size
        self.encoder_time_reduction_factor = encoder_time_reduction_factor
        # TODO The encoder network can be any other seq2seq-architecture (https://arxiv.org/abs/2005.08100)
        # TODO Make the encoder_network an argument of __init__ but make sure this works with get_config()
        self.encoder_network = EncoderNetwork(
            num_layers=encoder_num_layers,
            lstm_units=encoder_hidden_size,
            time_reduction_index=encoder_time_reduction_index,
            time_reduction_factor=encoder_time_reduction_factor,
            dropout=dropout,
            enable_layer_norm=enable_layer_norm
        )
        self.predict_network = PredictionNetwork(
            vocab_size=vocab_size,
            embedding_size=embedding_size,
            num_layers=predict_num_layers,
            lstm_units=predict_lstm_units,
            dropout=dropout,
            enable_layer_norm=enable_layer_norm,
        )
        self.join_network = JoinNetwork(
            vocab_size=vocab_size,
            join_size=join_size
        )
        self.train_loss = keras.metrics.Mean()
        self.valid_loss = keras.metrics.Mean()

    def call(self, inputs, training=None, mask=None):
        encoder_inputs, prediction_inputs = inputs
        encoder_outputs = self.encoder_network(encoder_inputs, training=training)
        predict_outputs, _, _ = self.predict_network(prediction_inputs, training=training)
        return self.join_network((encoder_outputs, predict_outputs), training=training)

    def train_step(self, data):
        (inputs, input_lengths, label_lengths), y_true = data
        loss, gradients = rnnt_gradient(
            model=self,
            inputs=inputs,
            y_true=y_true,
            input_lengths=input_lengths,
            label_lengths=label_lengths,
            time_reduction_factor=self.encoder_time_reduction_factor
        )
        self.optimizer.apply_gradients(zip(gradients, self.trainable_variables))
        self.train_loss.update_state(loss)
        return {'loss': loss}

    def test_step(self, data):
        (inputs, input_lengths, label_lengths), y_true = data
        y_pred = self(inputs, training=False)
        losses = rnnt_loss(
            acts=y_pred,
            labels=y_true,
            input_lengths=input_lengths,
            label_lengths=label_lengths
        )
        loss = tf.reduce_mean(losses)
        self.valid_loss.update_state(loss)
        return {'loss': loss}

    def recognize(self, features, kept_hyps, streaming=False, blank=0):
        def perform_greedy_(sample):
            return self.perform_greedy(tf.expand_dims(sample, 0), kept_hyps, streaming, blank=blank)
        return tf.map_fn(perform_greedy_, features, dtype=tf.int32)

    def perform_greedy(self, features, kept_hyps, streaming, blank=0):
        if kept_hyps is None or not streaming:
            kept_hyps = [
                {
                    'score': tf.constant(0.0),
                    'yseq': [blank],
                    'p_memory_states': None,
                    'p_carry_states': None,
                    'has_memories': False
                }
            ]
        # [T, E]
        encoder_outputs = tf.squeeze(self.encoder_network(features, training=False), axis=0)
        new_hyps = kept_hyps
        for i in range(shape_list(encoder_outputs)[0]):
            # [1, 1, E]
            encoder_output = tf.reshape(encoder_outputs[i], [1, 1, -1])
            predict_output, n_memory_states, n_carry_states = self.predict_network(
                inputs=tf.reshape(new_hyps[0]['yseq'][-1], [1, 1]),
                p_memory_states=new_hyps[0]['p_memory_states'],
                p_carry_states=new_hyps[0]['p_carry_states'],
                training=False
            )
            # join([1, 1, E], [1, 1, P]) => [1, 1, 1, V]
            join_output = tf.nn.log_softmax(self.join_network([encoder_output, predict_output], training=False))
            # [1, 1, V] => [V]
            join_output = tf.squeeze(join_output)
            # Get predicted ID
            predicted_id = tf.argmax(join_output, axis=0, output_type=tf.int32)
            hyps = [
                {
                    'score': new_hyps[0]['score'] + join_output[predicted_id],
                    'yseq': new_hyps[0]['yseq'] + [predicted_id],
                    'p_memory_states': n_memory_states,
                    'p_carry_states': n_carry_states,
                    'has_memories': True
                }
            ]
            new_hyps = hyps
        return tf.convert_to_tensor(new_hyps[0]['yseq'], dtype=tf.int32)

    @classmethod
    def from_config(cls, config, custom_objects=None):
        return cls(**config)


def rnnt_gradient(
    model: Transducer,
    inputs: tuple,
    y_true: tf.Tensor,
    input_lengths: tf.Tensor,
    label_lengths: tf.Tensor,
    time_reduction_factor: [int, None],
):
    """Calculate the gradient for a Transducer training-batch of a given Transducer-model.

    :param model:
    :param inputs:
    :param y_true:
    :param input_lengths:
    :param label_lengths:
    :param time_reduction_factor:
    :return: A tuple (loss, gradients).
    """
    with tf.GradientTape() as tape:
        y_pred = model(inputs, training=True)
        if not tf.test.is_built_with_cuda():
            # If rnnt_loss() was not compiled with GPU support log_softmax() has to be called on logits (see docs).
            y_pred = tf.nn.log_softmax(y_pred)
        if time_reduction_factor:
            input_lengths = tf.cast(tf.math.ceil(input_lengths / time_reduction_factor), dtype=tf.int32)
        losses = rnnt_loss(
            acts=y_pred,
            labels=y_true,
            input_lengths=input_lengths,
            label_lengths=label_lengths
        )
        loss = tf.reduce_mean(losses)
    return loss, tape.gradient(loss, model.trainable_variables)


def shape_list(x):
    """Deal with dynamic shape in tensorflow cleanly."""
    static = x.shape.as_list()
    dynamic = tf.shape(x)
    return [dynamic[i] if s is None else s for i, s in enumerate(static)]
```

Usage:

```python
def main():
    vocab_size = 50
    embedding_size = 16
    transducer = Transducer(
        vocab_size=vocab_size,
        embedding_size=embedding_size,
        encoder_num_layers=4,
        encoder_hidden_size=20,
        predict_num_layers=2,
        predict_lstm_units=100,
        join_size=50,
        encoder_time_reduction_index=1,
        encoder_time_reduction_factor=2,
        dropout=0.5,
        enable_layer_norm=True
    )

    import numpy as np

    batch_size = 1
    enc_inputs = np.random.rand(batch_size, 30, 80)
    pre_inputs = np.asarray([[1, 2, 3, 0] for _ in range(batch_size)])
    inputs = enc_inputs, pre_inputs

    pred = transducer(inputs)
    print(pred)


if __name__ == '__main__':
    main()
```

Thank you for any insight! 😄

nglehuy commented 4 years ago

Hi @stefan-falk

In my experience, for the model to converge, the mean value of a loss function like ctc_loss or rnnt_loss over about 200 batches must drop below 30. Your loss value is still high, so I guess the model has not converged yet. I have trained a transducer for Vietnamese and the loss went down to ~11 for val_loss and ~7 for train_loss. I haven't tested it yet :laughing: but I don't think recognize() is the issue, because everything looks very logical.

This is my current log of the conformer:

[image: conformer loss curve]

I don't think the time reduction, the layer norm, or your implementation is the problem.

Did you load the trained weights? In the usage code you gave me, it doesn't look like you load the trained weights :laughing:

So my thought is either the model hasn't converged or the weights weren't loaded.

nglehuy commented 4 years ago

I just tested it, and it seems like there is a problem with recognize(). I'm trying to find it :sob: The recognize is inspired by https://github.com/espnet/espnet/blob/master/espnet/nets/pytorch_backend/transducer/rnn_decoders.py but I can't find any issue :sob:

stefan-falk commented 4 years ago

@usimarit Oh, alright, so maybe it might not be the model?

Okay, in that case I'll take a closer look as soon as I have the time. Unfortunately, I'm busy over the weekend, but I will try to find out what the issue is and let you know if I find something!

Just on the side: do you know of an implementation that can decode the entire model output? I just need to test whether the model is able to predict anything at all. Algorithm 1 in https://arxiv.org/pdf/1211.3711.pdf should do the job. If not, I'll implement it after the weekend. 😄

BR;

nglehuy commented 4 years ago

@stefan-falk Algorithm 1 is beam search, which I implemented in recognize_beam(), but unfortunately it runs really slowly. We should find a faster beam-search algorithm for the Transducer.

nglehuy commented 4 years ago

@stefan-falk I just found out that using mask_zero=True in the Embedding like you did makes sense and works (well, I think; I need to test more). The final step is to merge the repeated characters (which I haven't implemented yet).
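Something like this CTC-style merge is what I mean - just a sketch of mine, assuming blank is 0:

```python
import tensorflow as tf

def merge_repeated(ids, blank=0):
    """Drop consecutive duplicates, then strip the blank symbol."""
    ids = tf.convert_to_tensor(ids)
    keep = tf.concat([[True], ids[1:] != ids[:-1]], axis=0)
    ids = tf.boolean_mask(ids, keep)
    return tf.boolean_mask(ids, ids != blank)

print(merge_repeated([0, 1, 1, 2, 0, 2, 3, 3]).numpy())  # [1 2 2 3]
```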

nglehuy commented 3 years ago

Hi @stefan-falk

Good news: I found the issue. It is in recognize().

The issue is that I was wrong: at each time-step, ONLY a NON-BLANK predicted character is accepted (if the prediction is blank, we keep the previous hyp). After I fixed that, the model predicts quite well. Here is an example of mine in Vietnamese:

[image: Screenshot from 2020-07-26 23-43-38 - example Vietnamese predictions]

The code is updated on master (I dropped tf.py_function and switched to tf.while_loop so the model can be converted to TFLite).

If you find the tf.while_loop difficult to follow, here is the code from the espnet repo; note the pred != self.blank condition:

def recognize(self, h, recog_args):
    """Greedy search implementation.

    Args:
        h (torch.Tensor): encoder hidden state sequences (Tmax, Henc)
        recog_args (Namespace): argument Namespace containing options

    Returns:
        hyp (list of dicts): 1-best decoding results
    """
    z_list, c_list = self.zero_state(h.unsqueeze(0))
    ey = to_device(self, torch.zeros((1, self.embed_dim)))

    hyp = {"score": 0.0, "yseq": [self.blank]}

    y, (z_list, c_list) = self.rnn_forward(ey, (z_list, c_list))

    for hi in h:
        ytu = F.log_softmax(self.joint(hi, y[0]), dim=0)
        logp, pred = torch.max(ytu, dim=0)

        if pred != self.blank:
            hyp["yseq"].append(int(pred))
            hyp["score"] += float(logp)

            eys = to_device(
                self, torch.full((1, 1), hyp["yseq"][-1], dtype=torch.long)
            )
            ey = self.embed(eys)

            y, (z_list, c_list) = self.rnn_forward(ey[0], (z_list, c_list))

    return [hyp]

stefan-falk commented 3 years ago

@usimarit Great work!

I wasn't 100% sure whether mask_zero makes sense, but since my model didn't work, everything came under suspicion of breaking things! 😆

Note: I think setting it to True also means that the blank label has to have the value 0, so one should probably guarantee this to avoid a bug?
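A tiny check of that behaviour (my own sketch): with mask_zero=True, the Embedding masks exactly index 0, so blank and padding must share that index:

```python
import tensorflow as tf

emb = tf.keras.layers.Embedding(input_dim=5, output_dim=3, mask_zero=True)
ids = tf.constant([[0, 1, 2, 0]])  # index 0 acts as blank *and* padding
print(emb.compute_mask(ids).numpy())  # [[False  True  True False]]
```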

It seems, however, that there's still a problem with my own implementation. 😞 I've been using the tf.while_loop code, and it does work as such, but my model tends to predict garbage. Also, the predicted text always contains similar words:

-
Predicted (reco):  der mitte des vierzehnkeit jahrhunderte das entspfrei von den empfindete 
Target:            hast du eine schöne jacke für circa hundert dollar 

-
Predicted (reco):  der mitte des siebte jahrhunderts sante die bürgerzone renreich 
Target:            ich habe etwas käse mitgebracht 

What batch size are you using? Mine is rather small (2-4 samples), and the loss does not seem to drop that low after just 200 batches (which means 200 to 800 examples in my case).

Are you plotting the average loss over time? My plot shows the average loss for each batch. I am not averaging over time, hence it's not that smooth.

My loss looks like this after training over the weekend:

[image: loss curve after training over the weekend]

However, I don't really think that's the problem.

I think I'll have to continue to test before I get my cake 🍰

nglehuy commented 3 years ago

@stefan-falk I'm using batch size 4. The conformer was trained on Google Colab for 22 hours, on a dataset whose audio clips are 1-10 seconds long.

In my experience, for the model to converge, the mean value of a loss function like ctc_loss or rnnt_loss over about 200 batches must drop below 30.

=> What I mean is that I only plot the average loss over about 200 batches (before computing batch 201, the tf.keras.metrics.Mean is reset to 0), and the value of 30 is not for the "first" 200 batches but is the loss value after something like 15 epochs (the loss values depend on the data). In general, the final loss value should be < 12 for rnnt_loss :smile: in my case. You should create a "validation" dataset, because it makes it much easier to see whether the model has converged. When the model converges, the "validation loss" and "train loss" curves will meet, and after that the validation loss will rise due to overfitting. If you use only training data, the loss logs have little meaning (in my opinion, the purpose of logging losses is to see whether the model has converged, and that requires validation data).
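In code, that logging scheme looks roughly like this (a sketch; `dataset` and `train_step` are placeholders, not from the repo):

```python
import tensorflow as tf

mean_loss = tf.keras.metrics.Mean()

for step, batch in enumerate(dataset):      # placeholder dataset
    loss = train_step(batch)                # placeholder step returning the batch loss
    mean_loss.update_state(loss)
    if (step + 1) % 200 == 0:
        tf.print('mean loss of last 200 batches:', mean_loss.result())
        mean_loss.reset_states()            # reset before the next 200-batch window
```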

Note: I think setting it to True also means that the blank label has to have the value 0, so one should probably guarantee this to avoid a bug?

=> I've tested the mask option and the no-mask option in the Embedding layer. These are the results using recognize():

|         | WER (%)    | CER (%)    |
|---------|------------|------------|
| No mask | 39.8549805 | 21.5941677 |
| Masked  | 35.3230591 | 18.7023373 |

=> It seems like the mask option brings better results. However, for the "no mask" run I reused the model trained WITH the mask option: I disabled the mask and loaded the weights of the model that had been trained with masking.

=> The trained masked model learned to ignore the prepended blank index, whereas the new "no mask" model reusing those weights does not ignore it.

=> Therefore, I think if you use the mask option, the blank must always be 0. And if you want to use another blank index, don't use the mask option and let the model learn the prepended blank.

I think you should check these:

  1. Make sure the dataset is correct, i.e. the audio files match the labels (dumb question, but it happens sometimes :laughing: )
  2. vocab_size must include the blank
  3. embedding_size is usually greater than vocab_size (e.g. 256)
  4. Make sure you give the correct number of characters, and the labels should be .lower()
  5. Make sure the char-to-index preprocessing and the index-to-char postprocessing are consistent (see the sketch below)
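For point 5, a quick round-trip check could look like this (a sketch with a made-up character table):

```python
vocab = ['<blank>'] + sorted(set('abcdefghijklmnopqrstuvwxyzäöü '))
char_to_idx = {c: i for i, c in enumerate(vocab)}
idx_to_char = {i: c for i, c in enumerate(vocab)}

text = 'hast du eine schöne jacke'.lower()
encoded = [char_to_idx[c] for c in text]          # a KeyError here means a missing character
decoded = ''.join(idx_to_char[i] for i in encoded)
assert decoded == text, 'char-to-index and index-to-char are inconsistent'
```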

Your implementation still seems fine to me, and I have no idea how you got those results :disappointed:

The best advice I can give is to spend some time checking the whole pipeline again: data, audio preprocessing, text preprocessing, models, recognize, and the index-to-character postprocessing.

stefan-falk commented 3 years ago

=> What I mean is that I only plot the average loss over about 200 batches (before computing batch 201, the tf.keras.metrics.Mean is reset to 0), and the value of 30 is not for the "first" 200 batches but is the loss value after something like 15 epochs (the loss values depend on the data). In general, the final loss value should be < 12 for rnnt_loss 😄 in my case.

Yes, I thought you were probably plotting some average :) Thank you for those numbers - they give me some orientation as to where I should land 🤣

You should create a "validation" dataset, because it makes it much easier to see whether the model has converged. When the model converges, the "validation loss" and "train loss" curves will meet, and after that the validation loss will rise due to overfitting.

I have a validation dataset, but I disabled it during development. I'll start another training and see what the validation loss looks like. Last time I checked, it was "fine" in the sense that it converged like the training loss.

=> I've tested the mask option and the no-mask option in the Embedding layer. These are the results using recognize():

Alright, mask_zero it is! It really seems to improve the model.

=> Therefore, I think if you use the mask option, the blank must always be 0

Agree. :)

I think you should check these: ..

I am afraid you are right. I'll have to go through all these things and check whether they work. I am convinced that they should work, but that's obviously not the case.

I just have one question regarding

embedding_size is usually greater than vocab_size (e.g. 256)

So... the entire time I was (re)using a dataset which already comes with a vocabulary - its vocab size is around 5000 subwords. Do you think that could be the problem? As far as I know, it should work with a larger vocabulary (subwords instead of characters) as well, but maybe that's the problem?

Your implementation still seems fine to me, and I have no idea how you got those results 😞

Thank you for taking a look! 👍

stefan-falk commented 3 years ago

Small update:

This is another training I started overnight. I have increased the model size, which seems to help:

[image: training-loss plot]

The loss is not dropping as fast as yours, but it looks to me like the model is converging as it should.

However, the issue is still there when I evaluate the model in a separate script. So the issue might be in that script, but that just doesn't make any sense, because in that script I am only loading the data and sending it to the model 🤷‍♂️ What am I doing wrong? 🤣

nglehuy commented 3 years ago

@stefan-falk I used a learning rate schedule; that's why my loss decreased so fast :laughing:

I think the model has converged by now, so the issue lies either in recognize() or in the test data not matching between audio and labels. I'm pretty sure you got the test data right. Maybe you can reuse my new recognize() to see if it solves the problem, because the recognize_beam() I implemented using for and dict like that still produces repeated characters.

stefan-falk commented 3 years ago

@usimarit I think I have a candidate for the issue.

So... I've been re-using a dataset the whole time, and with it comes some preprocessing logic that is responsible for computing the MFCC features from the audio. I ported that code from TF1, but I didn't think about one last step inside that routine: applying convolutions to the input.

I think what happens is that I apply this preprocessing logic during training, but the convolutions I mentioned are not part of the model and hence never get loaded and/or applied:

# apply_convolutions()
mel_fbanks.set_shape([None, None, num_mel_bins, num_channels])
mel_fbanks = tf.pad(mel_fbanks, [[0, 0], [0, 8], [0, 0], [0, 0]])

for _ in range(2):
    mel_fbanks = tf.compat.v1.layers.conv2d(mel_fbanks, 128, (3, 3), (2, 2), use_bias=False)
    mel_fbanks = layer_norm(mel_fbanks)
    mel_fbanks = tf.nn.relu(mel_fbanks)

mel_fbanks_shape = mel_fbanks.get_shape().as_list()

# Apply a convolution that will remove all frequencies and at the same time
# project the output into desired hidden_size
mel_fbanks = tf.pad(mel_fbanks, [[0, 0], [0, 2], [0, 0], [0, 0]])
mel_fbanks = tf.compat.v1.layers.conv2d(mel_fbanks, hidden_size, (3, mel_fbanks_shape[2]))

assert mel_fbanks.get_shape().as_list()[2] == 1
mel_fbanks = layer_norm(mel_fbanks)
mel_fbanks = tf.nn.relu(mel_fbanks)

So, I think those layers are just "not there" when I try to evaluate the model. 🤦

I'll try to move this part inside the model as an additional layer of the EncoderNetwork, retrain the whole thing, and hope that this is indeed the issue.
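Something along these lines - an untested sketch of my plan that just mirrors the TF1 code above as a Keras layer:

```python
import tensorflow as tf
from tensorflow.keras import layers

class ConvFrontend(layers.Layer):
    """Sketch: the conv preprocessing from above as a proper layer, so its
    weights are trained, saved and loaded together with the model."""

    def __init__(self, hidden_size: int, **kwargs):
        super().__init__(**kwargs)
        self.hidden_size = hidden_size
        self.convs = [layers.Conv2D(128, (3, 3), strides=(2, 2), use_bias=False) for _ in range(2)]
        self.norms = [layers.LayerNormalization() for _ in range(3)]

    def build(self, input_shape):
        f = input_shape[2]  # number of mel bins (assumed to be known statically)
        for _ in range(2):  # each valid-padding (3, 3)/stride-2 conv shrinks the freq axis
            f = (f - 3) // 2 + 1
        # This conv spans all remaining frequency bins, collapsing them to one,
        # and projects the output into hidden_size.
        self.proj = layers.Conv2D(self.hidden_size, (3, f))

    def call(self, mel_fbanks, training=None):
        x = tf.pad(mel_fbanks, [[0, 0], [0, 8], [0, 0], [0, 0]])
        for conv, norm in zip(self.convs, self.norms[:2]):
            x = tf.nn.relu(norm(conv(x)))
        x = tf.pad(x, [[0, 0], [0, 2], [0, 0], [0, 0]])
        x = tf.nn.relu(self.norms[2](self.proj(x)))
        return tf.squeeze(x, axis=2)  # [B, T', 1, H] => [B, T', H]
```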

stefan-falk commented 3 years ago

@usimarit As expected: this was indeed causing the issue ^^ The model finally works and produces something useful. Just wanted to let you know and say thanks again for your support! 😄

nglehuy commented 3 years ago

@stefan-falk No problem :laughing: I'm gonna close the issue here :+1: