NVIDIA / OpenSeq2Seq

Toolkit for efficient experimentation with Speech Recognition, Text2Speech and NLP
https://nvidia.github.io/OpenSeq2Seq
Apache License 2.0

Definition of greedy decoder #390

Closed: inchpunch closed this issue 5 years ago

inchpunch commented 5 years ago

I am wondering where I can find the definition of the greedy decoder. I saw on

https://nvidia.github.io/OpenSeq2Seq/html/speech-recognition.html#speech-recognition

it says "WER is the word error rate obtained on a dev-clean subset of LibriSpeech using greedy decoder (decoder_params/use_language_model = False). "

For the reference WERs on that page, aside from not using the language model, does the greedy decoder use the beam width of 512 that is specified in the configuration files linked there?
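
Concretely, I read the greedy setting as something like the snippet below in decoder_params (just a sketch based on the docs quote and the linked config; the real files have more keys):

"decoder_params": {
    # per the docs quote above: greedy decoding, no language model rescoring
    "use_language_model": False,
    # from the linked config; does the greedy decoder look at this at all?
    "beam_width": 512,
},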

Thanks a lot!

bill-kalog commented 5 years ago

Someone from the team maintaining OpenSeq2Seq would probably be better placed to answer, since they are making changes at the moment to also extract timesteps from the decoding, for example here: https://github.com/NVIDIA/OpenSeq2Seq/blob/1a58b2a563ce46a3ed87075cfed6bd6e008e49c8/open_seq2seq/utils/ctc_decoder.py

As for your question, if I am not wrong, you can see which decoders are used in this file: https://github.com/NVIDIA/OpenSeq2Seq/blob/1a58b2a563ce46a3ed87075cfed6bd6e008e49c8/open_seq2seq/decoders/fc_decoders.py#L247 That link points to the greedy decoder; a few lines up you can see the language model decoder, which calls the old lm_decoder that Mozilla DeepSpeech used to have in their repo. You can also see there that beam_width is only used when the LM decoder is called.

If you want to do plain beam search (i.e. not greedy, but still a TensorFlow op and no language model), you can replace tf.nn.ctc_greedy_decoder with TensorFlow's tf.nn.ctc_beam_search_decoder and pass it your beam_width as an argument, as in the sketch below.
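
A minimal standalone sketch of the two TF ops side by side (assuming TensorFlow 1.x, as OpenSeq2Seq uses; the shapes, the random logits and the beam_width value are made up for illustration):

import numpy as np
import tensorflow as tf

# Toy CTC logits: [max_time, batch_size, num_classes], blank = num_classes - 1.
max_time, batch_size, num_classes = 50, 4, 29
logits = tf.constant(
    np.random.randn(max_time, batch_size, num_classes).astype(np.float32))
seq_len = tf.constant([max_time] * batch_size, dtype=tf.int32)

# Greedy decoding: per-frame argmax followed by the CTC collapse.
# No beam is involved, so a beam_width setting has no effect on this path.
greedy_decoded, _ = tf.nn.ctc_greedy_decoder(logits, seq_len)

# Plain beam search (no language model): beam_width is an argument of the op.
beam_decoded, _ = tf.nn.ctc_beam_search_decoder(
    logits, seq_len, beam_width=128, top_paths=1, merge_repeated=False)

with tf.Session() as sess:
    greedy_paths, beam_paths = sess.run([greedy_decoded, beam_decoded])
    print(greedy_paths[0].values, beam_paths[0].values)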

inchpunch commented 5 years ago

Thanks for the information. I tried to do so, but the WER gets much worse when using ctc_beam_search_decoder with beam width = 1. The code that I modified in fc_decoders.py is (starting at line 240):

else:
  def decode_without_lm(logits, decoder_input, merge_repeated=True):
    if logits.dtype.base_dtype != tf.float32:
      logits = tf.cast(logits, tf.float32)
    # decoded, neg_sum_logits = tf.nn.ctc_greedy_decoder(
        # logits, decoder_input['encoder_output']['src_length'],
        # merge_repeated,
    # )
    decoded, neg_sum_logits = tf.nn.ctc_beam_search_decoder(
        logits, decoder_input['encoder_output']['src_length'],
        self.params['beam_width'], 1, merge_repeated,
    )
    return decoded

and in the configuration file, in base_params, I set:

"decoder_params": {

    # params for decoding the sequence with language model
    "beam_width": 1,

Is there anything else that I missed?

bill-kalog commented 5 years ago

Can you try with merge_repeated=False instead? merge_repeated seems to have slightly different functionality between the two functions.
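
Roughly, as I understand it from the TF documentation, merge_repeated=True in tf.nn.ctc_beam_search_decoder additionally merges repeated labels in the returned beam, so genuinely repeated labels can get collapsed away, whereas in tf.nn.ctc_greedy_decoder it is just the usual per-frame collapse before blanks are removed. A toy sketch to illustrate (TensorFlow 1.x assumed; the logits are made up so the frames read "label 0, blank, label 0", i.e. the reference output is two repeated labels):

import numpy as np
import tensorflow as tf

# 3 frames, 1 utterance, 3 classes (labels 0 and 1, blank = index 2).
# The frames strongly predict: label 0, blank, label 0 -> expected output [0, 0].
frame_logits = np.array([[[10., 0., 0.]],
                         [[0., 0., 10.]],
                         [[10., 0., 0.]]], dtype=np.float32)
logits = tf.constant(frame_logits)
seq_len = tf.constant([3], dtype=tf.int32)

greedy, _ = tf.nn.ctc_greedy_decoder(logits, seq_len, merge_repeated=True)
beam_merge, _ = tf.nn.ctc_beam_search_decoder(
    logits, seq_len, beam_width=4, top_paths=1, merge_repeated=True)
beam_no_merge, _ = tf.nn.ctc_beam_search_decoder(
    logits, seq_len, beam_width=4, top_paths=1, merge_repeated=False)

with tf.Session() as sess:
    g, bm, bnm = sess.run([greedy[0], beam_merge[0], beam_no_merge[0]])
    print("greedy:                 ", g.values)    # expected: [0 0]
    print("beam, merge_repeated=T: ", bm.values)   # expected: [0] (repeat collapsed)
    print("beam, merge_repeated=F: ", bnm.values)  # expected: [0 0]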

inchpunch commented 5 years ago

Yes, it works now. Thanks a lot.