amaas / stanford-ctc

Neural net code for lexicon-free speech recognition with connectionist temporal classification
Apache License 2.0

What's the logic of performing normalization at each time step in CTC? #8

Open ghost opened 8 years ago

ghost commented 8 years ago

Hi, I noticed that in ctc.py, when computing the forward/backward variables, a normalization is performed at each time step (e.g., lines 32 and 54). I can't figure out the logic behind it. Can you explain what the forward variable alphas[s,t] means after the normalization? And if I want to compute the conditional probability p(seq|params), what is the equivalent? I believe this probability can no longer be calculated as alphas[L-1, T-1] + alphas[L-2, T-1] after the normalization.

zxie commented 8 years ago

Sorry for the delay, replying just in case.

As you may have already seen, the normalization is just to prevent underflow as described right before Section 4.2 in http://www.cs.toronto.edu/~graves/icml_2006.pdf

The log conditional probability can still be computed as described in that section (lines 55 and 83 in code).
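To make the relationship concrete, here is a rough sketch of a rescaled forward pass (not the exact code in ctc.py; the function name, the full-column normalization, and the variable names are just for illustration). The alphas are renormalized to sum to 1 at every frame, and the log conditional probability is recovered from the accumulated log normalization constants plus the log of the rescaled tail sum at the final frame:

```python
import numpy as np

def ctc_forward_rescaled(probs, seq, blank=0):
    """Rescaled CTC forward pass (sketch, not the repo's exact ctc.py).

    probs: (num_labels, T) array of per-frame softmax outputs.
    seq:   target labels without blanks (assumed non-empty).
    Returns log p(seq | probs).
    """
    T = probs.shape[1]
    # Expand the target with blanks: b, l1, b, l2, ..., b
    lab = [blank]
    for l in seq:
        lab.extend([l, blank])
    L = len(lab)

    alphas = np.zeros((L, T))
    # Paths may start with the initial blank or the first label.
    alphas[0, 0] = probs[blank, 0]
    alphas[1, 0] = probs[lab[1], 0]
    c = alphas[:, 0].sum()            # normalization constant for t = 0
    alphas[:, 0] /= c
    log_like = np.log(c)              # accumulate log p via the constants

    for t in range(1, T):
        for s in range(L):
            a = alphas[s, t - 1]
            if s > 0:
                a += alphas[s - 1, t - 1]
            # Skip transition allowed unless the label is blank or repeated.
            if s > 1 and lab[s] != blank and lab[s] != lab[s - 2]:
                a += alphas[s - 2, t - 1]
            alphas[s, t] = a * probs[lab[s], t]
        c = alphas[:, t].sum()        # rescale so the column sums to 1
        alphas[:, t] /= c
        log_like += np.log(c)

    # After rescaling, alphas[-1, -1] + alphas[-2, -1] is no longer p(seq);
    # the probability mass has moved into the accumulated constants:
    #   log p(seq) = sum_t log(c_t) + log(alphas[-1, -1] + alphas[-2, -1])
    return log_like + np.log(alphas[-1, -1] + alphas[-2, -1])
```

If I read ctc.py correctly, the recursion and the normalization there are restricted to the window of positions that can still reach the end of the label sequence, so the sum of the log constants alone already gives the log likelihood, which is the quantity accumulated at lines 55 and 83.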

ghost commented 8 years ago

@zxie :+1: Thank you. Following the paper you mentioned, I found the derivation in "A tutorial on hidden Markov models and selected applications in speech recognition" by Rabiner, 1989. The reason I asked is that in Alex Graves's Ph.D. dissertation, Section 7.3.1, he mentions: "A good way to avoid this is to work in the log scale... Note that rescaling the variables at every timestep (Rabiner, 1989) is less robust, and can fail for very long sequences." I've tested several CTC implementations: the one by @skaae (https://github.com/skaae/Lasagne-CTC), the one by @mohammadpz (https://github.com/mohammadpz/CTC-Connectionist-Temporal-Classification), my own, and yours. It turns out your implementation produces the best estimate of the probability p(l|x), even though the other three all work in the log scale. Do you have any comment on this?
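For reference, the log-scale approach I'm referring to looks roughly like this (a sketch with made-up names, not taken from any of the implementations above), where a log-sum-exp replaces the sums over allowed transitions:

```python
import numpy as np

def logsumexp(*xs):
    """Numerically stable log(exp(x0) + exp(x1) + ...)."""
    m = max(xs)
    if m == -np.inf:
        return -np.inf
    return m + np.log(sum(np.exp(x - m) for x in xs))

def ctc_forward_logspace(log_probs, seq, blank=0):
    """Log-domain CTC forward pass (sketch).

    log_probs: (num_labels, T) array of per-frame log softmax outputs.
    seq:       target labels without blanks (assumed non-empty).
    Returns log p(seq | log_probs).
    """
    T = log_probs.shape[1]
    lab = [blank]
    for l in seq:
        lab.extend([l, blank])
    L = len(lab)

    log_alphas = np.full((L, T), -np.inf)
    log_alphas[0, 0] = log_probs[blank, 0]
    log_alphas[1, 0] = log_probs[lab[1], 0]

    for t in range(1, T):
        for s in range(L):
            terms = [log_alphas[s, t - 1]]
            if s > 0:
                terms.append(log_alphas[s - 1, t - 1])
            if s > 1 and lab[s] != blank and lab[s] != lab[s - 2]:
                terms.append(log_alphas[s - 2, t - 1])
            log_alphas[s, t] = logsumexp(*terms) + log_probs[lab[s], t]

    # Valid alignments must end in the final label or the trailing blank.
    return logsumexp(log_alphas[L - 1, T - 1], log_alphas[L - 2, T - 1])
```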