New force-alignment API and two-pass alignment to get phone/state durations

cmusphinx / pocketsphinx


New force-alignment API and two-pass alignment to get phone/state durations #300

Closed: dhdaines closed this 1 year ago

dhdaines commented 1 year ago

Now you can (relatively) easily do a second pass of alignment to get phone durations after decoding or word alignment.
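A rough sketch of what the two passes might look like from Python. It assumes the decoder object exposes `set_align_text`, `set_alignment`, `get_alignment`, `start_utt`, `process_raw(..., full_utt=True)` and `end_utt`, and that alignment entries carry `name`, `start` and `duration` and can be iterated to reach their phones. The helper names `run_pass` and `two_pass_align` are made up for illustration.

```python
def run_pass(decoder, audio):
    """Feed one complete utterance through the decoder (assumed API)."""
    decoder.start_utt()
    decoder.process_raw(audio, full_utt=True)
    decoder.end_utt()


def two_pass_align(decoder, audio, text):
    """Return (word, phone, start_frame, duration_frames) tuples.

    Hypothetical helper: pass 1 aligns words to the audio, pass 2
    re-runs the same audio to get phone-level segments.
    """
    # Pass 1: word-level alignment of the known transcript.
    decoder.set_align_text(text)
    run_pass(decoder, audio)

    # Pass 2: switch to phone/state alignment and decode the audio again.
    decoder.set_alignment()
    run_pass(decoder, audio)

    rows = []
    for word in decoder.get_alignment():
        for phone in word:  # iterating a word is assumed to yield its phones
            rows.append((word.name, phone.name, phone.start, phone.duration))
    return rows
```

With the real library, `decoder` would presumably be a `pocketsphinx.Decoder` configured without a language model, and `audio` would be raw 16-bit PCM at the acoustic model's sampling rate; both are assumptions here, not something this thread spells out.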

Also, word alignment now uses FSG search, like SoundSwallower, so it's really fast and also handles silence and alternate pronunciations for you.

lenzo-ka commented 1 year ago

Excited to check this out! I'm at Interspeech and out of phase by half a day, but I'll take a look shortly.

dhdaines commented 1 year ago

No problem! The CLI for state alignment isn't quite there yet, but coming soon (tonight, I hope).

jsalsman commented 1 year ago

Fantastic! I also hope to try this out ASAP. I wonder whether constraining to the first pass's word boundaries will help. It seems like it can't hurt, but it would be interesting to measure how much.

dhdaines commented 1 year ago

> Fantastic! I also hope to try this out ASAP. I wonder whether constraining to the first pass's word boundaries will help. It seems like it can't hurt, but it would be interesting to measure how much.

It will definitely make the alignment faster. It may also make it more accurate, though I am not certain of this. I have to look at how I implemented this back in 2006: https://www.cs.cmu.edu/~dhuggins/Publications/phlab.pdf

EDIT: that paper was about forward-backward and not alignment, so it is not the same thing at all. In that case I implemented something like semi-Viterbi training, setting "impossible" phone sequences to zero probability, which resulted in models that were better for alignment (but somewhat worse for recognition).

dhdaines commented 1 year ago

> Hoping for state level alignments, and frame level scores also, but LGTM and WFM

State-level alignments are already there in the Python API; look at cython/test/alignment_test.py for an example. It is now easy to add them to the command-line front-end as well, so I'll do that (though not on by default).
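For anyone who wants state durations in seconds rather than frames, something along these lines might work. This is only a sketch: it assumes the alignment is a nested word / phone / state structure where each entry has `name`, `start` and `duration` in frames, and that the front-end runs at 100 frames per second. `state_rows` and its `frame_rate` parameter are hypothetical names.

```python
def state_rows(alignment, frame_rate=100):
    """Flatten a word/phone/state alignment into per-state rows.

    Each row is (word, phone, state, start_seconds, duration_seconds).
    `frame_rate` is frames per second; 100 is an assumed default.
    """
    rows = []
    for word in alignment:
        for phone in word:        # assumed: iterating a word yields phones
            for state in phone:   # assumed: iterating a phone yields states
                rows.append((
                    word.name,
                    phone.name,
                    state.name,
                    state.start / frame_rate,
                    state.duration / frame_rate,
                ))
    return rows
```

Passing it the result of a second alignment pass would give one row per HMM state, which is easy to dump as CSV or JSON.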