cmusphinx / pocketsphinx

A small speech recognizer

Failed to reach final state in alignment #339

Closed mharvilla closed 1 year ago

mharvilla commented 1 year ago

I am using Python PocketSphinx 5.0.0 for US English forced alignment. Unfortunately, I am frequently encountering the following error when doing alignments, despite the transcript being correct and the recording being free of noise.

ERROR: "state_align_search.c", line 228: Failed to reach final state in alignment
Failed on second-pass alignment: Failed to stop utterance processing

The following is a code snippet from my project which does the actual alignment using PocketSphinx.

if self._transcript:
    try:
        self.decoder.start_utt()
        self.decoder.process_raw(raw_audio, full_utt=True)
        self.decoder.end_utt()
    except RuntimeError as e:
        print(f'Failed on first-pass alignment: {e}')

    try:
        self.decoder.set_alignment()
    except RuntimeError as e:
        print(f'Failed to set alignment: {e}')

    try:
        self.decoder.start_utt()
        self.decoder.process_raw(raw_audio, full_utt=True)
        self.decoder.end_utt()
    except RuntimeError as e:
        print(f'Failed on second-pass alignment: {e}')

    self.alignment = self.parse_alignment(self.decoder.get_alignment())

As you can see, the final try-except block is being tripped.

Is there a way to loosen some parameters in PocketSphinx so that it can at least produce an alignment (even if suboptimal) without throwing an exception? I cannot find anything in the docs.

P.S. I am aware that there are better tools out there for FA, but this lightweight library is expedient to work with at the moment.

Thanks in advance for any advice.

dhdaines commented 1 year ago

Hi - it's actually a bit strange that the first-pass alignment would succeed while the second-pass one would fail... You can try setting all the beams very wide, like we do in ReadAlongs: https://github.com/ReadAlongs/Studio/blob/main/readalongs/align.py#L243

(we're actually using https://github.com/ReadAlongs/SoundSwallower for force-alignment, which is basically PocketSphinx but without various extra stuff...)

mharvilla commented 1 year ago

Thank you @dhdaines for your speedy response. I will give this a try.

mharvilla commented 1 year ago

Same issue. I'm noticing that it seems to happen more often for audio samples that are spoken slowly, and/or when the syllables of certain words are stretched out. Such samples are deliberate in the research that I'm doing. I've attached two examples that fail for the transcript: "feels like these days go on forever"

>>> decoder = Decoder(beam=0, wbeam=0, pbeam=0)
>>> decoder.set_align_text('feels like these days go on forever')
>>> decoder.start_utt()
>>> decoder.process_raw(raw_audio, full_utt=True)
>>> decoder.end_utt()
>>> decoder.set_alignment()
>>> decoder.start_utt()
>>> decoder.process_raw(raw_audio, full_utt=True)
>>> decoder.end_utt()
ERROR: "state_align_search.c", line 228: Failed to reach final state in alignment
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_pocketsphinx.pyx", line 993, in _pocketsphinx.Decoder.end_utt
RuntimeError: Failed to stop utterance processing

Last "ER" is sustained on purpose: https://drive.google.com/file/d/1T20QiXTlXLbI58qU7eDp_BQ3MZ6EJBo5/view?usp=share_link

Overall just spoken very slowly: https://drive.google.com/file/d/1eA5-FcXJd51HNaJiU9VpwZEvqZskaKsT/view?usp=share_link

dhdaines commented 1 year ago

Ah, thanks for the examples! On the face of it, these don't seem like they should be failing. I'll take a bit of time on Friday to see if there is a bug here.

Would it be okay to add them to the test suite?

mharvilla commented 1 year ago

Sure, feel free to add them to the tests. Thanks!

dhdaines commented 1 year ago

Hi, sorry I didn't manage to get to this yet! I am doing a bug-fix release which, unfortunately, won't fix this, as it may be a bit too involved to debug. One thing you may try to work around this is to not call end_utt before getting the alignment, because often it's the case that getting a "partial" result of force-alignment will succeed, but the "final" result will fail.

mharvilla commented 1 year ago

So this means that it won't be fixed...? :( I'll try the workaround.

dhdaines commented 1 year ago

I'll hopefully fix it! Just not this week...


dhdaines commented 1 year ago

OK! So, um, this may be specific to the Python API, or to the way you are using it, because from the command-line it works just fine for me:

$ ./build/pocketsphinx -hmm model/en-us/en-us -dict model/en-us/cmudict-en-us.dict align test/data/forever/input_2_16k.wav feels like these days go on forever
{"b":0.000,"d":3.350,"p":1.000,"t":"feels like these days go on forever","w":[{"b":0.000,"d":0.440,"p":0.957,"t":"feels"},{"b":0.440,"d":0.300,"p":0.973,"t":"like"},{"b":0.740,"d":0.280,"p":0.975,"t":"these"},{"b":1.020,"d":0.400,"p":0.976,"t":"days"},{"b":1.420,"d":0.180,"p":0.981,"t":"go"},{"b":1.600,"d":0.370,"p":0.981,"t":"on(2)"},{"b":1.970,"d":1.230,"p":0.948,"t":"forever"},{"b":3.200,"d":0.140,"p":0.948,"t":"<sil>"}]}
$ ./build/pocketsphinx -hmm model/en-us/en-us -dict model/en-us/cmudict-en-us.dict align test/data/forever/input_4_16k.wav feels like these days go on forever
{"b":0.000,"d":5.760,"p":1.000,"t":"feels like these days go on forever","w":[{"b":0.000,"d":0.130,"p":0.943,"t":"<sil>"},{"b":0.130,"d":0.550,"p":0.938,"t":"<sil>"},{"b":0.680,"d":0.670,"p":0.898,"t":"feels"},{"b":1.350,"d":0.590,"p":0.912,"t":"like"},{"b":1.940,"d":0.550,"p":0.918,"t":"these"},{"b":2.490,"d":0.550,"p":0.966,"t":"days"},{"b":3.040,"d":0.510,"p":0.958,"t":"go"},{"b":3.550,"d":0.600,"p":0.963,"t":"on(2)"},{"b":4.150,"d":0.190,"p":0.947,"t":"<sil>"},{"b":4.340,"d":0.560,"p":0.937,"t":"forever"},{"b":4.900,"d":0.850,"p":0.904,"t":"<sil>"}]}

I'll poke around a bit with the code you provided above.

mharvilla commented 1 year ago

I see. Maybe you can replicate the error using the following sequence of calls, which is how I encounter it (on the last end_utt call).

Decoder.start_utt()
Decoder.process_raw(...)
Decoder.end_utt()
Decoder.set_alignment()
Decoder.start_utt()
Decoder.process_raw(...)
Decoder.end_utt()

I can share my actual code if you want.

Thank you for the update.

dhdaines commented 1 year ago

Hm - the call to set_alignment() there seems suspicious to me. This reminds me that in my CLI example above I wasn't doing phone-level alignment. But yes, this works too:

$ ./build/pocketsphinx -phone_align yes -hmm model/en-us/en-us -dict model/en-us/cmudict-en-us.dict align test/data/forever/input_2_16k.wav feels like these days go on forever
{"b":0.000,"d":3.350,"p":1.000,"t":"feels like these days go on forever","w":[{"b":0.000,"d":0.440,"p":0.973,"t":"feels","w":[{"b":0.000,"d":0.170,"p":0.992,"t":"F"},{"b":0.170,"d":0.090,"p":0.995,"t":"IY"},{"b":0.260,"d":0.090,"p":0.995,"t":"L"},{"b":0.350,"d":0.090,"p":0.992,"t":"Z"}]},{"b":0.440,"d":0.300,"p":0.972,"t":"like","w":[{"b":0.440,"d":0.110,"p":0.982,"t":"L"},{"b":0.550,"d":0.080,"p":0.995,"t":"AY"},{"b":0.630,"d":0.110,"p":0.994,"t":"K"}]},{"b":0.740,"d":0.280,"p":0.982,"t":"these","w":[{"b":0.740,"d":0.040,"p":0.995,"t":"DH"},{"b":0.780,"d":0.120,"p":0.994,"t":"IY"},{"b":0.900,"d":0.120,"p":0.993,"t":"Z"}]},{"b":1.020,"d":0.400,"p":0.954,"t":"days","w":[{"b":1.020,"d":0.050,"p":0.980,"t":"D"},{"b":1.070,"d":0.240,"p":0.985,"t":"EY"},{"b":1.310,"d":0.110,"p":0.988,"t":"Z"}]},{"b":1.420,"d":0.180,"p":0.978,"t":"go","w":[{"b":1.420,"d":0.090,"p":0.984,"t":"G"},{"b":1.510,"d":0.090,"p":0.995,"t":"OW"}]},{"b":1.600,"d":0.370,"p":0.985,"t":"on(2)","w":[{"b":1.600,"d":0.280,"p":0.991,"t":"AO"},{"b":1.880,"d":0.090,"p":0.995,"t":"N"}]},{"b":1.970,"d":1.230,"p":0.938,"t":"forever","w":[{"b":1.970,"d":0.140,"p":0.988,"t":"F"},{"b":2.110,"d":0.130,"p":0.993,"t":"ER"},{"b":2.240,"d":0.100,"p":0.995,"t":"EH"},{"b":2.340,"d":0.100,"p":0.987,"t":"V"},{"b":2.440,"d":0.760,"p":0.974,"t":"ER"}]},{"b":3.200,"d":0.140,"p":0.978,"t":"<sil>","w":[{"b":3.200,"d":0.140,"p":0.978,"t":"SIL"}]}]}
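For anyone consuming this output from a script, here is a minimal sketch of how the nested JSON could be flattened into rows. It assumes that "b" is the begin time and "d" the duration in seconds, which is inferred from the numbers above (each b plus d equals the next b) rather than taken from documentation. The helper name flatten is hypothetical, and SAMPLE is just the "go" entry copied from the output above.

```python
import json

# One word entry copied from the phone-level output above (the "go" segment).
SAMPLE = (
    '{"b":1.420,"d":0.180,"p":0.978,"t":"go","w":['
    '{"b":1.420,"d":0.090,"p":0.984,"t":"G"},'
    '{"b":1.510,"d":0.090,"p":0.995,"t":"OW"}]}'
)


def flatten(segment, depth=0):
    """Return (depth, text, begin, end) rows for a segment and its children.

    Assumption: "b" is the begin time and "d" the duration, so end = b + d.
    Children live under the optional "w" key (words, then phones).
    """
    end = round(segment["b"] + segment["d"], 3)
    rows = [(depth, segment["t"], segment["b"], end)]
    for child in segment.get("w", []):
        rows.extend(flatten(child, depth + 1))
    return rows


rows = flatten(json.loads(SAMPLE))
for depth, text, begin, end in rows:
    # Indent phones under their word.
    print(f"{'  ' * depth}{text:8s} {begin:.3f} {end:.3f}")
```

The same function works on the whole line of CLI output, since the top-level object has the same shape as its children.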

I'm debugging the Python stuff now (there are some annoying build issues that I should really fix too...)

dhdaines commented 1 year ago

aah - of course, calling set_alignment is exactly what the documentation says to do! I'll let you know in a bit...

dhdaines commented 1 year ago

Okay, I have found the problem. The issue is the bestpath configuration option, which should always be disabled when doing phone-level alignment. In the CLI this is the case, but not in the Python module, because ... well, because I forgot.

You can simply add bestpath=False to the arguments to the Decoder() constructor to work around this for now, but it will be fixed shortly.
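As a rough sketch of that workaround, the two-pass sequence from earlier in this thread could be wrapped as follows. Only the Decoder calls already shown in this thread are used. The helper names alignment_config and force_align are hypothetical, and this has not been run against the failing recordings.

```python
def alignment_config(**overrides):
    """Build Decoder keyword arguments with bestpath disabled.

    Hypothetical helper: bestpath=False is the workaround described above;
    any extra keyword arguments are passed through unchanged.
    """
    config = {"bestpath": False}
    config.update(overrides)
    return config


def force_align(raw_audio, text, **overrides):
    """Two-pass forced alignment, following the call sequence in this thread."""
    # Imported lazily so the config helper is usable without the library.
    from pocketsphinx import Decoder

    decoder = Decoder(**alignment_config(**overrides))
    decoder.set_align_text(text)

    # First pass: word-level alignment against the transcript.
    decoder.start_utt()
    decoder.process_raw(raw_audio, full_utt=True)
    decoder.end_utt()

    # Second pass: phone/state-level alignment of the first-pass result.
    decoder.set_alignment()
    decoder.start_utt()
    decoder.process_raw(raw_audio, full_utt=True)
    decoder.end_utt()
    return decoder.get_alignment()
```

Here raw_audio is assumed to be the same raw audio bytes passed to process_raw in the snippets above.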

The technical details: bestpath does lattice rescoring of the results of first-pass search. This is actually more or less useless for FSG search, except for producing posterior probabilities (which may or may not be meaningless). The problem is that to do this we need to convert the lattice to a DAG, which means that we need to decide which lattice node is the final one. If you add loglevel=INFO to the Decoder() constructor, you will see these somewhat suspicious messages:

INFO: fsg_search.c(1265): Start node candidate feels.0:32:49
INFO: fsg_search.c(1304): End node candidate <sil>.320:324:333 (-491)
INFO: fsg_search.c(1304): End node candidate forever.197:239:333 (-458)
INFO: fsg_search.c(1529): lattice start node feels.0 end node </s>.334

That last one! It's bad! Why? Because it means we didn't find a </s> in the lattice, so we create a fake one in order to be able to do rescoring (which, again, is pretty much useless for FSG search). Indeed, if you add backtrace=True to the constructor you'll see it (look at the end of this list):

word                 start end   pprob ascr       lscr       lback
feels                0     43    1.000 -449536    0          1
like                 44    73    1.000 -280576    0          1
these                74    101   1.000 -253952    0          1
days                 102   141   1.000 -251904    0          1
go                   142   159   1.000 -204800    0          1
on(2)                160   196   1.000 -195584    0          1
forever              197   334   1.000 -468992    0          1
</s>                 334   334   1.000 -468992    0          1

It is not possible for the state align search to align a phone with only a single frame! So it will always fail. I will spend a bit of time to make the code more robust in general here.
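To make the failure condition concrete, here is a small sketch that scans the backtrace rows above for entries spanning a single frame. It is only an illustration: the helper names are hypothetical, end frames are assumed to be inclusive (as the table suggests, since "feels" ends at 43 and "like" starts at 44), and the min_frames threshold of 2 is an assumption that simply encodes "more than a single frame".

```python
# Backtrace rows copied from the table above (header line omitted).
BACKTRACE = """\
feels                0     43    1.000 -449536    0          1
like                 44    73    1.000 -280576    0          1
these                74    101   1.000 -253952    0          1
days                 102   141   1.000 -251904    0          1
go                   142   159   1.000 -204800    0          1
on(2)                160   196   1.000 -195584    0          1
forever              197   334   1.000 -468992    0          1
</s>                 334   334   1.000 -468992    0          1
"""


def parse_backtrace(text):
    """Return (word, start_frame, end_frame) for each backtrace row."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        rows.append((fields[0], int(fields[1]), int(fields[2])))
    return rows


def too_short(rows, min_frames=2):
    """List words whose span is shorter than min_frames.

    Assumption: end frames are inclusive, so a span covers end - start + 1
    frames. min_frames=2 only encodes "more than a single frame".
    """
    return [word for word, start, end in rows if end - start + 1 < min_frames]


flagged = too_short(parse_backtrace(BACKTRACE))
print(flagged)
```

On the rows above, only the fake </s> entry is flagged, which matches the explanation of why the second pass fails.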

dhdaines commented 1 year ago

I think the proper fix here was not to disable bestpath but for the state align search to simply ignore any illegal orders to align things that can't be aligned.