alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

python: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte #1279

Open KJ7LNW opened 1 year ago

KJ7LNW commented 1 year ago

So far I've not been able to reproduce this problem, but while using nerd-dictation we hit a Vosk decoding issue that appears to be rooted in the Vosk Python API code. I am running Python 3.6 on CentOS 7 (which gets updates from Red Hat until 2024) with the vosk-model-en-us-0.42-gigaspeech model.

You can see the backtrace below. Notice that the last frame raises the error within the Vosk API at "vosk/__init__.py", line 194, in FinalResult:

Traceback (most recent call last):
  File "./nerd-dictation", line 1962, in <module>
    main()
  File "./nerd-dictation", line 1958, in main
    args.func(args)
  File "./nerd-dictation", line 1845, in <lambda>
    vosk_grammar_file=args.vosk_grammar_file,
  File "./nerd-dictation", line 1440, in main_begin
    vosk_grammar_file=vosk_grammar_file,
  File "./nerd-dictation", line 1215, in text_from_vosk_pipe
    json_text = rec_handle_fn_wrapper_from_final_result()
  File "./nerd-dictation", line 1054, in rec_handle_fn_wrapper_from_final_result
    json_text = rec.FinalResult()
  File "/usr/src/nerd-dictation/lib64/python3.6/site-packages/vosk/__init__.py", line 194, in FinalResult
    return _ffi.string(_c.vosk_recognizer_final_result(self._handle)).decode("utf-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 0: invalid start byte

@ideasman42, the developer of nerd-dictation, suggests that this could be fixed in Vosk by adding errors='ignore'. For example:

>>> b'A\xaeB'.decode('utf-8', errors='ignore')
'AB'

There are 4 different locations where text is decoded as UTF-8, so perhaps they need to be fixed up as well:

  1. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L188
  2. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L191
  3. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L194
  4. https://github.com/alphacep/vosk-api/blob/master/python/vosk/__init__.py#L267
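
One way to apply the same idea at all four call sites would be a single shared decode helper. The sketch below is only an illustration: `decode_result` is a hypothetical name, not part of the Vosk API, and it assumes the raw bytes are available before decoding. It logs the offending bytes before falling back, since silently dropping them would hide whatever is producing the 0xa0.

```python
import logging

logger = logging.getLogger("vosk_decode_sketch")


def decode_result(raw: bytes, errors: str = "replace") -> str:
    """Decode recognizer output, tolerating invalid UTF-8.

    Hypothetical helper for illustration; not part of the Vosk API.
    """
    try:
        # Fast path: well-formed UTF-8 decodes unchanged.
        return raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        # Keep the evidence: log where decoding failed and the leading bytes.
        logger.warning("invalid UTF-8 at byte %d: %r", exc.start, raw[:64])
        # Fall back to a lossy decode using the requested error handler.
        return raw.decode("utf-8", errors=errors)
```

With `errors="ignore"` this behaves like the one-liner above; the default `"replace"` instead marks each bad byte with U+FFFD so the damage stays visible in the result.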

nshmyrev commented 1 year ago

Do you use the original gigaspeech model, or did you modify it? I can't see a way for the original model to return a non-UTF-8 character.

KJ7LNW commented 1 year ago

Original, unmodified.

nshmyrev commented 1 year ago

We need to reproduce it somehow. The 0xa0 output is very strange, to be honest; it feels more like memory corruption. How often do you see this issue?

KJ7LNW commented 1 year ago

I've only seen it once. If it happens again I'll let you know.

nshmyrev commented 1 year ago

OK, let's keep it open. I'll think about how to catch it better.

KJ7LNW commented 1 year ago

There is a possibility that this was triggered because the Vosk object was reset (rec.reset()) from a signal context while the API was executing. Nerd-dictation supports suspend through SIGTSTP/SIGSTOP, so when it gets a stop signal it issues a reset on the Vosk API object. If Vosk happened to be executing at that moment, then it may create an inconsistency in the library. (Note that this is not multi-threading, just interruption from a signal.)

This is only speculation, but I wanted to point it out in case it's a problem being caused external to your API library.
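
To rule the signal path in or out, one option is to keep the handler from touching the recognizer at all: it only records that a reset was requested, and the main loop performs the reset between recognizer calls. The sketch below assumes that structure; `DeferredReset` and `reset_fn` are hypothetical names, and nothing here is taken from nerd-dictation or Vosk. Note that CPython already runs Python-level signal handlers in the main thread between bytecode instructions rather than in the middle of a C call, so whether this changes anything for the reported error is unknown.

```python
import signal


class DeferredReset:
    """Record a reset request in the signal handler; apply it from the main loop."""

    def __init__(self, reset_fn):
        # reset_fn is any zero-argument callable that resets the recognizer
        # (hypothetical; supplied by the caller).
        self.reset_fn = reset_fn
        self.pending = False

    def install(self, signum):
        # Register the handler for the given signal number.
        signal.signal(signum, self.handler)

    def handler(self, signum, frame):
        # Only set a flag here; never call into the recognizer.
        self.pending = True

    def apply_if_pending(self):
        # Call this from the main loop, between recognizer calls.
        if self.pending:
            self.pending = False
            self.reset_fn()
            return True
        return False
```

The main loop would call `apply_if_pending()` once per iteration, before feeding the next audio chunk, so the reset can never overlap with a recognizer call made from Python.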

In terms of troubleshooting, are there any 0xa0 characters in the text generated by the vosk-model-en-us-0.42-gigaspeech model, even if some of them are part of a Unicode sequence? If it is actually a character representation issue in the model and not an issue related to suspending the process and issuing a reset, then by finding all text examples that contain 0xa0, we can try triggering it with those words.
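
A search like that could be sketched as follows, assuming the model's vocabulary can be obtained as a list of strings; where that list comes from is left open, and `words_containing_byte` is a hypothetical name. In valid UTF-8, 0xa0 only ever appears as a continuation byte (for example, U+00A0 encodes as C2 A0 and "à" as C3 A0), which is why it is rejected as a start byte at position 0.

```python
def words_containing_byte(words, byte=0xA0):
    """Return the words whose UTF-8 encoding contains the given byte value."""
    # `int in bytes` tests whether any single byte equals that value.
    return [w for w in words if byte in w.encode("utf-8")]


# Small illustrative vocabulary, not taken from the model.
sample = ["hello", "voil\u00e0", "caf\u00e9", "a\u00a0b"]
candidates = words_containing_byte(sample)
```

Here `candidates` holds "voilà" and the string containing a no-break space; "café" is excluded because "é" encodes as C3 A9.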