Passing cyrillic words as a grammar does not match vocabulary

IgorFedchenko commented 1 year ago

Hello there,

I am using C# with Vosk, and trying to set specific grammar in cyrillic characters.

First of all, this will not work:

var recognizer = new VoskRecognizer(model, SampleRate, "[\"привет\"]");

I believe the reason is that a C# string does not play well with const char* for non-ASCII symbols. In the Vosk logs you get:

LOG (VoskAPI:Recognizer():recognizer.cc:63) ["яЁштхЄ"]
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: 'яЁштхЄ'
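For what it is worth, the garbled text in that log is exactly what you get when "привет" is encoded as Windows-1251 and the resulting bytes are displayed as code page 866. That suggests the native side received ANSI bytes rather than UTF-8. Below is a minimal sketch that reproduces the effect. The class and method names are made up for illustration, and it assumes .NET Core / .NET 5+, where the legacy code pages have to be registered through CodePagesEncodingProvider; the 1251/866 pairing is an assumption about a Russian-locale Windows machine.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Vosk: shows how the log text can arise.
public static class MojibakeDemo
{
    public static string ReinterpretAnsiAsOem(string text)
    {
        // Legacy code pages are not available by default on .NET Core / .NET 5+.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Encode the text the way an ANSI marshaller would on a cp1251 system...
        byte[] ansiBytes = Encoding.GetEncoding(1251).GetBytes(text);

        // ...and decode the same bytes the way a cp866 console would show them.
        return Encoding.GetEncoding(866).GetString(ansiBytes);
    }

    public static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(ReinterpretAnsiAsOem("привет")); // prints яЁштхЄ
    }
}
```

If this is what happens, the vocabulary lookup fails simply because the model stores its words as UTF-8 and never sees UTF-8 bytes.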

Now, let's try to pass same text as unicode escaped:

var grammar = JsonSerializer.Serialize(new[] { "привет" }); // grammar is ["\u043F\u0440\u0438\u0432\u0435\u0442"]
var recognizer = new VoskRecognizer(model, SampleRate, grammar);

Now the escaped characters are passed through intact, but I believe the text is not decoded and so is not found in the model. Logs:

LOG (VoskAPI:Recognizer():recognizer.cc:63) ["\\u043F\\u0440\\u0438\\u0432\\u0435\\u0442"]
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: '\\u043F\\u0440\\u0438\\u0432\\u0435\\u0442'
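As a side note, the \u escapes come from the default System.Text.Json encoder, which escapes all non-ASCII characters. A minimal sketch of producing the grammar JSON with literal Cyrillic follows; the GrammarJson class is a made-up name for illustration. This only changes the JSON text and does nothing about how the string is marshalled to the native library, so on its own it presumably leads back to the first problem.

```csharp
using System;
using System.Text.Encodings.Web;
using System.Text.Json;
using System.Text.Unicode;

// Hypothetical helper, not part of Vosk.
public static class GrammarJson
{
    public static string Serialize(string[] words)
    {
        var options = new JsonSerializerOptions
        {
            // Allow Basic Latin and Cyrillic to be written literally
            // instead of as \uXXXX escapes.
            Encoder = JavaScriptEncoder.Create(UnicodeRanges.BasicLatin, UnicodeRanges.Cyrillic)
        };
        return JsonSerializer.Serialize(words, options);
    }

    public static void Main()
    {
        Console.OutputEncoding = System.Text.Encoding.UTF8;
        Console.WriteLine(Serialize(new[] { "привет" })); // prints ["привет"]
    }
}
```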

Although obviously "привет" is there (in the vocabulary).

Not sure if this is a bug/missing feature, or I am just missing something - could you give some thoughts @nshmyrev ?

P.S. I tried calling the native API directly with PInvoke, marshalling the string in different ways or even passing bytes directly in different encodings, but could not find a way that works. There should be one, though, because there are models with non-ASCII characters in their language.

Thanks in advance!

nshmyrev commented 1 year ago

Something like

new VoskRecognizer(model, SampleRate, Encoding.UTF8.GetString(Encoding.Default.GetBytes("привет")));

should help you to convert cp1251 string to utf-8.

IgorFedchenko commented 1 year ago

@nshmyrev First of all - thanks for quick reply!

Playing with `string` encoding on C# side (does not work)

Unfortunately, it does not work this way. AFAIK a C++ string stores whatever bytes you put into it (i.e. it preserves the encoding), but in C# you cannot have a "UTF-8 string" or a "Windows-1251 string": as long as you are working with the string class, text is always stored as UTF-16. From the docs: ".NET uses UTF-16 encoding (represented by the [UnicodeEncoding](https://learn.microsoft.com/en-us/dotnet/api/system.text.unicodeencoding) class) for string instances." You can get a byte[] for a given string using the Encoding class for the encoding you need, but putting the bytes back into a string stores the text as UTF-16 again.

So when you pass a string to the C++ wrapper, it is passed as UTF-16. It could probably be solved by using const wchar_t* instead (like proposed here), but my C++ knowledge is very basic, so I am not sure whether that is a simple replacement.

But just to confirm, here is your example and its output:

var g = Encoding.UTF8.GetString(Encoding.Default.GetBytes("[\"привет\"]"));
var rec = new VoskRecognizer(model, SampleRate, g);

LOG (VoskAPI:Recognizer():recognizer.cc:63) ["яЁштхЄ"]
WARNING (VoskAPI:Recognizer():recognizer.cc:84) Ignoring word missing in vocabulary: 'яЁштхЄ'

This is expected, because Encoding.UTF8.GetString(Encoding.Default.GetBytes("[\"привет\"]")) returns a string, and a string is always stored as UTF-16.
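To make the point concrete, here is a small sketch (the class name is made up for illustration) showing that the encoding only exists at the byte[] level. The UTF-8 and UTF-16 byte sequences for the same word differ, but decoding the UTF-8 bytes gives back an ordinary string equal to the original.

```csharp
using System;
using System.Text;

// Hypothetical demo class, not part of Vosk.
public static class StringEncodingDemo
{
    // Encode to UTF-8 and decode again: the result is a normal .NET string.
    public static string RoundTripThroughUtf8(string text)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(utf8);
    }

    public static void Main()
    {
        string word = "привет";

        byte[] utf8 = Encoding.UTF8.GetBytes(word);      // starts with D0 BF for 'п'
        byte[] utf16 = Encoding.Unicode.GetBytes(word);  // starts with 3F 04 for 'п'

        Console.WriteLine(BitConverter.ToString(utf8, 0, 2));   // D0-BF
        Console.WriteLine(BitConverter.ToString(utf16, 0, 2));  // 3F-04
        Console.WriteLine(RoundTripThroughUtf8(word) == word);  // True
    }
}
```

So re-encoding tricks on the string itself cannot change what the marshaller sends; only the marshalling setup, or passing raw bytes, can.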

Passing encoded bytes directly (works, but something wrong on c++ side)

To pass a string in a custom encoding to C++, you can pass it as a pointer to a byte array. This is not supported by the C# wrapper at the moment, so I had to apply some reflection:

// Define PInvoke to support passing char* as raw pointer
[DllImport("libvosk", EntryPoint = "vosk_recognizer_new_grm")]
public static extern IntPtr new_VoskRecognizerGrm(HandleRef jarg1, float jarg2, IntPtr grammar);

var model = new Model(ModelPath);

// get unmanaged pointer to utf-8 encoded string as byte array
var grammar = "[\"привет\"]";
var bytes = Encoding.UTF8.GetBytes(grammar);
GCHandle pinnedArray = GCHandle.Alloc(bytes, GCHandleType.Pinned);
IntPtr pointer = pinnedArray.AddrOfPinnedObject();

// get Model's unmanaged handle, and pass it with utf-8 bytes to C++ code
var handle = (HandleRef)model.GetType().GetField("handle", BindingFlags.Instance | BindingFlags.NonPublic).GetValue(model);
var recPointer = new_VoskRecognizerGrm(handle, SampleRate, pointer);

Now, using a console that displays UTF-8 characters properly, I am getting this:

So I can see that C++ does load my grammar correctly, but it still cannot match it against the vocabulary. Probably this is related to how strings are represented in the model, or something like that.
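One thing that may be worth ruling out: Encoding.UTF8.GetBytes does not append a terminating zero byte, while a const char* parameter is read up to the first zero. Whether that is what goes wrong here is only a guess. A minimal sketch of building a NUL-terminated UTF-8 buffer follows; the NativeUtf8 class and its methods are hypothetical helpers, not part of the Vosk wrapper.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical helpers, not part of Vosk.
public static class NativeUtf8
{
    // Returns the UTF-8 bytes of the text followed by a single 0 byte.
    public static byte[] ToNullTerminatedUtf8(string text)
    {
        int byteCount = Encoding.UTF8.GetByteCount(text);
        byte[] buffer = new byte[byteCount + 1]; // the extra byte stays 0
        Encoding.UTF8.GetBytes(text, 0, text.Length, buffer, 0);
        return buffer;
    }

    // Copies the terminated bytes to unmanaged memory.
    // The caller must release the pointer with Marshal.FreeHGlobal.
    public static IntPtr AllocUtf8(string text)
    {
        byte[] bytes = ToNullTerminatedUtf8(text);
        IntPtr pointer = Marshal.AllocHGlobal(bytes.Length);
        Marshal.Copy(bytes, 0, pointer, bytes.Length);
        return pointer;
    }
}
```

With this, the pointer passed as the grammar argument would come from NativeUtf8.AllocUtf8(grammar) and be freed once the recognizer has been created, assuming the native side copies the grammar during construction.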

@nshmyrev what do you think? Based on the screenshot above, it feels like something should be tuned on C++ side.

P.S. Let me know if creating sample reproduction project on GitHub can help you to address this, or if there is a work-around.

nshmyrev commented 1 year ago

Ok, we can change wrapper code to marshal to utf-8 in the constructor. I'll implement a bit later.
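For reference, one way such a change could look on the C# side is a P/Invoke declaration that asks the runtime to marshal the string as UTF-8. This is only a sketch and not the actual wrapper code: the class and method names are made up, the entry point is the one used earlier in this thread, and UnmanagedType.LPUTF8Str requires .NET Framework 4.7+ or .NET Core.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical declaration for illustration; not the generated Vosk wrapper.
internal static class VoskNativeSketch
{
    // LPUTF8Str makes the runtime pass the string as a NUL-terminated
    // UTF-8 buffer instead of using the default ANSI conversion.
    [DllImport("libvosk", EntryPoint = "vosk_recognizer_new_grm")]
    internal static extern IntPtr NewRecognizerGrm(
        HandleRef model,
        float sampleRate,
        [MarshalAs(UnmanagedType.LPUTF8Str)] string grammar);
}
```

The declaration compiles without the native library present; the library is only looked up when the method is first called.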

IgorFedchenko commented 1 year ago

OK. Although in the second part of my message I think the text is passed as UTF-8, and it is still not found in the vocabulary... Anyway, if you can confirm that Cyrillic characters work fine in a grammar after the fix, let me know.

Thanks again for the quick response! Not something that always happens in open source :)

alphacep / vosk-api