V-Sekai / godot-whisper

A GDExtension addon for the Godot Engine that enables realtime audio transcription, supports OpenCL for most platforms and Metal for Apple devices, and runs on a separate thread.
MIT License

Invalid encoding #44

aiaimimi0920 closed this issue 5 months ago

aiaimimi0920 commented 5 months ago

Because I wanted to pass the text to Godot and support Chinese, I used

cur_transcribed_msg["text"] = cur_text.utf8(transcribed[i].text.c_str());

but in some cases, it may trigger invalid encoding,

like this Unicode parsing error, some characters were replaced with ? (U+FFFD): Invalid UTF-8 leading byte (98)

I'm not quite sure about the cause:

  1. Is it related to the fact that I am using Windows? See https://github.com/ggerganov/whisper.cpp/pull/1313
  2. Is it related to the utf8 method?

Could you help me solve this problem?

fire commented 5 months ago

@bruvzg Do you have any idea how to resolve this to support Chinese?

bruvzg commented 5 months ago

utf8 is a static function, so it should be:

  Dictionary cur_transcribed_msg;
  cur_transcribed_msg["is_partial"] = transcribed[i].is_partial;
- String cur_text;
- cur_transcribed_msg["text"] = cur_text.utf8(transcribed[i].text.c_str());
+ cur_transcribed_msg["text"] = String::utf8(transcribed[i].text.c_str());
  ret.push_back(cur_transcribed_msg);

The error is most likely correct and your string is not UTF-8 (it might be some unsupported variant like CESU-8). Can you provide the raw content of the source string that is causing this error?

bruvzg commented 5 months ago

Or something is wrong with the earlier msg.text processing: it seems to insert text into a string, probably without taking UTF-8 character sizes into account, so it might be inserting in the middle of a multi-byte sequence. Or some segments received from whisper_full_get_token_text are incorrectly skipped, which cuts out part of a sequence.

So ideally, please provide (for the problematic text), otherwise it's hard to deduce what's wrong and at what step:

aiaimimi0920 commented 5 months ago

Can I output the text you need through UtilityFunctions::print, or through some other function?

UtilityFunctions::print("token", token);
UtilityFunctions::print("text", text);
UtilityFunctions::print("msg.text", msg.text);
UtilityFunctions::print("utf8-msg.text", String::utf8(msg.text.c_str()));

bruvzg commented 5 months ago

Can I output the text you need through UtilityFunctions::print

No, it will try to print it as a string and fail if there's something wrong, so you'll need to use a custom print function, something like:


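#include <cstdio>
#include <string>

// Prints each byte of the string as two uppercase hex digits, separated by spaces.
void print_hex(const std::string &p_string) {
    for (std::size_t i = 0; i < p_string.size(); i++) {
        if (i > 0) printf(" ");
        // Cast to unsigned char so bytes >= 0x80 are not sign-extended.
        printf("%02X", (unsigned char)p_string[i]);
    }
    printf("\n");
}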
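
Once the bytes are dumped, a small structural check can point at the first offending byte. This is only a minimal sketch, not Godot's own parser: the name first_invalid_utf8 is hypothetical, and it only checks lead and continuation byte structure (it does not reject overlong encodings or surrogates).

```cpp
#include <cstddef>
#include <string>

// Returns the offset of the first byte that breaks UTF-8 structure,
// or std::string::npos if the string is structurally well formed.
// Simplified on purpose: overlong encodings and surrogates are not rejected.
std::size_t first_invalid_utf8(const std::string &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (lead < 0x80) {
            len = 1; // ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            len = 2;
        } else if ((lead & 0xF0) == 0xE0) {
            len = 3;
        } else if ((lead & 0xF8) == 0xF0) {
            len = 4;
        } else {
            return i; // continuation byte (0x80..0xBF) or invalid lead byte
        }
        if (i + len > s.size()) {
            return i; // sequence is cut off at the end of the string
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) {
                return i + k; // expected a continuation byte here
            }
        }
        i += len;
    }
    return std::string::npos;
}
```

For the two bytes 98 AF this returns 0, because 0x98 is a continuation byte sitting in a leading position, which would be consistent with the "Invalid UTF-8 leading byte (98)" message.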

aiaimimi0920 commented 5 months ago

OK, I will test it.

aiaimimi0920 commented 5 months ago

Thank you very much for your help. Indeed, as you said, the problem was that I was skipping the text of some tokens:

if (token.p > 0.6 && token.plog < -0.5) {
  WARN_PRINT("Skipping token " + String::num(token.p) + " " + String::num(token.plog) + " " + text);
  continue;
}
if (token.plog < -1.0) {
  WARN_PRINT("Skipping token low plog " + String::num(token.p) + " " + String::num(token.plog) + " " + text);
  continue;
}

It may be related to this issue, since a token does not necessarily correspond to a complete UTF-8 character: https://github.com/ggerganov/whisper.cpp/pull/1313#issuecomment-1875832602

In summary, I removed the step that skipped the text of tokens based on their probability, and everything is now working properly.
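
For anyone who still wants to filter low-probability tokens, one possible alternative is to skip a token only when its bytes start and end on a UTF-8 character boundary. The sketch below is not the godot-whisper code: Piece, low_confidence, pending_after and join_pieces are hypothetical names, and the policy (always keep any piece that is part of a split character) is just an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one token's text plus the result of a confidence check.
struct Piece {
    std::string bytes;
    bool low_confidence;
};

// Given how many continuation bytes were still expected before `bytes`,
// returns how many are still expected after it (0 means "on a character boundary").
std::size_t pending_after(std::size_t pending, const std::string &bytes) {
    for (unsigned char c : bytes) {
        if (pending > 0 && (c & 0xC0) == 0x80) {
            --pending;            // continuation byte of the current character
        } else if ((c & 0xE0) == 0xC0) {
            pending = 1;          // lead byte of a 2-byte character
        } else if ((c & 0xF0) == 0xE0) {
            pending = 2;          // lead byte of a 3-byte character
        } else if ((c & 0xF8) == 0xF0) {
            pending = 3;          // lead byte of a 4-byte character
        } else {
            pending = 0;          // ASCII or unexpected byte: resynchronize
        }
    }
    return pending;
}

// Concatenates pieces, dropping a low-confidence piece only if removing it
// cannot cut a multi-byte character in half.
std::string join_pieces(const std::vector<Piece> &pieces) {
    std::string out;
    std::size_t pending = 0;
    for (const Piece &p : pieces) {
        const bool starts_at_boundary = (pending == 0);
        pending = pending_after(pending, p.bytes);
        const bool ends_at_boundary = (pending == 0);
        if (p.low_confidence && starts_at_boundary && ends_at_boundary) {
            continue; // safe to skip: a whole number of characters
        }
        out += p.bytes;
    }
    return out;
}
```

With the pieces E6 (low confidence), 98 AF, and "!" (low confidence), the first piece is kept because it ends in the middle of a character, so the result is the intact three-byte sequence E6 98 AF, and only the "!" is dropped.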

Thank you again for your help @bruvzg