Invalid encoding - Githubissues

V-Sekai / godot-whisper

A GDExtension addon for the Godot Engine that enables realtime audio transcription, supports OpenCL for most platforms, Metal for Apple devices, and runs on a separate thread.

MIT License

Invalid encoding #44

Closed aiaimimi0920 closed 10 months ago

aiaimimi0920 commented 10 months ago

Because I wanted to pass the text to Godot and support Chinese, I used

cur_transcribed_msg["text"] = cur_text.utf8(transcribed[i].text.c_str());

but in some cases, it may trigger invalid encoding,

like this Unicode parsing error, some characters were replaced with ? (U+FFFD): Invalid UTF-8 leading byte (98)

I'm not quite sure what causes it:

Is it related to Windows, which I am using? See https://github.com/ggerganov/whisper.cpp/pull/1313
Is it related to the utf8 method?

Could you help solve this problem?

fire commented 10 months ago

@bruvzg Do you have any idea how to resolve this to support Chinese?

bruvzg commented 10 months ago

utf8 is a static function, so it should be:

  Dictionary cur_transcribed_msg;
  cur_transcribed_msg["is_partial"] = transcribed[i].is_partial;
- String cur_text;
- cur_transcribed_msg["text"] = cur_text.utf8(transcribed[i].text.c_str());
+ cur_transcribed_msg["text"] = String::utf8(transcribed[i].text.c_str());
  ret.push_back(cur_transcribed_msg);

The error is most likely correct and your string is not valid UTF-8 (it might be some unsupported variant like CESU-8). Can you provide the raw content of the source string that is causing this error?

bruvzg commented 10 months ago

Or something is wrong with the earlier msg.text processing: it seems to insert things into the string, probably without taking UTF-8 character sizes into account, so it might be adding them in the middle of a multi-byte sequence. Or some segments received from whisper_full_get_token_text are incorrectly skipped, cutting out part of a sequence.

So ideally, please provide the following for the problematic text; otherwise it's hard to deduce what's wrong and at which step:

The raw output (hex-encoded byte sequence, as it is in memory) of whisper_full_get_token_text, for all separate segments.
The raw content of transcribed[i].text.
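
To narrow down where a byte sequence breaks, a structural UTF-8 check can report the offset of the first offending byte. The following is a minimal sketch, not code from godot-whisper or Godot: the name first_invalid_utf8 is hypothetical, and it only checks lead/continuation structure (it does not reject overlong encodings or surrogates). As a side note, 0x98 has the bit pattern 10011000, which marks a continuation byte; assuming the 98 in the reported error is hexadecimal, that would be consistent with the start of a multi-byte sequence having been cut off.

```cpp
#include <cstddef>
#include <string>

// Returns the offset of the first byte that breaks UTF-8 structure,
// or std::string::npos if the whole string is structurally valid.
// Hypothetical helper: checks lead/continuation layout only.
std::size_t first_invalid_utf8(const std::string &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        if (c < 0x80) {
            len = 1; // ASCII
        } else if ((c & 0xE0) == 0xC0) {
            len = 2;
        } else if ((c & 0xF0) == 0xE0) {
            len = 3;
        } else if ((c & 0xF8) == 0xF0) {
            len = 4;
        } else {
            return i; // continuation byte (or invalid byte) where a lead byte is expected
        }
        if (i + len > s.size()) {
            return i; // sequence is truncated at the end of the string
        }
        for (std::size_t k = 1; k < len; k++) {
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) {
                return i + k; // expected a continuation byte here
            }
        }
        i += len;
    }
    return std::string::npos;
}
```

Running this on each token's text and on the joined transcribed[i].text would show whether the bytes are already broken per token or only after joining.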

aiaimimi0920 commented 10 months ago

Can I output the text you need through UtilityFunctions::print, or through some other function?

UtilityFunctions::print("token", token);
UtilityFunctions::print("text", text);
UtilityFunctions::print("msg.text", msg.text);
UtilityFunctions::print("utf8-msg.text", String::utf8(msg.text.c_str()));

bruvzg commented 10 months ago

Can I output the text you need through UtilityFunctions:: print

No, it will try to print it as a string and fail if there's something wrong, so you'll need to use a custom print function, something like:


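#include <cstdio>
#include <string>

// Print each byte of p_string as two hex digits, separated by spaces.
void print_hex(const std::string &p_string) {
    for (size_t i = 0; i < p_string.size(); i++) {
        if (i > 0) printf(" ");
        // Cast to unsigned char so bytes >= 0x80 are not sign-extended.
        printf("%02X", (unsigned char)p_string[i]);
    }
    printf("\n");
}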
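
If stdout is not convenient to capture, the same dump can be built as a string instead. This is a small sketch under that assumption; to_hex is a hypothetical name, not part of godot-cpp or whisper.cpp. Because the result contains only ASCII hex digits and spaces, it should be safe to hand to a text logger, although that was not tested against UtilityFunctions::print here.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the bytes of `bytes` as space-separated
// two-digit uppercase hex values, e.g. "E4 BD A0".
std::string to_hex(const std::string &bytes) {
    std::string out;
    char buf[4]; // two hex digits plus the terminating NUL fit easily
    for (std::size_t i = 0; i < bytes.size(); i++) {
        if (i > 0) {
            out += ' ';
        }
        // Go through unsigned char first so bytes >= 0x80 are not sign-extended.
        unsigned value = static_cast<unsigned char>(bytes[i]);
        std::snprintf(buf, sizeof(buf), "%02X", value);
        out += buf;
    }
    return out;
}
```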

aiaimimi0920 commented 10 months ago

ok, I will try to test it

aiaimimi0920 commented 10 months ago

Thank you very much for your help. Indeed, as you said, it happened because I was skipping the text of some tokens:

if (token.p > 0.6 && token.plog < -0.5) {
  WARN_PRINT("Skipping token " + String::num(token.p) + " " + String::num(token.plog) + " " + text);
  continue;
}
if (token.plog < -1.0) {
  WARN_PRINT("Skipping token low plog " + String::num(token.p) + " " + String::num(token.plog) + " " + text);
  continue;
}

It may be related to this issue, since a token does not necessarily correspond to a complete UTF-8 character: https://github.com/ggerganov/whisper.cpp/pull/1313#issuecomment-1875832602

In summary, I removed the step that skipped the text of tokens based on their probability, and everything now works properly.
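
For anyone who wants to keep a probability filter, one possible approach is to make the skip decision per complete UTF-8 character rather than per token. The sketch below is only an illustration and not what was done in this issue: TokenPiece, missing_continuation_bytes and join_tokens are hypothetical names, and the policy of dropping a whole character when any of its tokens is low-confidence is an assumption.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one token: its raw bytes plus the outcome
// of whatever probability filter the caller applies.
struct TokenPiece {
    std::string bytes;
    bool low_confidence;
};

// Number of continuation bytes still missing at the end of `s`
// (0 means `s` ends on a character boundary).
std::size_t missing_continuation_bytes(const std::string &s) {
    std::size_t n = s.size();
    std::size_t back = 0;
    // Walk back over at most 3 trailing continuation bytes to find the lead byte.
    while (back < n && back < 3 &&
           (static_cast<unsigned char>(s[n - 1 - back]) & 0xC0) == 0x80) {
        back++;
    }
    if (back == n) {
        return 0; // empty, or no lead byte in sight; nothing to wait for
    }
    unsigned char lead = static_cast<unsigned char>(s[n - 1 - back]);
    std::size_t need = 0;
    if ((lead & 0xE0) == 0xC0) need = 1;
    else if ((lead & 0xF0) == 0xE0) need = 2;
    else if ((lead & 0xF8) == 0xF0) need = 3;
    return need > back ? need - back : 0;
}

// Joins token bytes, buffering until a character boundary is reached.
// Assumed policy: a buffered group is dropped as a whole if any token
// in it was flagged low-confidence, so no sequence is ever split.
std::string join_tokens(const std::vector<TokenPiece> &tokens) {
    std::string out;
    std::string pending;
    bool drop = false;
    for (const TokenPiece &t : tokens) {
        pending += t.bytes;
        drop = drop || t.low_confidence;
        if (missing_continuation_bytes(pending) == 0) {
            if (!drop) {
                out += pending;
            }
            pending.clear();
            drop = false;
        }
    }
    return out; // an unfinished trailing group is discarded
}
```

With this, a character split across two tokens is either kept whole or dropped whole, so the joined string stays structurally valid.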

Thank you again for your help @bruvzg