Closed aiaimimi0920 closed 10 months ago
@bruvzg Do you have any idea how to resolve this to support Chinese?
utf8 is a static function, so it should be:
Dictionary cur_transcribed_msg;
cur_transcribed_msg["is_partial"] = transcribed[i].is_partial;
- String cur_text;
- cur_transcribed_msg["text"] = cur_text.utf8(transcribed[i].text.c_str());
+ cur_transcribed_msg["text"] = String::utf8(transcribed[i].text.c_str());
ret.push_back(cur_transcribed_msg);
The error is most likely correct and your string is not valid UTF-8 (it might be some unsupported variant like CESU-8). Can you provide the raw content of the source string that is causing this error?
Or something is wrong with the earlier msg.text processing: it seems to insert text into a string, probably without taking UTF-8 character sizes into account, so it might be inserting in the middle of a multi-byte sequence. Or some segments received from whisper_full_get_token_text are incorrectly skipped, which cuts out part of a sequence.
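To narrow down which of these is happening, a small structural check can report where the byte stream first stops being UTF-8. The following is only a sketch: `first_invalid_utf8` and its demo function are hypothetical names, not part of Godot or whisper.cpp, and the check looks at lead/continuation structure only (it does not reject overlong encodings or surrogates, so CESU-8 would still pass).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: returns the byte offset of the first malformed UTF-8
// sequence in `s`, or std::string::npos if the structure is valid throughout.
// Structural check only: overlong forms and surrogates are not rejected.
std::size_t first_invalid_utf8(const std::string &s) {
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t len;
        if (c < 0x80) len = 1;                // 0xxxxxxx
        else if ((c & 0xE0) == 0xC0) len = 2; // 110xxxxx
        else if ((c & 0xF0) == 0xE0) len = 3; // 1110xxxx
        else if ((c & 0xF8) == 0xF0) len = 4; // 11110xxx
        else return i;                        // stray continuation or invalid lead byte
        if (i + len > s.size()) return i;     // sequence truncated at end of string
        for (std::size_t k = 1; k < len; k++) {
            // Every following byte must look like 10xxxxxx.
            if ((static_cast<unsigned char>(s[i + k]) & 0xC0) != 0x80) return i;
        }
        i += len;
    }
    return std::string::npos;
}

// Usage sketch: two 3-byte characters (E4 BD A0, E5 A5 BD), then the same
// bytes with the second lead byte (E5) dropped, as a skipped segment might do.
void first_invalid_utf8_demo() {
    assert(first_invalid_utf8("\xE4\xBD\xA0\xE5\xA5\xBD") == std::string::npos);
    assert(first_invalid_utf8("\xE4\xBD\xA0\xA5\xBD") == 3);
}
```

Running this on the concatenated text before it is handed to String::utf8 would show whether the bytes are already broken at that point, and at which offset.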
So ideally, please provide the following for the problematic text; otherwise it's hard to deduce what's wrong and at which step:
- whisper_full_get_token_text (all separate segments).
- transcribed[i].text.
Can I output the text you need through UtilityFunctions::print? Or through another function?
UtilityFunctions::print("token", token);
UtilityFunctions::print("text", text);
UtilityFunctions::print("msg.text", msg.text);
UtilityFunctions::print("utf8-msg.text", String::utf8(msg.text.c_str()));
Can I output the text you need through UtilityFunctions::print
No, it will try to print it as a string and fail if there's something wrong, so you'll need to use a custom print function, something like:
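#include <cstdio>
#include <string>

void print_hex(const std::string &p_string) {
	for (size_t i = 0; i < p_string.size(); i++) {
		if (i > 0) printf(" ");
		// Cast to unsigned char so bytes >= 0x80 are not sign-extended.
		printf("%02X", static_cast<unsigned char>(p_string[i]));
	}
	printf("\n");
}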
OK, I will try to test it.
Thank you very much for your help. Indeed, as you said, the problem was caused by my code skipping the text of some tokens:
if (token.p > 0.6 && token.plog < -0.5) {
WARN_PRINT("Skipping token " + String::num(token.p) + " " + String::num(token.plog) + " " + text);
continue;
}
if (token.plog < -1.0) {
WARN_PRINT("Skipping token low plog " + String::num(token.p) + " " + String::num(token.plog) + " " + text);
continue;
}
It may be related to this issue, since a token does not necessarily correspond to a complete UTF-8 character: https://github.com/ggerganov/whisper.cpp/pull/1313#issuecomment-1875832602
In summary, I removed the step that skips the text of tokens based on their probability, and everything is now working properly.
Thank you again for your help @bruvzg
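For anyone who still needs to emit partial text while tokens arrive, one possible approach is to buffer the raw bytes and only convert the part that ends on a character boundary. This is a minimal sketch under that assumption; `complete_utf8_prefix` and `take_complete` are hypothetical names, not functions from whisper.cpp or Godot, and the code does not validate the bytes it passes through.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the longest prefix of `buf` that does not
// end in the middle of a multi-byte UTF-8 sequence. Malformed input is
// returned unchanged; validation is left to the caller.
std::size_t complete_utf8_prefix(const std::string &buf) {
    const std::size_t n = buf.size();
    std::size_t i = n;
    std::size_t back = 0;
    // Walk back over at most 3 trailing continuation bytes (10xxxxxx).
    while (i > 0 && back < 3 &&
           (static_cast<unsigned char>(buf[i - 1]) & 0xC0) == 0x80) {
        i--;
        back++;
    }
    if (i == 0) return n; // no lead byte found
    const unsigned char lead = static_cast<unsigned char>(buf[i - 1]);
    std::size_t need;
    if (lead < 0x80) need = 1;
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return n; // not a lead byte; do not try to repair it here
    const std::size_t have = n - (i - 1); // bytes present for the last character
    return have < need ? i - 1 : n;       // hold back an incomplete tail
}

// Appends one token's bytes to `pending` and returns the part that is safe
// to convert now; incomplete trailing bytes stay in `pending`.
std::string take_complete(std::string &pending, const std::string &piece) {
    pending += piece;
    const std::size_t cut = complete_utf8_prefix(pending);
    std::string ready = pending.substr(0, cut);
    pending.erase(0, cut);
    return ready;
}

// Usage sketch: two 3-byte characters split across three token texts.
void take_complete_demo() {
    std::string pending;
    assert(take_complete(pending, "\xE4\xBD") == "");
    assert(take_complete(pending, "\xA0\xE5") == "\xE4\xBD\xA0");
    assert(take_complete(pending, "\xA5\xBD") == "\xE5\xA5\xBD");
    assert(pending.empty());
}
```

The point of the design is that no token's bytes are ever dropped: a token is either passed on whole or held until the rest of its character arrives.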
Because I wanted to pass the text to Godot and support Chinese, I used
cur_transcribed_msg["text"] = cur_text.utf8(transcribed[i].text.c_str());
but in some cases it triggers an invalid-encoding error, like this:
Unicode parsing error, some characters were replaced with ? (U+FFFD): Invalid UTF-8 leading byte (98)
I'm not quite sure what causes it. Perhaps you can help solve this problem?
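As a reading aid for the error message above: assuming the (98) is printed in hexadecimal (decimal 98 would be the ASCII letter b, which could not be an invalid lead byte), 0x98 has the bit pattern 10011000, which is a continuation byte. That would mean the decoder reached a byte that belongs in the middle of a character at a position where a new character should start. A minimal sketch of that classification, with `utf8_byte_kind` as a hypothetical helper name:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: classifies a single byte by its structural role in
// UTF-8. It does not flag lead bytes that are disallowed for other reasons
// (for example 0xC0 and 0xC1).
std::string utf8_byte_kind(unsigned char b) {
    if (b < 0x80) return "ascii";                  // 0xxxxxxx
    if ((b & 0xC0) == 0x80) return "continuation"; // 10xxxxxx
    if ((b & 0xE0) == 0xC0) return "lead-2";       // 110xxxxx
    if ((b & 0xF0) == 0xE0) return "lead-3";       // 1110xxxx
    if ((b & 0xF8) == 0xF0) return "lead-4";       // 11110xxx
    return "invalid";                              // 11111xxx
}

// Usage sketch: the byte from the error message, and a typical lead byte of
// a 3-byte Chinese character.
void utf8_byte_kind_demo() {
    assert(utf8_byte_kind(0x98) == "continuation");
    assert(utf8_byte_kind(0xE4) == "lead-3");
}
```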