The issue stems from the fact that a token's text may not be a valid UTF-8 string. With OpenAI's tiktoken tokenizer, a Chinese character's UTF-8 encoding can be split across multiple tokens, which leads to the problem: `printf("%s", text)` then outputs a scrambled, unintelligible string.
To resolve this, I used the ICU library to check whether the token text is a valid UTF-8 string. If it is, it is printed as usual; if not, the bytes are appended to a temporary char buffer instead. The buffer is not printed until the bytes it holds form a valid UTF-8 string.
In this repo, the problem is very similar. Instead of using the ICU library, which may only be available on Linux, I found a way to do the check in pure C++, so there is no need to modify the Makefile.
Inspired by https://github.com/ggerganov/whisper.cpp/issues/399#issuecomment-1508222875
Some related issues: #109 #37