Const-me / Whisper

High-performance GPGPU inference of OpenAI's Whisper automatic speech recognition (ASR) model
Mozilla Public License 2.0
8.2k stars 702 forks

Buffer added to avoid split characters #122

Open Ovler-Young opened 1 year ago

Ovler-Young commented 1 year ago

inspired by https://github.com/ggerganov/whisper.cpp/issues/399#issuecomment-1508222875

The issue stems from the possibility that the token text may not be a valid UTF-8 string. When using OpenAI's tiktoken tokenizer, a Chinese character in UTF-8 encoding could be split across multiple tokens, which leads to the problem. In such a scenario, printf("%s", text) outputs a scrambled or unintelligible string. To resolve the issue, I use the ICU library to check whether the token text is a valid UTF-8 string. If it is, it is printed as usual; if not, the token text is appended to a temporary char buffer instead. This char buffer is not printed until the bytes in it form a valid UTF-8 string.

In this repo, the problem is very similar. Instead of using the ICU library, which might only be available on Linux, I found a way to do the check in pure C++, so there is no need to modify the makefile.
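
For readers who want to see the idea in isolation, here is a minimal sketch of the approach described above: a dependency-free UTF-8 validity check plus a buffer that holds token bytes until they form a valid string. It is not the code from this PR; the names `isValidUtf8` and `Utf8TokenBuffer` are hypothetical and chosen only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not from the PR): returns true if `s` is a complete,
// well-formed UTF-8 byte sequence. Rejects truncated sequences, stray
// continuation bytes, overlong encodings, UTF-16 surrogates, and code
// points above U+10FFFF.
bool isValidUtf8(const std::string& s) {
    // Smallest code point that legitimately needs 1, 2, 3 or 4 bytes.
    static const unsigned int minCodePoint[] = {0, 0, 0x80, 0x800, 0x10000};

    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        unsigned int cp = 0;

        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte

        if (i + len > s.size()) return false;  // character cut off mid-sequence

        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }

        if (len > 1 && cp < minCodePoint[len]) return false;  // overlong form
        if (cp > 0x10FFFF) return false;                      // out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;       // surrogate

        i += len;
    }
    return true;
}

// Hypothetical helper (not from the PR): accumulates token text and releases
// it only once the pending bytes form a valid UTF-8 string.
class Utf8TokenBuffer {
public:
    // Appends one token's bytes. Returns the text that is safe to print,
    // or an empty string if the pending bytes are still incomplete.
    std::string push(const std::string& tokenText) {
        pending_ += tokenText;
        if (!isValidUtf8(pending_)) return std::string();
        std::string out;
        out.swap(pending_);  // hand over the text and leave the buffer empty
        return out;
    }

private:
    std::string pending_;
};
```

In a token loop, the text returned by `push` would be printed in place of the raw token text. One caveat of this simple design: if the model emits bytes that can never become valid UTF-8, the buffer keeps growing and nothing more is printed, so a real implementation would probably want a size cap or a flush at segment boundaries.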

Some related issues: #109 #37