JSON Output Contains Garbled Characters for Chinese Audio Transcription

ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++

MIT License


Environment:

OS: Windows 11 (cmd with UTF-8 output)
whisper.cpp version: 1.6.0
Model: large-v3 (ggml-large-v3.bin)

Command Used:

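main.exe -m ggml-large-v3.bin -of d:\VideoInfo\80\subtitle -d 60000 -osrt -ojf -otxt d:\VideoInfo\80\80.wav -l auto --prompt "这是**简体中文**内容,每一段落尽量长,使用标点符号逗号、句号、感叹号、问号、双引号等。**不要乱码**"

(The prompt translates to: "This is **Simplified Chinese** content; keep each paragraph as long as possible and use punctuation such as commas, periods, exclamation marks, question marks, and double quotation marks. **No garbled characters**.")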

Issue:

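The .txt and .srt files are generated correctly.
The .json file contains garbled characters: in the excerpt at the end of this report the segment-level "text" is readable, but several per-token "text" fields show the replacement character (�) where Chinese characters are expected.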

Additional Details:

When using English audio, the .json, .srt, and .txt files are all generated correctly.

Steps to Reproduce:

Run the command provided with a Chinese audio file.
Check for garbled characters in the .json file.
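
The check in the last step can be automated. The following is a minimal sketch, assuming the JSON layout visible in the excerpt in this report (a "transcription" list whose entries have a "text" field and a "tokens" list of objects with their own "text"). The function names find_garbled and check_file are hypothetical helpers, not part of whisper.cpp.

```python
import json

REPLACEMENT = "\ufffd"  # U+FFFD, usually rendered as "�"


def find_garbled(doc):
    """Return (segment_index, token_index_or_None, text) for each entry containing U+FFFD.

    Assumes the layout seen in the report's excerpt:
    {"transcription": [{"text": ..., "tokens": [{"text": ...}, ...]}, ...]}
    """
    hits = []
    for s_idx, segment in enumerate(doc.get("transcription", [])):
        if REPLACEMENT in segment.get("text", ""):
            hits.append((s_idx, None, segment["text"]))
        for t_idx, token in enumerate(segment.get("tokens", [])):
            if REPLACEMENT in token.get("text", ""):
                hits.append((s_idx, t_idx, token["text"]))
    return hits


def check_file(path):
    """Scan a JSON file on disk; invalid UTF-8 bytes are mapped to U+FFFD so they are reported too."""
    with open(path, encoding="utf-8", errors="replace") as f:
        return find_garbled(json.load(f))


# Small in-memory sample modeled on the excerpt (values shortened for illustration).
sample = {
    "transcription": [
        {
            "text": "945年6月,瑞典著名",
            "tokens": [{"text": "9"}, {"text": "\ufffd"}, {"text": "著"}],
        }
    ]
}

for seg, tok, text in find_garbled(sample):
    where = f"segment {seg}" if tok is None else f"segment {seg}, token {tok}"
    print(f"garbled text at {where}: {text!r}")
```

To scan a real output file, call check_file with the path to the generated .json file; an empty result means no replacement characters were found.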

Attachments: en_subtitle.json, en_subtitle.srt.txt, 80.zip

"transcription": [
  {
    ...
    "text": "945年6月,瑞典著名犹太人建筑师马克思·甘佩尔接到了一份邀请函。",
    "tokens": [
      ...
      { "text": "9", ... },
      { "text": "45", ... },
      { "text": "年", ... },
      { "text": "6", ... },
      { "text": "月", ... },
      { "text": ",", ... },
      { "text": "�", ... },
      { "text": "�", ... },
      { "text": "�", ... },
      { "text": "�", ... },
      { "text": "著",
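
The report does not state a cause, but the excerpt is consistent with one plausible explanation: the four � tokens sit exactly where the segment text has "瑞典", two characters that take six bytes in UTF-8. If a token carries only part of a character's bytes and each token is decoded on its own, every fragment becomes a replacement character, while the concatenated bytes decode cleanly. The sketch below illustrates that effect in Python; the byte boundaries in pieces are hypothetical and are not taken from the model's vocabulary.

```python
import codecs

REPLACEMENT = "\ufffd"  # U+FFFD, usually rendered as "�"

text = "瑞典"
raw = text.encode("utf-8")  # each of the two characters occupies 3 bytes

# Hypothetical token boundaries that cut through both characters.
pieces = [raw[0:2], raw[2:3], raw[3:5], raw[5:6]]

# Decoding every piece independently: no piece is a complete character,
# so each one yields replacement characters instead of readable text.
per_piece = [p.decode("utf-8", errors="replace") for p in pieces]

# Decoding the concatenated bytes recovers the original text.
joined = b"".join(pieces).decode("utf-8")

# An incremental decoder buffers incomplete sequences across pieces and
# emits a character only once all of its bytes have arrived.
decoder = codecs.getincrementaldecoder("utf-8")()
streamed = [decoder.decode(p) for p in pieces]

print(per_piece)
print(joined)
print(streamed)
```

If this is indeed what happens, a consumer of the JSON could rely on the segment-level "text" for display, or merge adjacent token fragments at the byte level before decoding, as the incremental decoder does here.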


JSON Output Contains Garbled Characters for Chinese Audio Transcription #2180