ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License
33.21k stars 3.34k forks

JSON Output Contains Garbled Characters for Chinese Audio Transcription #2180

Open tylike opened 1 month ago

tylike commented 1 month ago

Environment:

Command Used:

main.exe -m ggml-large-v3.bin -of d:\VideoInfo\80\subtitle -d 60000 -osrt -ojf -otxt d:\VideoInfo\80\80.wav -l auto --prompt "这是**简体中文**内容,每一段落尽量长,使用标点符号逗号、句号、感叹号、问号、双引号等。**不要乱码**"

(The `--prompt` text translates to: "This is **Simplified Chinese** content; keep each paragraph as long as possible, and use punctuation such as commas, periods, exclamation marks, question marks, and double quotation marks. **No garbled characters**".)

Issue:

Additional Details:

Steps to Reproduce:

  1. Run the command provided with a Chinese audio file.
  2. Check for garbled characters in the .json file.

en_subtitle.json en_subtitle.srt.txt 80.zip

tamo commented 1 month ago

~Apparently escape_double_quotes_and_backslashes is not valid for multibyte strings.~ ~Maybe we should use a replace or replace_all function for escaping.~ ~There may be other problems too.~
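
As a side note on the struck-through guess: escaping `"` and `\` one byte at a time is generally safe for UTF-8 input, because every byte of a multi-byte sequence is 0x80 or higher and so can never equal 0x22 (`"`) or 0x5C (`\`). The sketch below illustrates this with a hypothetical helper, `escape_json_bytes`. It is an assumption made for illustration, not the actual whisper.cpp function, and it deliberately ignores control characters.

```cpp
#include <string>

// Hypothetical helper (NOT the whisper.cpp implementation): escapes '"' and
// '\\' one byte at a time and copies every other byte through unchanged.
// Bytes of multi-byte UTF-8 sequences are all >= 0x80, so they never match.
std::string escape_json_bytes(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        if (c == '"' || c == '\\') {
            out += '\\'; // prefix the special byte with a backslash
        }
        out += c;
    }
    return out;
}
```

Passing the six UTF-8 bytes of 瑞典 through this helper returns them unchanged, which fits the later conclusion that the garbling comes from somewhere other than the escaping step.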

EDIT: It seems to be a tokenizer problem. Two characters (瑞典) became four tokens:

    "transcription": [
        {
...
            "text": "945年6月,瑞典著名犹太人建筑师马克思·甘佩尔接到了一份邀请函。",
            "tokens": [
...
                {
                    "text": "9",
...
                },
                {
                    "text": "45",
...
                },
                {
                    "text": "年",
...
                },
                {
                    "text": "6",
...
                },
                {
                    "text": "月",
...
                },
                {
                    "text": ",",
...
                },
                {
                    "text": "�",
...
                },
                {
                    "text": "�",
...
                },
                {
                    "text": "�",
...
                },
                {
                    "text": "�",
...
                },
                {
                    "text": "著",

Already reported here
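
The four `�` entries are consistent with the six UTF-8 bytes of 瑞典 (E7 91 9E E5 85 B8) being split across token boundaries, so that no single token holds a complete character. One possible way for a consumer to cope is to buffer token bytes until they end on a character boundary. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the byte split used in the usage note is invented for illustration, and none of this is taken from whisper.cpp.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Length of a UTF-8 sequence judged from its lead byte, or 0 if the byte
// cannot start a sequence (for example, a continuation byte).
std::size_t utf8_expected_len(unsigned char lead) {
    if (lead < 0x80)           return 1; // ASCII
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3; // most CJK characters
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

// Number of bytes at the end of `s` that belong to an unfinished UTF-8
// sequence; 0 if `s` ends on a character boundary (or the tail is malformed).
std::size_t utf8_incomplete_tail(const std::string & s) {
    const std::size_t n = s.size();
    for (std::size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte: keep walking back to the lead byte
        }
        const std::size_t need = utf8_expected_len(c);
        return (need > back) ? back : 0;
    }
    return 0;
}

// Hypothetical helper: concatenates token byte fragments and emits a piece
// only once the buffered bytes end on a character boundary.
std::vector<std::string> merge_token_fragments(const std::vector<std::string> & tokens) {
    std::vector<std::string> out;
    std::string pending;
    for (const std::string & t : tokens) {
        pending += t;
        if (utf8_incomplete_tail(pending) == 0) {
            out.push_back(pending);
            pending.clear();
        }
    }
    if (!pending.empty()) {
        out.push_back(pending); // trailing partial bytes are kept as-is
    }
    return out;
}
```

With an assumed split such as `{"\xE7\x91", "\x9E\xE5", "\x85", "\xB8"}`, `merge_token_fragments` returns a single piece containing all six bytes, which decodes as 瑞典. The trade-off is that merged pieces no longer map one-to-one onto tokens, so any per-token metadata would have to be combined as well.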