ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Produces different output depending on whether -nt is enabled #2312

Open rexendevar opened 4 months ago

rexendevar commented 4 months ago

The transcription output differs depending on whether the no-timestamps flag (-nt) is enabled. Even on the JFK.wav sample file, commas go missing.

Janderio commented 1 month ago

I have the same problem, both on a CPU-only build and on a CUDA build. The reason is a time mismatch/shift in token association. In the following example, produced with -ojf, the word "aber" is reported at 00:02:51,940, while its real position in the audio file is 00:02:34,200. A few words later the output claims to be at 00:03:00, skips the audio between the real position and that point, and continues with a gap of missing text. This happens frequently until the end of the file.
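For scale, the size of the shift between the two positions quoted above can be worked out with a short Python sketch. The helper name `timestamp_to_ms` is hypothetical and not part of whisper.cpp; it only assumes the `HH:MM:SS,mmm` format seen in the -ojf output.

```python
def timestamp_to_ms(ts: str) -> int:
    """Convert an 'HH:MM:SS,mmm' timestamp string to milliseconds."""
    hms, millis = ts.split(",")
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return ((hours * 60 + minutes) * 60 + seconds) * 1000 + int(millis)


# Positions of the word "aber" as quoted in this comment.
reported = timestamp_to_ms("00:02:51,940")  # reported with -nt
actual = timestamp_to_ms("00:02:34,200")    # real position in the audio

shift_ms = reported - actual
print(f"shift: {shift_ms} ms ({shift_ms / 1000:.2f} s)")
```

For this one token the reported position is roughly 17.7 seconds later than the real one.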

Example with -nt

                {
                    "text": " aber",
                    "timestamps": {
                        "from": "00:02:51,940",
                        "to": "00:02:53,910"
                    },
                    "offsets": {
                        "from": 171940,
                        "to": 173910
                    },
                    "id": 4340,
                    "p": 0.998843,
                    "t_dtw": -1
                },
                {
                    "text": " nicht",
                    "timestamps": {
                        "from": "00:02:53,930",
                        "to": "00:02:56,430"
                    },
                    "offsets": {
                        "from": 173930,
                        "to": 176430
                    },
                    "id": 1979,
                    "p": 0.999786,
                    "t_dtw": -1
                },
                {
                    "text": " mehr",
                    "timestamps": {
                        "from": "00:02:56,440",
                        "to": "00:02:58,430"
                    },
                    "offsets": {
                        "from": 176440,
                        "to": 178430
                    },
                    "id": 5417,
                    "p": 0.999682,
                    "t_dtw": -1
                },
                {
                    "text": ".",
                    "timestamps": {
                        "from": "00:02:58,430",
                        "to": "00:02:59,980"
                    },
                    "offsets": {
                        "from": 178430,
                        "to": 179980
                    },
                    "id": 13,
                    "p": 0.999716,
                    "t_dtw": -1
                },
                {
                    "text": "[_EOT_]",
                    "timestamps": {
                        "from": "00:03:00,000",
                        "to": "00:03:00,000"
                    },
                    "offsets": {
                        "from": 180000,
                        "to": 180000
                    },
                    "id": 50257,
                    "p": 0.244065,
                    "t_dtw": -1
                }

Without -nt, the timestamps are correct:

                {
                    "text": " aber",
                    "timestamps": {
                        "from": "00:02:34,200",
                        "to": "00:02:34,400"
                    },
                    "offsets": {
                        "from": 154200,
                        "to": 154400
                    },
                    "id": 4340,
                    "p": 0.999746,
                    "t_dtw": -1
                },
                {
                    "text": " nicht",
                    "timestamps": {
                        "from": "00:02:34,430",
                        "to": "00:02:34,710"
                    },
                    "offsets": {
                        "from": 154430,
                        "to": 154710
                    },
                    "id": 1979,
                    "p": 0.999973,
                    "t_dtw": -1
                },
                {
                    "text": " mehr",
                    "timestamps": {
                        "from": "00:02:34,710",
                        "to": "00:02:34,920"
                    },
                    "offsets": {
                        "from": 154710,
                        "to": 154920
                    },
                    "id": 5417,
                    "p": 0.999939,
                    "t_dtw": -1
                },
                {
                    "text": ".",
                    "timestamps": {
                        "from": "00:02:34,930",
                        "to": "00:02:35,140"
                    },
                    "offsets": {
                        "from": 154930,
                        "to": 155140
                    },
                    "id": 13,
                    "p": 0.999998,
                    "t_dtw": -1
                },
                {
                    "text": "[_TT_1127]",
                    "timestamps": {
                        "from": "00:02:35,140",
                        "to": "00:02:35,140"
                    },
                    "offsets": {
                        "from": 155140,
                        "to": 155140
                    },
                    "id": 51492,
                    "p": 0.711062,
                    "t_dtw": -1
                }

(Both JSON excerpts are from the same input file. Sorry about the German-language content.)
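To make the difference between the two excerpts easier to see, here is a minimal Python sketch that pairs the tokens by position and reports the start-time drift and the token durations. The function name `compare_tokens` is hypothetical, and the token lists are trimmed copies of the entries above (only `text`, `id`, and `offsets` are kept); loading a complete -ojf file would additionally require knowing its top-level layout, which is not shown here.

```python
def compare_tokens(with_nt, without_nt):
    """Pair tokens by position; stop at the first pair whose ids differ."""
    rows = []
    for a, b in zip(with_nt, without_nt):
        if a["id"] != b["id"]:
            break  # the two outputs diverge here (e.g. [_EOT_] vs. a timestamp token)
        rows.append({
            "text": a["text"],
            "drift_ms": a["offsets"]["from"] - b["offsets"]["from"],
            "dur_with_nt_ms": a["offsets"]["to"] - a["offsets"]["from"],
            "dur_without_nt_ms": b["offsets"]["to"] - b["offsets"]["from"],
        })
    return rows


# Values copied from the excerpt produced with -nt.
with_nt = [
    {"text": " aber", "id": 4340, "offsets": {"from": 171940, "to": 173910}},
    {"text": " nicht", "id": 1979, "offsets": {"from": 173930, "to": 176430}},
    {"text": " mehr", "id": 5417, "offsets": {"from": 176440, "to": 178430}},
    {"text": ".", "id": 13, "offsets": {"from": 178430, "to": 179980}},
    {"text": "[_EOT_]", "id": 50257, "offsets": {"from": 180000, "to": 180000}},
]

# Values copied from the excerpt produced without -nt.
without_nt = [
    {"text": " aber", "id": 4340, "offsets": {"from": 154200, "to": 154400}},
    {"text": " nicht", "id": 1979, "offsets": {"from": 154430, "to": 154710}},
    {"text": " mehr", "id": 5417, "offsets": {"from": 154710, "to": 154920}},
    {"text": ".", "id": 13, "offsets": {"from": 154930, "to": 155140}},
    {"text": "[_TT_1127]", "id": 51492, "offsets": {"from": 155140, "to": 155140}},
]

rows = compare_tokens(with_nt, without_nt)
for row in rows:
    print(row)
```

On these four tokens the drift grows from 17740 ms to 23500 ms, and each token spans between 1.5 and 2.5 seconds with -nt versus 200 to 280 ms without it.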