ggerganov / whisper.cpp

Port of OpenAI's Whisper model in C/C++
MIT License

Either -dtw doesn't work as intended or I'm missing something #2148

Open magnacartatron opened 6 months ago

magnacartatron commented 6 months ago

I'm testing on large.v2. Here's the command:

./main -m models/ggml-large-v2.bin -f samples/jfk.wav -dtw large.v2 -ojf -pp -ls

Here's the JSON output. I've removed the timestamps for clarity, as they match the offsets.

{
    "text": " And",
    "offsets": {
        "from": 320,
        "to": 370
    },
    "id": 400,
    "p": 0.644984,
    "t_dtw": 56
},
{
    "text": " so",
    "offsets": {
        "from": 370,
        "to": 530
    },
    "id": 370,
    "p": 0.904659,
    "t_dtw": 90
},
{
    "text": ",",
    "offsets": {
        "from": 690,
        "to": 860
    },
    "id": 11,
    "p": 0.370488,
    "t_dtw": 108
},
{
    "text": " my",
    "offsets": {
        "from": 860,
        "to": 1110
    },
    "id": 452,
    "p": 0.900208,
    "t_dtw": 124
},
{
    "text": " fellow",
    "offsets": {
        "from": 1110,
        "to": 1850
    },
    "id": 7177,
    "p": 0.814694,
    "t_dtw": 158
},

How is one meant to interpret the t_dtw field? If I don't run with the -dtw option, it's -1. If I do, I see the numbers above. I've tried every combination I can think of to work out how t_dtw can be used, but there's no pattern. Am I missing something here? Even if it's in hundredths of a second, it still doesn't match up with the audio, and the offsets are more accurate.

jason-ni commented 2 weeks ago

Have you looked at the source code? [image]

magnacartatron commented 2 weeks ago

> Have you looked at the source code? [image]

I'm not sure how to interpret this though.
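
One possible reading, offered as an assumption rather than confirmed behavior: t_dtw may use the same 10 ms units as whisper.cpp's internal token timestamps, while the "offsets" in the JSON are in milliseconds. The Python sketch below converts the t_dtw values quoted above under that assumption and checks whether each one falls inside its token's offsets. The names DTW_UNIT_MS, dtw_to_ms and compare are made up for this illustration and are not part of whisper.cpp.

```python
# Tokens copied from the JSON output quoted in the issue (offsets in ms).
tokens = [
    {"text": " And",    "from": 320,  "to": 370,  "t_dtw": 56},
    {"text": " so",     "from": 370,  "to": 530,  "t_dtw": 90},
    {"text": ",",       "from": 690,  "to": 860,  "t_dtw": 108},
    {"text": " my",     "from": 860,  "to": 1110, "t_dtw": 124},
    {"text": " fellow", "from": 1110, "to": 1850, "t_dtw": 158},
]

# ASSUMPTION: t_dtw is expressed in units of 10 ms (hundredths of a second).
# This is not confirmed anywhere in the thread.
DTW_UNIT_MS = 10


def dtw_to_ms(t_dtw):
    """Return t_dtw in milliseconds, or None when it is -1 (DTW not run)."""
    if t_dtw < 0:
        return None
    return t_dtw * DTW_UNIT_MS


def compare(tokens):
    """Pair each token's offsets with its converted t_dtw value."""
    rows = []
    for tok in tokens:
        ms = dtw_to_ms(tok["t_dtw"])
        # True only when the converted value lies within [from, to].
        inside = ms is not None and tok["from"] <= ms <= tok["to"]
        rows.append((tok["text"], tok["from"], tok["to"], ms, inside))
    return rows


if __name__ == "__main__":
    for text, start, end, ms, inside in compare(tokens):
        print(f"{text!r:10} offsets {start:>5}-{end:<5} ms  "
              f"t_dtw {ms} ms  inside={inside}")
```

Under this assumption the converted values are 560, 900, 1080, 1240 and 1580 ms. Only " fellow" lands inside its own offsets; the other four fall after the end of theirs. That matches the observation in the original post that t_dtw does not line up with the offsets even when read as hundredths of a second, so the conversion alone does not settle whether -dtw is behaving as intended.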