ggerganov opened 1 year ago
This came up in my "explore" feed as a way to implement accurate word-level timestamps: https://github.com/m-bain/whisperX
There is a functioning implementation of the attention weights approach here: https://github.com/linto-ai/whisper-timestamped which might be a useful reference for implementing in whisper.cpp eventually.
The whisper Python module itself provides a word-timestamp output option, which could serve as a reference. I tested it with the command:
python -m whisper --model tiny --language en --word_timestamps True --output_dir "test_out" "test.wav"
It generated 5 files in the test_out folder:
test.json
test.txt
test.srt
test.vtt
test.tsv
In the test.json file, the content is:
{
"text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place. I'm going to be recognized, people are going to know why I am. Now I'm here, I'm on vacation, I'm with my family, I just want to have the money back. I just want to be a normal person, so... I'm going to go to the kitchen. I'm at this girl's girl, she seemed to hit me tight. The need to surface she was more about her perfect life. It's not the best thing, she drives the main thing. And when I'm dreaming her to scream and daddy make it. She's a gold digger lover, she got it from her mom. She's never stepfather body, all they should want. She's a gold digger lover, she's a gold digger lover. If you recognize me now, don't you? I'm the only one. So, real life, nobody really knows who the heck I am. So, I have a plan, but I gotta make myself known. I gotta do this somehow, I gotta get my name out there. She's a gold digger lover, she's a gold digger lover. She's a gold digger lover, she's a gold digger lover. Have you heard of her? You never heard of her? Oh, it's great. She won't do anything like gold in time and fame. But pop up on a mark, the birds will fade away with a ramycine. Last blow, sing, last tip of two, she could last. By the way, do you know what's going to be? Can I get caught by a goodness? I don't know. She's a gold digger lover, she's been on the cover, she's a brand. No, pop up on a mark, you just look her up, she's great. She's a gold digger lover, she's a gold digger lover, she's a gold digger lover. Thank you. Thank you. Thank you. I thought it was a party party party, can you? So, okay, New York City, you may not know me yet. And all, as I've learned, you may not have heard of Lindsey Sterling, he hit my violinist before. What? Think you're looking up for me. Think you're on the bright side. Hello, how you doing? Yes. Okay. So, subscribe to my YouTube channel. Stop it this drought. I'm just gonna be fine. 
I got some great stuff coming through away. Do. Ace that. More come. Yeah. Let me surely sign me out.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 3.6,
"end": 10.6,
"text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place.",
"tokens": [
50364, 407, 11, 510, 311, 257, 869, 2307, 295, 1873, 3609, 11, 293, 286, 5334, 586, 11, 516, 484, 1908, 307, 257, 955, 11, 5856, 1081, 13, 50914
],
"temperature": 0.0,
"avg_logprob": -0.5041744733097577,
"compression_ratio": 1.5665024630541873,
"no_speech_prob": 0.08891408145427704,
"words": [
{"word": " So,","start": 3.6,"end": 3.96,"probability": 0.5301069021224976},
{"word": " here's","start": 3.42,"end": 4.32,"probability": 0.6140210628509521},
{"word": " a","start": 4.32,"end": 4.42,"probability": 0.1545887440443039},
{"word": " great","start": 4.42,"end": 4.7,"probability": 0.6114427447319031},
{"word": " city","start": 4.7,"end": 5.08,"probability": 0.9124268293380737},
{"word": " of","start": 5.08,"end": 5.36,"probability": 0.9507943987846375},
{"word": " New","start": 5.36,"end": 5.44,"probability": 0.9982349872589111},
{"word": " York,","start": 5.44,"end": 6.18,"probability": 0.9951660633087158},
{"word": " and","start": 6.44,"end": 6.56,"probability": 0.9580233097076416},
{"word": " I","start": 6.56,"end": 6.66,"probability": 0.5875958204269409},
{"word": " realized","start": 6.66,"end": 7.02,"probability": 0.5471060872077942},
{"word": " now,","start": 7.02,"end": 7.86,"probability": 0.6020179390907288},
{"word": " going","start": 8.04,"end": 8.12,"probability": 0.7494494318962097},
{"word": " out","start": 8.12,"end": 8.38,"probability": 0.9883183240890503},
{"word": " public","start": 8.38,"end": 8.72,"probability": 0.6699197888374329},
{"word": " is","start": 8.72,"end": 8.98,"probability": 0.3241350054740906},
{"word": " a","start": 8.98,"end": 9.14,"probability": 0.7641012072563171},
{"word": " big,","start": 9.14,"end": 9.5,"probability": 0.4375719726085663},
{"word": " busy","start": 9.5,"end": 9.94,"probability": 0.6939781308174133},
{"word": " place.","start": 9.94,"end": 10.6,"probability": 0.8924348950386047}
]
},
{
"id": 1,
"seek": 0,
"start": 11.7,
"end": 15.16,
"text": " I'm going to be recognized, people are going to know why I am.",
"tokens": [
50914, 286, 478, 516, 281, 312, 9823, 11, 561, 366, 516, 281, 458, 983, 286, 669, 13, 51114
],
"temperature": 0.0,
"avg_logprob": -0.5041744733097577,
"compression_ratio": 1.5665024630541873,
"no_speech_prob": 0.08891408145427704,
"words": [
{"word": " I'm","start": 11.7,"end": 11.8,"probability": 0.980172872543335},
{"word": " going","start": 11.8,"end": 11.94,"probability": 0.32428041100502014},
{"word": " to","start": 11.94,"end": 12.04,"probability": 0.9828474521636963},
{"word": " be","start": 12.04,"end": 12.16,"probability": 0.9843984842300415},
{"word": " recognized,","start": 12.16,"end": 12.58,"probability": 0.3810001611709595},
{"word": " people","start": 13.22,"end": 13.5,"probability": 0.9561352729797363},
{"word": " are","start": 13.5,"end": 13.6,"probability": 0.9821558594703674},
{"word": " going","start": 13.6,"end": 13.78,"probability": 0.7550729513168335},
{"word": " to","start": 13.78,"end": 13.8,"probability": 0.9977655410766602},
{"word": " know","start": 13.8,"end": 14.0,"probability": 0.9933110475540161},
{"word": " why","start": 14.0,"end": 14.32,"probability": 0.7471684813499451},
{"word": " I","start": 14.32,"end": 14.58,"probability": 0.31861186027526855},
{"word": " am.","start": 14.58,"end": 15.16,"probability": 0.9440820217132568}
]
}
],
"language": "en"
}
From a practical standpoint, the JSON word-timestamp file is quite useful.
The method whisper.cpp currently uses to get per-word timestamps is pretty bad; the Python version is substantially better. I'm struggling to work out how to do it in whisper.cpp, but it seems like "whisper_exp_compute_token_level_timestamps" needs to be replaced with something similar to what's in the "timing.py" of OpenAI's implementation.
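For context, the approach in OpenAI's timing.py roughly builds a token-by-frame cost matrix from the cross-attention weights of the alignment heads and then runs DTW over it. Below is a minimal pure-Python sketch of the cost-matrix step only; the function name and exact normalization here are illustrative, not taken from either codebase, and the real implementation additionally median-filters the weights along the time axis.

```python
def attention_cost_matrix(attn):
    """attn: list of per-head matrices, each (n_tokens x n_frames) of
    softmax cross-attention weights from the alignment heads.

    Standardize each head's weights over the token axis, average the
    heads, and negate, so that high attention becomes low cost for DTW.
    """
    n_heads = len(attn)
    n_tok, n_frames = len(attn[0]), len(attn[0][0])
    cost = [[0.0] * n_frames for _ in range(n_tok)]
    for head in attn:
        for j in range(n_frames):
            col = [head[t][j] for t in range(n_tok)]
            mean = sum(col) / n_tok
            std = (sum((v - mean) ** 2 for v in col) / n_tok) ** 0.5 + 1e-8
            for t in range(n_tok):
                # accumulate the standardized, negated weight, averaged over heads
                cost[t][j] -= (head[t][j] - mean) / (std * n_heads)
    return cost
```

On a synthetic example where each head attends to frame t for token t, the minimum of each row of the resulting cost matrix falls on the diagonal, which is exactly what the subsequent DTW pass exploits.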
I'd love to help with implementing OpenAI's per-word timestamps approach based on DTW and cross-attention weights in whisper.cpp.
I think the main steps required for this consist of adding a new function to whisper.cpp (e.g. whisper_compute_word_alignment) containing the logic of this function.
Is this on the roadmap and is anyone willing to collaborate on this?
I think the roadmap is pretty open to whatever you want to contribute. I don't know of anyone else working on it.
I did take a look at trying to implement it, but found that I just don't know the inner workings of GGML and PyTorch well enough to build something that won't be a total mess. I'm definitely willing to collaborate on it, but I'm not sure how much use I can be.
Would be great to implement this in whisper.cpp and I want to give it a try, but I won't be able to work on this anytime soon as there are more things with higher priority in llama.cpp. If anyone is interested, please open a PR and we can discuss the implementation.
From what I remember, DTW is a dynamic programming algorithm and its implementation should be part of whisper.cpp. It can be implemented as a first step, with some unit tests to make sure it works correctly.
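As a reference for what that first step could look like before porting to C++, here is a minimal pure-Python sketch of the DTW recurrence and backtrace (the function name is illustrative; OpenAI's timing.py implements the same recurrence):

```python
import math

def dtw_path(cost):
    """cost: 2-D list (n_tokens x n_frames). Returns the lowest-cost
    monotonic path through the matrix as (row, col) pairs, allowing
    diagonal, down and right steps -- the classic DTW recurrence."""
    n, m = len(cost), len(cost[0])
    # accumulated cost, padded with infinities along the top/left borders
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # backtrace from the bottom-right corner to (1, 1)
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, (i, j) = min(
            (acc[i - 1][j - 1], (i - 1, j - 1)),
            (acc[i - 1][j], (i - 1, j)),
            (acc[i][j - 1], (i, j - 1)),
        )
        path.append((i - 1, j - 1))
    return path[::-1]
```

For word timestamps, the rows would be text tokens and the columns audio frames, so the path gives each token a frame range. A unit test can simply check that a matrix with zeros on the diagonal yields the diagonal path.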
I would like to try my hand at this, would you be willing to offer me some guidance @ggerganov ?
I'll probably start as suggested, implementing the DTW algorithm (in the whisper.cpp file, correct?) and some tests (maybe a dtw.cpp in the tests folder? I'm open to suggestions). I'll create a PR as soon as I have DTW figured out so we can go from there.
What I will probably need help figuring out is the information collection. Two points in particular trouble me:
We need to retrieve the output of the decoder cross-attention layers. How hard would it be to cache these outputs when executing inference (e.g. saving them on whisper_state) so they could be used when computing our timestamps, e.g. in some whisper_compute_token_level_timestamps function run at the end of whisper_full_with_state?
In the original OpenAI implementation they have hard-coded boolean arrays for each model size that indicate which cross-attention heads are highly correlated with timing (i.e. the alignment heads). Apparently these are the only cross-attention outputs actually used when computing the DTW alignment:
# base85-encoded (n_layers, n_heads) boolean arrays indicating the cross-attention heads that are
# highly correlated to the word-level timing, i.e. the alignment between audio and text tokens.
_ALIGNMENT_HEADS = {
"tiny.en": b"ABzY8J1N>@0{>%R00Bk>$p{7v037`oCl~+#00",
"tiny": b"ABzY8bu8Lr0{>%RKn9Fp%m@SkK7Kt=7ytkO",
"base.en": b"ABzY8;40c<0{>%RzzG;p*o+Vo09|#PsxSZm00",
"base": b"ABzY8KQ!870{>%RzyTQH3`Q^yNP!>##QT-<FaQ7m",
"small.en": b"ABzY8>?_)10{>%RpeA61k&I|OI3I$65C{;;pbCHh0B{qLQ;+}v00",
"small": b"ABzY8DmU6=0{>%Rpa?J`kvJ6qF(V^F86#Xh7JUGMK}P<N0000",
"medium.en": b"ABzY8usPae0{>%R7<zz_OvQ{)4kMa0BMw6u5rT}kRKX;$NfYBv00*Hl@qhsU00",
"medium": b"ABzY8B0Jh+0{>%R7}kK1fFL7w6%<-Pf*t^=N)Qr&0RR9",
"large-v1": b"ABzY8r9j$a0{>%R7#4sLmoOs{s)o3~84-RPdcFk!JR<kSfC2yj",
"large-v2": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
"large": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
}
Considering the conversion from PyTorch to ggml, would these indices still point to the same attention heads?
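For reference, the Python package decodes these strings with base85 plus gzip into a flat boolean array and reshapes it to (n_text_layers, n_text_heads); the indices identify decoder layer/head positions, so as long as the ggml conversion preserves layer and head ordering they should refer to the same heads. A stdlib-only sketch of the decoding for the tiny model (which has 4 decoder layers and 6 heads; the helper name is made up for illustration):

```python
import base64
import gzip

# base85 string for "tiny", copied from the _ALIGNMENT_HEADS table above
TINY_HEADS = b"ABzY8bu8Lr0{>%RKn9Fp%m@SkK7Kt=7ytkO"

def decode_alignment_heads(data, n_layers, n_heads):
    """Decode a base85+gzip boolean mask into (layer, head) index pairs."""
    raw = gzip.decompress(base64.b85decode(data))
    assert len(raw) == n_layers * n_heads  # one byte per boolean flag
    return [(i // n_heads, i % n_heads) for i, flag in enumerate(raw) if flag]

# (layer, head) pairs of the alignment heads for the tiny model
alignment_heads = decode_alignment_heads(TINY_HEADS, 4, 6)
```

A whisper.cpp port would only need the decoded pairs, so these tables could even be pre-decoded into plain integer lists at build time rather than shipping a base85/gzip decoder.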
Now that #1485 -- great work @denersc! -- has merged, seems like it would be prudent to summarize outstanding tasks needed to close this issue.
See notebook, section "Word-level timestamps using attention weights":
https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb