ggerganov opened 1 year ago
This came up in my "explore" feed as a way to implement accurate word-level timestamps: https://github.com/m-bain/whisperX
There is a functioning implementation of the attention weights approach here: https://github.com/linto-ai/whisper-timestamped which might be a useful reference for implementing in whisper.cpp eventually.
The whisper Python module itself provides a word-timestamp output option, which could serve as a reference. I tested it with the command:
python -m whisper --model tiny --language en --word_timestamps True --output_dir "test_out" "test.wav"
It generated 5 files in the test_out folder:
test.json
test.txt
test.srt
test.vtt
test.tsv
In the test.json file, the content is:
{
"text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place. I'm going to be recognized, people are going to know why I am. Now I'm here, I'm on vacation, I'm with my family, I just want to have the money back. I just want to be a normal person, so... I'm going to go to the kitchen. I'm at this girl's girl, she seemed to hit me tight. The need to surface she was more about her perfect life. It's not the best thing, she drives the main thing. And when I'm dreaming her to scream and daddy make it. She's a gold digger lover, she got it from her mom. She's never stepfather body, all they should want. She's a gold digger lover, she's a gold digger lover. If you recognize me now, don't you? I'm the only one. So, real life, nobody really knows who the heck I am. So, I have a plan, but I gotta make myself known. I gotta do this somehow, I gotta get my name out there. She's a gold digger lover, she's a gold digger lover. She's a gold digger lover, she's a gold digger lover. Have you heard of her? You never heard of her? Oh, it's great. She won't do anything like gold in time and fame. But pop up on a mark, the birds will fade away with a ramycine. Last blow, sing, last tip of two, she could last. By the way, do you know what's going to be? Can I get caught by a goodness? I don't know. She's a gold digger lover, she's been on the cover, she's a brand. No, pop up on a mark, you just look her up, she's great. She's a gold digger lover, she's a gold digger lover, she's a gold digger lover. Thank you. Thank you. Thank you. I thought it was a party party party, can you? So, okay, New York City, you may not know me yet. And all, as I've learned, you may not have heard of Lindsey Sterling, he hit my violinist before. What? Think you're looking up for me. Think you're on the bright side. Hello, how you doing? Yes. Okay. So, subscribe to my YouTube channel. Stop it this drought. I'm just gonna be fine. 
I got some great stuff coming through away. Do. Ace that. More come. Yeah. Let me surely sign me out.",
"segments": [
{
"id": 0,
"seek": 0,
"start": 3.6,
"end": 10.6,
"text": " So, here's a great city of New York, and I realized now, going out public is a big, busy place.",
"tokens": [
50364, 407, 11, 510, 311, 257, 869, 2307, 295, 1873, 3609, 11, 293, 286, 5334, 586, 11, 516, 484, 1908, 307, 257, 955, 11, 5856, 1081, 13, 50914
],
"temperature": 0.0,
"avg_logprob": -0.5041744733097577,
"compression_ratio": 1.5665024630541873,
"no_speech_prob": 0.08891408145427704,
"words": [
{"word": " So,","start": 3.6,"end": 3.96,"probability": 0.5301069021224976},
{"word": " here's","start": 3.42,"end": 4.32,"probability": 0.6140210628509521},
{"word": " a","start": 4.32,"end": 4.42,"probability": 0.1545887440443039},
{"word": " great","start": 4.42,"end": 4.7,"probability": 0.6114427447319031},
{"word": " city","start": 4.7,"end": 5.08,"probability": 0.9124268293380737},
{"word": " of","start": 5.08,"end": 5.36,"probability": 0.9507943987846375},
{"word": " New","start": 5.36,"end": 5.44,"probability": 0.9982349872589111},
{"word": " York,","start": 5.44,"end": 6.18,"probability": 0.9951660633087158},
{"word": " and","start": 6.44,"end": 6.56,"probability": 0.9580233097076416},
{"word": " I","start": 6.56,"end": 6.66,"probability": 0.5875958204269409},
{"word": " realized","start": 6.66,"end": 7.02,"probability": 0.5471060872077942},
{"word": " now,","start": 7.02,"end": 7.86,"probability": 0.6020179390907288},
{"word": " going","start": 8.04,"end": 8.12,"probability": 0.7494494318962097},
{"word": " out","start": 8.12,"end": 8.38,"probability": 0.9883183240890503},
{"word": " public","start": 8.38,"end": 8.72,"probability": 0.6699197888374329},
{"word": " is","start": 8.72,"end": 8.98,"probability": 0.3241350054740906},
{"word": " a","start": 8.98,"end": 9.14,"probability": 0.7641012072563171},
{"word": " big,","start": 9.14,"end": 9.5,"probability": 0.4375719726085663},
{"word": " busy","start": 9.5,"end": 9.94,"probability": 0.6939781308174133},
{"word": " place.","start": 9.94,"end": 10.6,"probability": 0.8924348950386047}
]
},
{
"id": 1,
"seek": 0,
"start": 11.7,
"end": 15.16,
"text": " I'm going to be recognized, people are going to know why I am.",
"tokens": [
50914, 286, 478, 516, 281, 312, 9823, 11, 561, 366, 516, 281, 458, 983, 286, 669, 13, 51114
],
"temperature": 0.0,
"avg_logprob": -0.5041744733097577,
"compression_ratio": 1.5665024630541873,
"no_speech_prob": 0.08891408145427704,
"words": [
{"word": " I'm","start": 11.7,"end": 11.8,"probability": 0.980172872543335},
{"word": " going","start": 11.8,"end": 11.94,"probability": 0.32428041100502014},
{"word": " to","start": 11.94,"end": 12.04,"probability": 0.9828474521636963},
{"word": " be","start": 12.04,"end": 12.16,"probability": 0.9843984842300415},
{"word": " recognized,","start": 12.16,"end": 12.58,"probability": 0.3810001611709595},
{"word": " people","start": 13.22,"end": 13.5,"probability": 0.9561352729797363},
{"word": " are","start": 13.5,"end": 13.6,"probability": 0.9821558594703674},
{"word": " going","start": 13.6,"end": 13.78,"probability": 0.7550729513168335},
{"word": " to","start": 13.78,"end": 13.8,"probability": 0.9977655410766602},
{"word": " know","start": 13.8,"end": 14.0,"probability": 0.9933110475540161},
{"word": " why","start": 14.0,"end": 14.32,"probability": 0.7471684813499451},
{"word": " I","start": 14.32,"end": 14.58,"probability": 0.31861186027526855},
{"word": " am.","start": 14.58,"end": 15.16,"probability": 0.9440820217132568}
]
}
],
"language": "en"
}
From a practical standpoint, the JSON word-timestamp file is quite useful.
The method whisper.cpp currently uses to get per-word timestamps is pretty bad; the Python version is substantially better. I'm struggling to work out how to do it in whisper.cpp, but it seems like "whisper_exp_compute_token_level_timestamps" needs to be replaced with something similar to what's in the "timing.py" of OpenAI's implementation.
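For context, the approach in OpenAI's timing.py roughly builds a token-by-frame cost matrix from the cross-attention weights of the alignment heads and then runs DTW over it. Below is a minimal pure-Python sketch of the cost-matrix step only; the function name and exact normalization here are illustrative, not taken from either codebase, and the real implementation additionally median-filters the weights along the time axis.

```python
def attention_cost_matrix(attn):
    """attn: list of per-head matrices, each (n_tokens x n_frames) of
    softmax cross-attention weights from the alignment heads.

    Standardize each head's weights over the token axis, average the
    heads, and negate, so that high attention becomes low cost for DTW.
    """
    n_heads = len(attn)
    n_tok, n_frames = len(attn[0]), len(attn[0][0])
    cost = [[0.0] * n_frames for _ in range(n_tok)]
    for head in attn:
        for j in range(n_frames):
            col = [head[t][j] for t in range(n_tok)]
            mean = sum(col) / n_tok
            std = (sum((v - mean) ** 2 for v in col) / n_tok) ** 0.5 + 1e-8
            for t in range(n_tok):
                # accumulate the standardized, negated weight, averaged over heads
                cost[t][j] -= (head[t][j] - mean) / (std * n_heads)
    return cost
```

On a synthetic example where each head attends to frame t for token t, the minimum of each row of the resulting cost matrix falls on the diagonal, which is exactly what the subsequent DTW pass exploits.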
I'd love to help with implementing OpenAI's per-word timestamps approach based on DTW and cross-attention weights in whisper.cpp.
I think the main steps required for this consist of adding a new function to whisper.cpp (e.g. whisper_compute_word_alignment) containing the logic of this function.
Is this on the roadmap and is anyone willing to collaborate on this?
I think the roadmap is pretty open to whatever you want to contribute. I don't know of anyone else working on it.
I did take a look at trying to implement it, but found that I just don't know the inner workings of GGML and PyTorch well enough to build something that won't be a total mess. I'm definitely willing to collaborate on it, but I'm not sure how much use I can be.
Would be great to implement this in whisper.cpp and I want to give it a try, but I won't be able to work on this anytime soon as there are more things with higher priority in llama.cpp. If anyone is interested, please open a PR and we can discuss the implementation.
From what I remember, DTW is a dynamic programming algorithm and its implementation should be part of whisper.cpp. It can be implemented as a first step, with some unit tests to make sure it works correctly.
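As a reference for what that first step could look like before porting to C++, here is a minimal pure-Python sketch of the DTW recurrence and backtrace (the function name is illustrative; OpenAI's timing.py implements the same recurrence):

```python
import math

def dtw_path(cost):
    """cost: 2-D list (n_tokens x n_frames). Returns the lowest-cost
    monotonic path through the matrix as (row, col) pairs, allowing
    diagonal, down and right steps -- the classic DTW recurrence."""
    n, m = len(cost), len(cost[0])
    # accumulated cost, padded with infinities along the top/left borders
    acc = [[math.inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i][j] = cost[i - 1][j - 1] + min(
                acc[i - 1][j - 1], acc[i - 1][j], acc[i][j - 1]
            )
    # backtrace from the bottom-right corner to (1, 1)
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, (i, j) = min(
            (acc[i - 1][j - 1], (i - 1, j - 1)),
            (acc[i - 1][j], (i - 1, j)),
            (acc[i][j - 1], (i, j - 1)),
        )
        path.append((i - 1, j - 1))
    return path[::-1]
```

For word timestamps, the rows would be text tokens and the columns audio frames, so the path gives each token a frame range. A unit test can simply check that a matrix with zeros on the diagonal yields the diagonal path.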
I would like to try my hand at this, would you be willing to offer me some guidance @ggerganov ?
I'll probably start as suggested, implementing the DTW algorithm (in the whisper.cpp file, correct?) and some tests (maybe a dtw.cpp in the tests folder? I'm open to suggestions). I'll create a PR as soon as I have DTW figured out so we can go from there.
What I will probably need help figuring out is the information collection. Two points in particular trouble me:
We need to retrieve the output of the decoder cross-attention layers. How hard would it be to cache these outputs when executing inference (e.g. saving them on whisper_state) so they could be used when computing our timestamps, e.g. in some whisper_compute_token_level_timestamps function run at the end of whisper_full_with_state?
In the original OpenAI implementation they have hard-coded boolean arrays for each model size that indicate which cross-attention heads are highly correlated with timing (i.e. the alignment heads). Apparently these are the only cross-attention outputs actually used when computing the DTW alignment:
# base85-encoded (n_layers, n_heads) boolean arrays indicating the cross-attention heads that are
# highly correlated to the word-level timing, i.e. the alignment between audio and text tokens.
_ALIGNMENT_HEADS = {
"tiny.en": b"ABzY8J1N>@0{>%R00Bk>$p{7v037`oCl~+#00",
"tiny": b"ABzY8bu8Lr0{>%RKn9Fp%m@SkK7Kt=7ytkO",
"base.en": b"ABzY8;40c<0{>%RzzG;p*o+Vo09|#PsxSZm00",
"base": b"ABzY8KQ!870{>%RzyTQH3`Q^yNP!>##QT-<FaQ7m",
"small.en": b"ABzY8>?_)10{>%RpeA61k&I|OI3I$65C{;;pbCHh0B{qLQ;+}v00",
"small": b"ABzY8DmU6=0{>%Rpa?J`kvJ6qF(V^F86#Xh7JUGMK}P<N0000",
"medium.en": b"ABzY8usPae0{>%R7<zz_OvQ{)4kMa0BMw6u5rT}kRKX;$NfYBv00*Hl@qhsU00",
"medium": b"ABzY8B0Jh+0{>%R7}kK1fFL7w6%<-Pf*t^=N)Qr&0RR9",
"large-v1": b"ABzY8r9j$a0{>%R7#4sLmoOs{s)o3~84-RPdcFk!JR<kSfC2yj",
"large-v2": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
"large": b"ABzY8zd+h!0{>%R7=D0pU<_bnWW*tkYAhobTNnu$jnkEkXqp)j;w1Tzk)UH3X%SZd&fFZ2fC2yj",
}
Considering the conversion from PyTorch to ggml, would these indices still point to the same attention heads?
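For reference, the Python package decodes these strings with base85 plus gzip into a flat boolean array and reshapes it to (n_text_layers, n_text_heads); the indices identify decoder layer/head positions, so as long as the ggml conversion preserves layer and head ordering they should refer to the same heads. A stdlib-only sketch of the decoding for the tiny model (which has 4 decoder layers and 6 heads; the helper name is made up for illustration):

```python
import base64
import gzip

# base85 string for "tiny", copied from the _ALIGNMENT_HEADS table above
TINY_HEADS = b"ABzY8bu8Lr0{>%RKn9Fp%m@SkK7Kt=7ytkO"

def decode_alignment_heads(data, n_layers, n_heads):
    """Decode a base85+gzip boolean mask into (layer, head) index pairs."""
    raw = gzip.decompress(base64.b85decode(data))
    assert len(raw) == n_layers * n_heads  # one byte per boolean flag
    return [(i // n_heads, i % n_heads) for i, flag in enumerate(raw) if flag]

# (layer, head) pairs of the alignment heads for the tiny model
alignment_heads = decode_alignment_heads(TINY_HEADS, 4, 6)
```

A whisper.cpp port would only need the decoded pairs, so these tables could even be pre-decoded into plain integer lists at build time rather than shipping a base85/gzip decoder.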
Now that #1485 -- great work @denersc! -- has merged, seems like it would be prudent to summarize outstanding tasks needed to close this issue.
See notebook, section "Word-level timestamps using attention weights":
https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb