argmaxinc / WhisperKit

On-device Speech Recognition for Apple Silicon
https://takeargmax.com/blog/whisperkit
MIT License
3.17k stars, 268 forks

Ability to pass prompt text when transcribing #75

Closed stasbel closed 6 months ago

stasbel commented 6 months ago

I need this for real-time use cases such as the one in https://arxiv.org/pdf/2307.14743.pdf. P.S. Great project!

ZachNagengast commented 6 months ago

You read our minds 🤯 we're actively working on a similar implementation of this paper via #59, stay tuned!

atiorh commented 6 months ago

@stasbel Let us know if you would like to collaborate with us on this. Sounds like we are working on similar things :)

stasbel commented 6 months ago

I am eager to try implementing the paper. For now, my ask is just to have prompt: String as a convenient argument for the decoding func :)

ZachNagengast commented 6 months ago

Also part of this issue #53

stasbel commented 6 months ago

@ZachNagengast @atiorh FYI: I started testing more thoroughly and two problems came up:

  1. Some word timestamps are off: there are cases where word.end > len(audio) / sr or word.end < word.start. Would you consider adding tests or asserts such as all(0 <= w.start <= w.end <= (len(audio) / sr) for w in words)? It is easier to work with that kind of assumption, and it helps catch bugs.

  2. Is there a way to suppress or filter out non-speech words or tokens? Examples: [BLANK_AUDIO], (speaking in foreign language), or similar. The reason is that I want only pure words to work with.
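
The invariant in item 1 can be checked with a few lines of Python. This is a minimal sketch, not WhisperKit code: the `Word` class and `find_invalid_words` helper are hypothetical names introduced for illustration, and timestamps are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container for illustration; not WhisperKit's word-timing type.
    text: str
    start: float  # seconds
    end: float    # seconds


def find_invalid_words(words, num_samples, sample_rate):
    """Return the words that violate 0 <= start <= end <= audio duration."""
    duration = num_samples / sample_rate
    return [w for w in words if not (0 <= w.start <= w.end <= duration)]


words = [
    Word("hello", 0.50, 0.90),
    Word("I...", 4.10, 4.05),    # ends before it starts
    Word("late", 29.00, 31.00),  # ends after the 30 s clip
]
# 480,000 samples at 16 kHz is a 30-second clip.
bad = find_invalid_words(words, num_samples=480_000, sample_rate=16_000)
print([w.text for w in bad])
```

Returning the offending words instead of asserting makes it possible to log them during testing and still fail hard in a unit test with `assert not bad`.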

ZachNagengast commented 6 months ago

Apologies, I missed this response. We have the prompt parameter coming early next week, as well as a demo implementation of the streaming logic. We also have improved word accuracy tests with this, but I'm curious: do you have an example we can test where the word timestamps are off? Ideally one where the same timestamps are correct in the Python implementation from OpenAI.

We do have the suppress tokens logits filter, but I don't think that would do what you want in regard to suppressing sequences of tokens - if you're aware of any approaches that can do this do let us know.

stasbel commented 6 months ago

Thanks for the response! I will wait for the prompt param then.

Regarding timestamps: I will try to capture and share specific audio for reproducing. So far I just filter them out. From my testing and logging: 1) almost all [BLANK and _AUDIO] words have incorrect timestamps; 2) some short words of 2-4 characters can be incorrect as well. Some examples from my logging, in the format [start]text[end]{prob}:

invalid word: [29.46]_AUDIO][4.59]{100%}
invalid word: [28.26]And...[1.45]{5%}
invalid word: [4.10]I...[4.05]{0%}
invalid word: [28.50][BLANK[1.25]{28%}
invalid word: [29.46]_AUDIO][1.25]{99%}
invalid word: [2.42]test.[2.38]{0%}
invalid word: [28.50][BLANK[1.56]{38%}
invalid word: [29.46]_AUDIO][1.56]{99%}
invalid word: [28.50][BLANK[2.13]{30%}
invalid word: [29.46]_AUDIO][2.13]{99%}
invalid word: [28.50][BLANK[2.71]{30%}
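
Log lines in this format can be parsed back into fields to check which constraint each word breaks. The sketch below assumes the layout is exactly `invalid word: [start]text[end]{prob%}` as in the lines above; the `parse_line` helper is a hypothetical name.

```python
import re

# start and end are decimal seconds; text may itself contain brackets,
# so the greedy middle group relies on the anchored "[end]{prob%}" tail.
LINE = re.compile(r"^invalid word: \[(\d+\.\d+)\](.*)\[(\d+\.\d+)\]\{(\d+)%\}$")


def parse_line(line):
    """Return (start, text, end, prob_percent), or None if the line does not match."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    start, text, end, prob = m.groups()
    return float(start), text, float(end), int(prob)


sample = [
    "invalid word: [29.46]_AUDIO][4.59]{100%}",
    "invalid word: [4.10]I...[4.05]{0%}",
    "invalid word: [28.50][BLANK[1.25]{28%}",
]
parsed = [parse_line(s) for s in sample]
for start, text, end, prob in parsed:
    print(f"{text!r}: start={start} end={end} end_before_start={end < start}")
```

For these three sample lines the end time is smaller than the start time in every case.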

Regarding suppressing tokens: okay, I will look into the OpenAI implementation. I don't remember ever seeing (speaking foreign language) or (door bell) from them, locally or via the API, so it should be something simple.

ZachNagengast commented 6 months ago

@stasbel Were you ever able to capture any [BLANK_AUDIO] examples?

ZachNagengast commented 6 months ago

@stasbel Along with the prompting, there was also an issue with punctuation merging that got fixed with #95, so now words like [BLANK_AUDIO] will all be merged together and easier to filter out
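
Once annotations arrive as single merged words, they can be dropped with a simple post-processing filter. The following is a minimal sketch that works on plain word strings, not a WhisperKit API or a logits filter; `drop_annotations` is a hypothetical helper. It also removes any genuine word that is fully wrapped in brackets or parentheses.

```python
import re

# Matches a word that consists entirely of one [bracketed] or (parenthesized)
# annotation, e.g. "[BLANK_AUDIO]" or "(speaking in foreign language)".
ANNOTATION = re.compile(r"^\s*(\[[^\[\]]*\]|\([^()]*\))\s*$")


def drop_annotations(words):
    """Keep only the words that are not a standalone annotation."""
    return [w for w in words if not ANNOTATION.match(w)]


merged = ["Hello", "[BLANK_AUDIO]", "world.", "(speaking in foreign language)"]
print(drop_annotations(merged))

# Unmerged fragments are not caught, which is why the merging matters here.
split = ["[BLANK", "_AUDIO]"]
print(drop_annotations(split))
```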