stasbel closed this issue 6 months ago
You read our minds 🤯 we're actively working on a similar implementation of this paper via #59, stay tuned!
@stasbel Let us know if you would like to collaborate with us on this. Sounds like we are working on similar things :)
I am eager to try to implement the paper
basically, for now, my ask was to have prompt: String as a convenient argument for the decoding func :)
Also part of this issue #53
@ZachNagengast @atiorh fyi: I started testing more thoroughly and two problems came up:
1) some word timestamps are off: there are cases where word.end > len(audio) / sr or word.end < word.start. would you consider adding some kind of tests/asserts like all(0 <= w.start <= w.end <= (len(audio) / sr) for w in words)? it's just easier to work with that kind of assumption and catch bugs
2) is there a way to suppress/filter out non-speech words or tokens? examples: [BLANK_AUDIO] or (speaking in foreign language) or similar; the reason is I want pure words to work with
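For reference, the timestamp invariant quoted above can be turned into a small standalone check. This is only a sketch in Python: the Word class, the invalid_words helper, and the sample values are hypothetical and are not part of the project's API; only the w.start / w.end field names come from the assert above.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical container; only start/end mirror the names used in the thread.
    text: str
    start: float  # seconds
    end: float    # seconds


def invalid_words(words, num_samples, sample_rate):
    """Return the words that violate 0 <= start <= end <= audio duration."""
    duration = num_samples / sample_rate  # audio length in seconds
    return [w for w in words if not (0 <= w.start <= w.end <= duration)]


# Made-up sample data for a 30 s clip at 16 kHz.
words = [
    Word("test.", 2.42, 2.38),      # end before start
    Word("hello", 0.50, 0.90),      # consistent
    Word("_AUDIO]", 29.46, 31.00),  # end past the end of the clip
]
bad = invalid_words(words, num_samples=480_000, sample_rate=16_000)
for w in bad:
    print(f"invalid word: [{w.start:.2f}]{w.text}[{w.end:.2f}]")
```

Returning the offending words, rather than asserting, makes it easier to log them or filter them out while the underlying bug is still open.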
Apologies, I missed this response. We have the prompt parameter coming in early next week, as well as a demo implementation of the streaming logic. We also have improved word accuracy tests with this, but I'm curious: do you have an example we can test where the word timestamps are off? Ideally one where the same timestamps are correct via the python implementation from openai.
We do have the suppress tokens logits filter, but I don't think that would do what you want in regard to suppressing sequences of tokens - if you're aware of any approaches that can do this do let us know.
thx for the response! will wait for prompt param then
regarding timestamps: I will try to capture and share specific audio for reproducing, so far I just filter them out
as I see now from testing and logging: 1) almost all [BLANK and _AUDIO] words are incorrect 2) some short words of 2-4 chars could be incorrect as well
some examples from my logging, in the format [start]text[end]{prob}:
invalid word: [29.46]_AUDIO][4.59]{100%}
invalid word: [28.26]And...[1.45]{5%}
invalid word: [4.10]I...[4.05]{0%}
invalid word: [28.50][BLANK[1.25]{28%}
invalid word: [29.46]_AUDIO][1.25]{99%}
invalid word: [2.42]test.[2.38]{0%}
invalid word: [28.50][BLANK[1.56]{38%}
invalid word: [29.46]_AUDIO][1.56]{99%}
invalid word: [28.50][BLANK[2.13]{30%}
invalid word: [29.46]_AUDIO][2.13]{99%}
invalid word: [28.50][BLANK[2.71]{30%}
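In case it helps with reproducing, lines in this [start]text[end]{prob} format can be parsed back into numbers. A rough Python sketch follows; the regex and the parse_line helper are assumptions made for illustration, and the sample lines are copied from the log above.

```python
import re

# Assumed pattern for "invalid word: [start]text[end]{prob%}".
# The text group is greedy so that brackets inside the word
# (e.g. "[BLANK" or "_AUDIO]") stay part of the text.
LINE_RE = re.compile(
    r"^invalid word: \[(\d+\.\d+)\](.*)\[(\d+\.\d+)\]\{(\d+)%\}$"
)


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    start, text, end, prob = m.groups()
    return {
        "start": float(start),
        "text": text,
        "end": float(end),
        "prob_pct": int(prob),
    }


log = """\
invalid word: [29.46]_AUDIO][4.59]{100%}
invalid word: [28.26]And...[1.45]{5%}
invalid word: [28.50][BLANK[1.25]{28%}
invalid word: [2.42]test.[2.38]{0%}
"""

entries = [parse_line(line) for line in log.splitlines()]
# Entries whose end timestamp precedes their start timestamp.
reversed_spans = [e for e in entries if e and e["end"] < e["start"]]
print(len(reversed_spans), "of", len(entries), "entries have end < start")
```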
regarding suppressing tokens: okay, will look into the openai implementation; I don't remember ever seeing (speaking foreign language) or (door bell) from them, locally or via the API, so it should be something simple
@stasbel Were you ever able to capture any [BLANK_AUDIO] examples?
@stasbel Along with the prompting, there was also an issue with punctuation merging that got fixed with #95, so now words like [BLANK_AUDIO] will all be merged together and easier to filter out
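Once such annotations arrive as single merged words, filtering them can be a simple pattern match on the word text. A minimal Python sketch, assuming each annotation is one word fully wrapped in [...] or (...); the keep_speech helper and the sample list are hypothetical, not part of the library.

```python
import re

# Matches a word that consists entirely of one bracketed or
# parenthesized annotation, ignoring surrounding whitespace.
NON_SPEECH_RE = re.compile(r"^\s*(\[[^\]]*\]|\([^)]*\))\s*$")


def keep_speech(words):
    """Drop words that are entirely a [..] or (..) annotation."""
    return [w for w in words if not NON_SPEECH_RE.match(w)]


words = [
    "Hello",
    "[BLANK_AUDIO]",
    "world.",
    "(speaking in foreign language)",
    "(door bell)",
]
print(keep_speech(words))
```

Unmerged fragments such as [BLANK and _AUDIO] are not caught by this pattern, which is why the merging matters for this approach.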
need it for real-time stuff like: https://arxiv.org/pdf/2307.14743.pdf p.s. great project!