bugbakery / transcribee

open source audio and video transcription software
https://transcribee.net
GNU Affero General Public License v3.0

Non speech text #25

Open rroohhh opened 1 year ago

rroohhh commented 1 year ago

There are several non-speech segments generated by whisper. Broadly, these come in three classes:

- punctuation tokens
- whisper special tokens (timestamps and markers like `<|startoftranscript|>`), which have dedicated token IDs
- non-speech annotations generated as normal transcript text (e.g. `[Musik]`)

The first two classes are easy to detect and, if necessary for some processing step, to split out. Punctuation is sampled from a limited set of characters (like `.`, `,`, `-`, ...), and the special tokens have specific token IDs that can be filtered out.
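A minimal sketch of filtering those first two classes. The token structure (dicts with `id` and `text`), the special-token ID threshold, and the punctuation set are assumptions for illustration, not transcribee's actual data model:

```python
# Hypothetical filter for the first two classes: whisper special tokens
# (assumed here to occupy IDs at or above a threshold) and tokens whose
# text is pure punctuation. Threshold value is an assumption.
PUNCTUATION = set(".,;:!?-…")

def is_punctuation(text: str) -> bool:
    stripped = text.strip()
    return bool(stripped) and all(ch in PUNCTUATION for ch in stripped)

def filter_tokens(tokens, special_id_start=50257):
    """Keep only tokens that are neither special nor pure punctuation."""
    return [
        t for t in tokens
        if t["id"] < special_id_start and not is_punctuation(t["text"])
    ]

tokens = [
    {"id": 50258, "text": "<|startoftranscript|>"},  # special token, dropped
    {"id": 1012, "text": "Hello"},                   # kept
    {"id": 11, "text": ","},                         # punctuation, dropped
    {"id": 995, "text": " world"},                   # kept
]
print(filter_tokens(tokens))
```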

The third class, however, is generated just like "normal" transcript text. The official whisper implementation seems to have some heuristics to filter them: https://github.com/openai/whisper/blob/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/tokenizer.py#L237 However, this looks like a fairly basic heuristic that will not work for *Musik*, for example.
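One possible text-level heuristic for this third class (an assumption sketched here, not transcribee's or whisper's actual code): treat a segment as non-speech if its entire text is wrapped in brackets, parentheses, or asterisks, which would also catch the *Musik* case the upstream heuristic misses:

```python
import re

# Heuristic sketch: a segment counts as non-speech if the whole text is a
# single [bracketed], (parenthesized), or *starred* annotation.
NON_SPEECH_RE = re.compile(r"^\s*(\[[^\]]*\]|\([^)]*\)|\*[^*]*\*)\s*$")

def looks_like_non_speech(segment_text: str) -> bool:
    return bool(NON_SPEECH_RE.match(segment_text))

print(looks_like_non_speech("[Musik]"))      # True
print(looks_like_non_speech("*Musik*"))      # True
print(looks_like_non_speech("Hello world"))  # False
```

This is purely lexical, so it would misclassify a real utterance that happens to be fully bracketed; any serious filter would probably want a whitelist of known annotation words per language.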

Do we care about this third class? Are there cases where we want to filter these segments? One case where filtering could be useful is alignment, but how important that is remains to be determined.

moeffju commented 1 year ago

When I transcribed all cccamp23 talks using whisper.cpp and the small model, it would almost always wrap non-speech parts in brackets, like `[Musik]` or `[applause]`. I just tried the current dev transcribee, and it seems to merge such tokens with the following text, whereas my local whisper.cpp CLI would usually put them on a separate timestamped line. So, two things that might help here:
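The separate-line behavior described above could be approximated on the text level by splitting merged segments at bracketed annotations. A sketch under the assumption that segments are plain strings (transcribee's real segment model is not shown here):

```python
import re

# Split a merged transcript line so bracketed annotations like "[applause]"
# end up as their own entries, roughly mimicking whisper.cpp's CLI output.
ANNOTATION_RE = re.compile(r"(\[[^\]]+\])")

def split_annotations(text: str) -> list[str]:
    # re.split with a capturing group keeps the annotations themselves.
    parts = [p.strip() for p in ANNOTATION_RE.split(text)]
    return [p for p in parts if p]

print(split_annotations("[applause] thank you everyone"))
```

In transcribee this would presumably also need to split the segment's timestamps, which plain text splitting cannot do; word-level timings from the aligner would be the natural source for that.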