bugbakery / transcribee

open source audio and video transcription software
https://transcribee.net
GNU Affero General Public License v3.0

Non speech text #25

Open rroohhh opened 1 year ago

rroohhh commented 1 year ago

There are several non-speech segments generated by whisper. Broadly, these come in three classes:

- punctuation tokens
- whisper special tokens (timestamps and markers like `<|startoftranscript|>`), which have dedicated token IDs
- non-speech annotations generated as normal transcript text (e.g. `[Musik]`)

The first two classes are easy to detect and, if necessary for some processing step, to split out. Punctuation is sampled from a limited set of characters (like `.`, `,`, `-`, ...), and the special tokens have specific token IDs that can be filtered out.
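A minimal sketch of filtering those first two classes. The token structure (dicts with `id` and `text`), the special-token ID threshold, and the punctuation set are assumptions for illustration, not transcribee's actual data model:

```python
# Hypothetical filter for the first two classes: whisper special tokens
# (assumed here to occupy IDs at or above a threshold) and tokens whose
# text is pure punctuation. Threshold value is an assumption.
PUNCTUATION = set(".,;:!?-…")

def is_punctuation(text: str) -> bool:
    stripped = text.strip()
    return bool(stripped) and all(ch in PUNCTUATION for ch in stripped)

def filter_tokens(tokens, special_id_start=50257):
    """Keep only tokens that are neither special nor pure punctuation."""
    return [
        t for t in tokens
        if t["id"] < special_id_start and not is_punctuation(t["text"])
    ]

tokens = [
    {"id": 50258, "text": "<|startoftranscript|>"},  # special token, dropped
    {"id": 1012, "text": "Hello"},                   # kept
    {"id": 11, "text": ","},                         # punctuation, dropped
    {"id": 995, "text": " world"},                   # kept
]
print(filter_tokens(tokens))
```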

The third class, however, is generated just like "normal" transcript text. The official whisper implementation seems to have some heuristics to filter them: https://github.com/openai/whisper/blob/ad3250a846fe7553a25064a2dc593e492dadf040/whisper/tokenizer.py#L237 However, this looks like a fairly basic heuristic that will not work for *Musik*, for example.
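One possible text-level heuristic for this third class (an assumption sketched here, not transcribee's or whisper's actual code): treat a segment as non-speech if its entire text is wrapped in brackets, parentheses, or asterisks, which would also catch the *Musik* case the upstream heuristic misses:

```python
import re

# Heuristic sketch: a segment counts as non-speech if the whole text is a
# single [bracketed], (parenthesized), or *starred* annotation.
NON_SPEECH_RE = re.compile(r"^\s*(\[[^\]]*\]|\([^)]*\)|\*[^*]*\*)\s*$")

def looks_like_non_speech(segment_text: str) -> bool:
    return bool(NON_SPEECH_RE.match(segment_text))

print(looks_like_non_speech("[Musik]"))      # True
print(looks_like_non_speech("*Musik*"))      # True
print(looks_like_non_speech("Hello world"))  # False
```

This is purely lexical, so it would misclassify a real utterance that happens to be fully bracketed; any serious filter would probably want a whitelist of known annotation words per language.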

Do we care about this third class? Are there cases where we want to filter these segments? One case where filtering could be useful is alignment, but how important that is remains to be determined.

moeffju commented 1 year ago

When I transcribed all cccamp23 talks using whisper.cpp and the small model, it would almost always wrap non-speech parts in brackets, like `[Musik]` or `[applause]`. I just tried the current dev transcribee, and it seems to merge such tokens with the following text, whereas my local whisper.cpp CLI would usually put them on a separate timestamped line. So, two things that might help here:
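The separate-line behavior described above could be approximated on the text level by splitting merged segments at bracketed annotations. A sketch under the assumption that segments are plain strings (transcribee's real segment model is not shown here):

```python
import re

# Split a merged transcript line so bracketed annotations like "[applause]"
# end up as their own entries, roughly mimicking whisper.cpp's CLI output.
ANNOTATION_RE = re.compile(r"(\[[^\]]+\])")

def split_annotations(text: str) -> list[str]:
    # re.split with a capturing group keeps the annotations themselves.
    parts = [p.strip() for p in ANNOTATION_RE.split(text)]
    return [p for p in parts if p]

print(split_annotations("[applause] thank you everyone"))
```

In transcribee this would presumably also need to split the segment's timestamps, which plain text splitting cannot do; word-level timings from the aligner would be the natural source for that.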