In real speech, even if a speaker repeats a phrase several times, the recognition probability (confidence) should differ from one repetition to the next.
In hallucinated text, however, both the text and the probability sometimes repeat exactly. Example results: https://docs.google.com/spreadsheets/d/10xfkOZpGJ-9vIvoYziEkD1lZETWMbBLDT-NABdQ8H_g/edit#gid=0&range=32:34
By removing segments that share the same text and probability, we can reduce hallucinations by around 50%.
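A minimal sketch of this kind of deduplication, not the exact code in this change. The `Segment` class and its `text` / `probability` fields are hypothetical names, and the sketch assumes that the first occurrence of each (text, probability) pair is kept and that later identical pairs are dropped anywhere in the transcript:

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical segment shape; the real field names may differ.
    text: str
    probability: float


def dedupe_segments(segments):
    """Keep the first segment for each exact (text, probability) pair."""
    seen = set()
    kept = []
    for seg in segments:
        key = (seg.text, seg.probability)
        if key in seen:
            # Same text AND same confidence: treated as a hallucinated repeat.
            continue
        seen.add(key)
        kept.append(seg)
    return kept
```

A genuinely repeated phrase normally comes with a slightly different probability each time, so it survives this filter. Only exact repeats of both values are removed.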
We also correct the Whisper prompt. Unlike ChatGPT, Whisper is not instruction-tuned, so putting commands in its prompt is meaningless. Therefore, the prompt now contains only the bare minimum text needed to steer the transcript toward Taiwanese Mandarin (e.g. 網際網路 instead of 互联网, 影片 instead of 视频) and full-width punctuation.
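As an illustration only, a style-priming prompt could look like the sketch below. The prompt string is a hypothetical example, not the text used in this change, and `transcribe_fn` is a placeholder for whatever transcription callable the project uses. The sketch assumes that callable accepts an `initial_prompt` keyword, as `transcribe` in the open-source `whisper` package does:

```python
# Hypothetical prompt: a short sample written in the desired style
# (Taiwanese Mandarin vocabulary, full-width punctuation), not a command.
STYLE_PROMPT = "這是一段網際網路上的影片，以下是逐字稿。"


def transcribe_with_style_prompt(transcribe_fn, audio_path):
    """Call the given transcription function with the style prompt.

    transcribe_fn is an assumed callable taking (audio_path, initial_prompt=...).
    """
    return transcribe_fn(audio_path, initial_prompt=STYLE_PROMPT)
```

The prompt works by example: Whisper tends to continue in the vocabulary and punctuation style of the preceding text, so a sample sentence is more useful than an instruction such as "use Traditional Chinese".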
Examples
Before
After
Some hallucinations are deduplicated, but some remain.
Before
After
Most hallucinations are deduplicated, but the sentence still repeats once.
Known issue
This does not help for videos without any speech, because the hallucinations there do not repeat.