Open kjhenner opened 8 months ago
The same thing happened to me, and it would be great to get it fixed.
Can you check out no_speech_threshold here:
https://github.com/collabora/WhisperLive/blob/8d77f0fa5a83a1236e23198fea33f62ad9a26460/whisper_live/server.py#L714
Try changing it and see whether the results improve.
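For anyone tuning this: in general terms, a no-speech threshold drops segments that the model itself scores as probably silence. Here's a minimal TypeScript sketch of that idea, not WhisperLive's actual code. The `ScoredSegment` shape and the `dropLikelySilence` helper are made up for illustration, and it assumes each segment carries a no-speech probability between 0 and 1.

```typescript
// Hypothetical segment shape: the field names are assumptions for this sketch.
interface ScoredSegment {
  text: string;
  noSpeechProb: number; // model's estimate that the audio contains no speech
}

// Keep only segments whose no-speech probability does not exceed the threshold.
function dropLikelySilence(
  segments: ScoredSegment[],
  noSpeechThreshold: number
): ScoredSegment[] {
  return segments.filter((s) => s.noSpeechProb <= noSpeechThreshold);
}
```

With this kind of filter, lowering the threshold drops more segments (fewer hallucinations, but quiet speech may get cut), and raising it keeps more.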
Thanks, I'll take a look!
@kjhenner Did you solve it? What value did you set?
Nothing yet--just gotta get this going on my local system so I can experiment a little. I'll let you know if I find a good solution.
I faced a comparable issue with the TalTechNLP/whisper-large-et model. To tackle the problem in my Node.js testing application, I utilized Silero VAD for initial speech detection. However, the model still encountered difficulties, hallucinating and generating random text, even when no data was sent to Whisper Live.
Notably, there was a noticeable improvement in the model's performance when I transmitted empty data to Whisper Live whenever no speech was detected. This approach is highlighted in my test code with the condition if (!speaking) { ...
import { spawn } from 'child_process'
import { WebSocket } from "ws"
import { v4 as uuidv4 } from 'uuid'
import { logger } from './src/logger'
import { SpeechDetector } from "./src/vad/speechDetector"

// Capture microphone audio and write raw 16 kHz mono PCM to stdout
const ffmpeg = spawn('ffmpeg', [
  '-f', 'avfoundation',
  '-i', ':0', // Make sure the index matches your device
  '-ac', '1', // Capture in mono
  '-ar', '16000', // Set sample rate to 16kHz
  '-f', 's16le', // Set format to signed 16-bit little-endian
  // The flag and its filter graph must be separate argv entries
  '-af', 'highpass=f=300,asendcmd=0.0 afftdn sn start,asendcmd=1.5 afftdn sn stop,afftdn=nf=-20,dialoguenhance,lowpass=f=3000',
  '-'
])

// Convert signed 16-bit PCM bytes to floats in [-1, 1)
function bufferToFloat32Array(buffer: Buffer) {
  const data = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / Int16Array.BYTES_PER_ELEMENT);
  const float32Array = new Float32Array(data.length);
  for (let i = 0; i < data.length; i++) {
    float32Array[i] = data[i] / 0x8000;
  }
  return float32Array;
}

// Track whether the VAD currently detects speech
let speaking = false
SpeechDetector.create(0.9, 0.75).then((speechDetector) => {
  speechDetector.readFromStream(ffmpeg.stdout as any).then(() => {
    speechDetector.on('speechStart', (start: number) => {
      speaking = true
      console.log('Speech start:', start)
    })
    speechDetector.on('speechEnd', (end: number) => {
      speaking = false
      console.log('Speech end:', end)
    })
  })
})

type Transcript = {
  uid: string
  message: string
  segments: Array<{
    start: string
    end: string
    text: string
  }>
}

const whisper = new WebSocket('ws://46.227.xxx.xxx:24882')
const uid = uuidv4()

// Send the session options as soon as the socket opens
whisper.on('open', () => {
  logger.info('Whisper connection open')
  whisper.send(
    JSON.stringify({
      uid,
      language: "et",
      task: "transcribe",
      use_vad: true
    })
  )
})

whisper.onmessage = (event) => {
  const data: Transcript = JSON.parse(event.data.toString())
  if (data.uid !== uid) return // ignore messages that are not for this recording
  if (data.message === 'SERVER_READY') {
    console.log('Server ready')
    return
  }
  if (data.message === 'DISCONNECTED') {
    console.log('Server disconnected')
    whisper.close()
    return
  }
  console.log(data.segments)
}

whisper.on('open', async () => {
  let windowSizeSamples = 512
  let sampleBuffer = new Float32Array(windowSizeSamples); // Buffer for accumulating samples
  let bufferIndex = 0; // Index for the next sample in the buffer
  ffmpeg.stdout.on('data', (chunk: Buffer) => {
    if (whisper.readyState !== WebSocket.OPEN) {
      logger.error('Whisper not open')
      return
    }
    if (!speaking) {
      // No speech detected: send 5 s of silence (80,000 samples at 16 kHz) instead of the mic audio
      whisper.send(new Float32Array(80000).buffer)
      return
    }
    const audioData = bufferToFloat32Array(chunk)
    for (let sample of audioData) {
      sampleBuffer[bufferIndex++] = sample;
      if (bufferIndex === windowSizeSamples) {
        // Window is full: send it and start a fresh one
        whisper.send(Buffer.from(sampleBuffer.buffer))
        bufferIndex = 0
        sampleBuffer = new Float32Array(windowSizeSamples)
      }
    }
  })
})
The issue isn't fully gone, so I'm still experimenting with different params.
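One detail in the test code above: while `speaking` is false, it sends 5 s of silence for every ffmpeg chunk, so the server receives far more silence than real time. A possible variant is to replace each chunk with silence of the same length. This is only a sketch under that assumption, not a tested fix; `pcm16ToFloat32` and `gateChunk` are hypothetical helper names, and they take a plain `Uint8Array` (a Node `Buffer` can be passed as-is).

```typescript
// Convert little-endian signed 16-bit PCM bytes to floats in [-1, 1).
// DataView tolerates odd byte offsets; a trailing odd byte is ignored.
function pcm16ToFloat32(bytes: Uint8Array): Float32Array {
  const sampleCount = Math.floor(bytes.byteLength / 2);
  const view = new DataView(bytes.buffer, bytes.byteOffset, sampleCount * 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 0x8000;
  }
  return out;
}

// Return the audio to forward: the real samples while speaking,
// otherwise zeros covering the same number of samples as the chunk.
function gateChunk(bytes: Uint8Array, speaking: boolean): Float32Array {
  if (!speaking) {
    return new Float32Array(Math.floor(bytes.byteLength / 2));
  }
  return pcm16ToFloat32(bytes);
}
```

In the `data` handler this would be used as `const audioData = gateChunk(chunk, speaking)`, with the result fed through the same 512-sample windowing loop, so the stream keeps its real-time pacing whether or not speech is detected.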
Isn't there an expert who can solve this annoying problem?
Running the medium English model with VAD enabled, I've noticed a tendency to hallucinate phrases like "Thanks for watching!", "Thanks!", "That's all," etc. I assume it's receiving some audio data just above the VAD threshold but without any intelligible speech. I'm not really sure about the inner workings of the Whisper model, but it makes some sense that the model would be biased towards seeing these kinds of phrases at the end of training data? Maybe there's an end token that's putting a lot of probability on those phrases in the absence of any other intelligible data?
I haven't looked at the VAD code at all, so I'm not really sure what the approach would be to address this, but it'd be nice if it's fixable!
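Until the root cause is addressed, one client-side stopgap is to discard segments whose whole text is one of these stock phrases. A minimal sketch, assuming an exact match after light normalization is good enough; the phrase list comes from the examples above, and `isLikelyHallucination` is a hypothetical helper, not part of WhisperLive.

```typescript
// Stock phrases reported in this thread, stored lowercased without trailing punctuation.
const STOCK_PHRASES = new Set(["thanks for watching", "thanks", "that's all"]);

// True when the entire segment text is one of the stock phrases.
// Longer sentences that merely contain a phrase are kept.
function isLikelyHallucination(text: string): boolean {
  const normalized = text
    .trim()
    .toLowerCase()
    .replace(/[.!?,\s]+$/, ""); // strip trailing punctuation and whitespace
  return STOCK_PHRASES.has(normalized);
}
```

This only hides the symptom and will also drop a genuine "Thanks!", so it is probably best kept behind a flag.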