Hallucinating conclusive remarks ("Thanks for watching!", "That's all!", etc.) with non-speech noise just above VAD threshold

kjhenner commented 8 months ago

Running the medium English model with VAD enabled, I've noticed a tendency to hallucinate phrases like "Thanks for watching!", "Thanks!", "That's all," etc. I assume it's receiving some audio data just above the VAD threshold but without any intelligible speech. I'm not really sure about the inner workings of the Whisper model, but it makes some sense that the model would be biased towards these kinds of phrases, since they tend to show up at the end of training data. Maybe there's an end token that's putting a lot of probability on those phrases in the absence of any other intelligible data?

I haven't looked at the VAD code at all, so I'm not really sure what the approach would be to address this, but it'd be nice if it's fixable!

Ye83 commented 8 months ago

The same thing happened to me, and it would be great to get it fixed.

makaveli10 commented 8 months ago

Can you guys check out no_speech_threshold here: https://github.com/collabora/WhisperLive/blob/8d77f0fa5a83a1236e23198fea33f62ad9a26460/whisper_live/server.py#L714

Try changing it and see if the results are better.
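
For anyone unsure what this threshold controls, here is a minimal sketch of the skip rule used by the reference Whisper decoder, written in TypeScript to match the client code in this thread. The `ScoredSegment` shape and both function names are hypothetical: segments sent to WhisperLive clients only carry start/end/text, so this only illustrates the idea and is not the code at the linked line. The default values 0.6 and -1.0 are the reference Whisper defaults.

```typescript
// Hypothetical segment shape for illustration only. WhisperLive client
// messages do not include these score fields.
interface ScoredSegment {
  text: string;
  noSpeechProb: number; // model's estimate that the window contains no speech
  avgLogProb: number;   // mean token log-probability of the decoded text
}

// A segment is treated as silence when the no-speech probability is above
// the threshold, unless the decoded text is itself confident
// (avgLogProb above logProbThreshold).
function isLikelySilence(
  seg: ScoredSegment,
  noSpeechThreshold: number = 0.6,
  logProbThreshold: number = -1.0,
): boolean {
  const aboveNoSpeech = seg.noSpeechProb > noSpeechThreshold;
  const confidentText = seg.avgLogProb > logProbThreshold;
  return aboveNoSpeech && !confidentText;
}

// Keep only the segments that the rule above does not flag as silence.
function dropSilentSegments(
  segments: ScoredSegment[],
  noSpeechThreshold: number = 0.6,
  logProbThreshold: number = -1.0,
): ScoredSegment[] {
  return segments.filter(
    (seg) => !isLikelySilence(seg, noSpeechThreshold, logProbThreshold),
  );
}
```

Lowering the threshold makes the filter more aggressive, so a hallucinated "Thanks for watching!" on near-silent audio is more likely to be dropped, at the cost of sometimes dropping quiet real speech.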

kjhenner commented 8 months ago

Thanks, I'll take a look!

Ye83 commented 8 months ago

Did you solve it? What value have you set? @kjhenner

kjhenner commented 8 months ago

Nothing yet--just gotta get this going on my local system so I can experiment a little. I'll let you know if I find a good solution.

Siim commented 8 months ago

I ran into a similar issue with the TalTechNLP/whisper-large-et model. To tackle the problem in my Node.js test application, I used Silero VAD for initial speech detection. However, the model still had trouble, hallucinating and generating random text even when no data was being sent to WhisperLive.

Notably, the model's output improved when I sent empty (all-zero) audio to WhisperLive whenever no speech was detected. You can see this in my test code below, in the condition if (!speaking) { ...


import { WebSocket } from "ws"
import { v4 as uuidv4 } from 'uuid'
import { logger } from './src/logger'
import { SpeechDetector } from "./src/vad/speechDetector"

import { spawn } from 'child_process'

// Capture microphone audio as raw 16 kHz mono PCM and write it to stdout
const ffmpeg = spawn('ffmpeg', [
  '-f', 'avfoundation',
  '-i', ':0', // Make sure the index matches your device
  '-ac', '1',  // Capture in mono
  '-ar', '16000',  // Set sample rate to 16kHz
  '-f', 's16le',  // Set format to signed 16-bit little-endian
  // The flag and its value must be separate argv entries when using spawn
  '-af', 'highpass=f=300,asendcmd=0.0 afftdn sn start,asendcmd=1.5 afftdn sn stop,afftdn=nf=-20,dialoguenhance,lowpass=f=3000',
  '-'
])

function bufferToFloat32Array(buffer: Buffer) {
  const data = new Int16Array(buffer.buffer, buffer.byteOffset, buffer.length / Int16Array.BYTES_PER_ELEMENT);
  const float32Array = new Float32Array(data.length);
  for (let i = 0; i < data.length; i++) {
    float32Array[i] = data[i] / 0x8000;
  }
  return float32Array;
}

let speaking = false
SpeechDetector.create(0.9, 0.75).then((speechDetector) => {
  speechDetector.readFromStream(ffmpeg.stdout as any).then(() => {

    speechDetector.on('speechStart', (start: number) => {
      speaking = true
      console.log('Speech start:', start)
    })

    speechDetector.on('speechEnd', (end: number) => {
      speaking = false
      console.log('Speech end:', end)
    })
  })
})

type Transcript = {
  uid: string
  message: string
  segments: Array<{
    start: string
    end: string
    text: string
  }>
}

const whisper = new WebSocket('ws://46.227.xxx.xxx:24882')
const uid = uuidv4()

whisper.on('open', () => {
  logger.info('Whisper connection open')
  whisper.send(
    JSON.stringify({
      uid,
      language: "et",
      task: "transcribe",
      use_vad: true
    })
  )
})

whisper.onmessage = (event) => {
  const data: Transcript = JSON.parse(event.data.toString())
  if (data.uid !== uid) return // ignore messages that are not for this recording
  if (data?.message && data?.message === 'SERVER_READY') {
    console.log('Server ready')
    return
  }

  if (data.message === 'DISCONNECTED') {
    console.log('Server disconnected')
    whisper.close()
    return
  }
  console.log(data.segments)
}

whisper.on('open', async () => {
  let windowSizeSamples = 512
  let sampleBuffer = new Float32Array(windowSizeSamples); // Buffer for accumulating samples
  let bufferIndex = 0; // Index for the next sample in the buffer

  ffmpeg.stdout.on('data', (chunk: Buffer) => {
    if (whisper.readyState !== WebSocket.OPEN) {
      logger.error('Whisper not open')
      return
    }
    if (!speaking) {
      whisper.send(new Float32Array(80000).buffer) // 5 x 16k samples of empty data
      return
    }
    const audioData = bufferToFloat32Array(chunk)
    for (let sample of audioData) {
      sampleBuffer[bufferIndex++] = sample;
      if (bufferIndex === windowSizeSamples) {
        whisper.send(Buffer.from(sampleBuffer.buffer))
        bufferIndex = 0
        sampleBuffer = new Float32Array(windowSizeSamples)
      }
    }
  })
})

I'm still seeing some hallucinations, so I'm experimenting with different parameters.
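
As a possible variation on the gating above (a sketch only, not tested against WhisperLive), the silence could be sized to match each incoming chunk instead of a fixed 5 seconds, so the amount of audio sent stays in step with real time. The names `pcm16ToFloat32` and `payloadForChunk` are made up for this example, and it skips the 512-sample windowing used in the original code.

```typescript
// Convert signed 16-bit little-endian PCM bytes into floats in [-1, 1).
// Reading through a DataView means an odd byte offset or a trailing odd
// byte cannot throw; a trailing odd byte is simply ignored.
function pcm16ToFloat32(chunk: Uint8Array): Float32Array {
  const sampleCount = Math.floor(chunk.byteLength / 2);
  const view = new DataView(chunk.buffer, chunk.byteOffset, sampleCount * 2);
  const out = new Float32Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    out[i] = view.getInt16(i * 2, true) / 0x8000; // true = little-endian
  }
  return out;
}

// Decide what to forward for one chunk: the real audio while the VAD
// reports speech, otherwise zeros covering the same number of samples.
function payloadForChunk(chunk: Uint8Array, speaking: boolean): Float32Array {
  const sampleCount = Math.floor(chunk.byteLength / 2);
  return speaking ? pcm16ToFloat32(chunk) : new Float32Array(sampleCount);
}
```

Since a Node.js Buffer is a Uint8Array, the data handler could then be reduced to something like whisper.send(payloadForChunk(chunk, speaking).buffer). Whether this helps with the hallucinations is untested.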

Hkaisense commented 7 months ago

Isn't there an expert who can solve this annoying problem?

collabora/WhisperLive, issue #185