alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Valid wav file is returning an empty text as transcription #1504

Open jrichardsz opened 5 months ago

jrichardsz commented 5 months ago

Expected Behavior

Receive the wav file stream and get the transcription

Current Behavior

rec.result() is returning an empty text {"text":""}

if (rec.acceptWaveform(message)){      
  console.log(JSON.stringify(rec.result()));
}

Steps to Reproduce

  1. I create a valid wav file in the web browser using JavaScript, first obtaining a buffer and then creating a blob
  2. Send the wav file to the Node.js server (socket.io)
  3. Get the Buffer <Buffer@0x6260d30 52 49 46 46 24 da 0e 00 57 41 5 ... and pass it to the vosk recognizer instance
@SocketIoEvent(eventName = "send-audio")
this.sendAudio = async (message, currentSocket, globalSocket) => {
  var id = uuidv4();
  console.log(id);
  console.log(message);
  var wavLocation = `/tmp/${id}.wav`;
  await fs.promises.writeFile(wavLocation, message);
  console.log(wavLocation)

  //initRecognizer(); singleton
  if (rec.acceptWaveform(message)){      
    console.log(JSON.stringify(rec.result()));
  } else {
    console.log("not a wave format");
  }
}

Context (Environment)

Additional information

With the browser

(screenshot omitted)

With ffmpeg

ffprobe -loglevel error -show_streams -i /tmp/19e2f4ce-bd22-4af2-8eac-180fbdd1cfc8.wav

[STREAM]
index=0
codec_name=pcm_s16le
codec_long_name=PCM signed 16-bit little-endian
profile=unknown
codec_type=audio
codec_tag_string=[1][0][0][0]
codec_tag=0x0001
sample_fmt=s16
sample_rate=48000
channels=1
channel_layout=unknown
bits_per_sample=16
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/48000
start_pts=N/A
start_time=N/A
duration_ts=173696
duration=3.618667
bit_rate=768000
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
[/STREAM]

With https://www.npmjs.com/package/wavefile

{
  container: 'RIFF',
  chunkSize: 347428,
  format: 'WAVE',
  Y: {
    chunkId: 'RIFF',
    chunkSize: 347428,
    format: 'WAVE',
    subChunks: [ [Object], [Object] ]
  },
  c: 36,
  a: { h: 32, o: false },
  Z: [ 'RIFF', 'RIFX', 'RF64' ],
  fmt: {
    chunkId: 'fmt ',
    chunkSize: 16,
    audioFormat: 1,
    numChannels: 1,
    sampleRate: 48000,
    byteRate: 96000,
    blockAlign: 2,
    bitsPerSample: 16,
    cbSize: 0,
    validBitsPerSample: 0,
    dwChannelMask: 0,
    subformat: []
  },
  fact: { chunkId: '', chunkSize: 0, dwSampleLength: 0 },
  cue: { chunkId: '', chunkSize: 0, dwCuePoints: 0, points: [] },
  smpl: {
    chunkId: '',
    chunkSize: 0,
    dwManufacturer: 0,
    dwProduct: 0,
    dwSamplePeriod: 0,
    dwMIDIUnityNote: 0,
    dwMIDIPitchFraction: 0,
    dwSMPTEFormat: 0,
    dwSMPTEOffset: 0,
    dwNumSampleLoops: 0,
    dwSamplerData: 0,
    loops: []
  },
  bext: {
    chunkId: '',
    chunkSize: 0,
    description: '',
    originator: '',
    originatorReference: '',
    originationDate: '',
    originationTime: '',
    timeReference: [ 0, 0 ],
    version: 0,
    UMID: '',
    loudnessValue: 0,
    loudnessRange: 0,
    maxTruePeakLevel: 0,
    maxMomentaryLoudness: 0,
    maxShortTermLoudness: 0,
    reserved: '',
    codingHistory: ''
  },
  iXML: { chunkId: '', chunkSize: 0, value: '' },
  ds64: {
    chunkId: '',
    chunkSize: 0,
    riffSizeHigh: 0,
    riffSizeLow: 0,
    dataSizeHigh: 0,
    dataSizeLow: 0,
    originationTime: 0,
    sampleCountHigh: 0,
    sampleCountLow: 0
  },
  data: {
    chunkId: 'data',
    chunkSize: 347392,
    samples: <Buffer@0x6da188c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ... 347342 more bytes>
  },
  LIST: [],
  junk: { chunkId: '', chunkSize: 0, chunkData: [] },
  _PMX: { chunkId: '', chunkSize: 0, value: '' },
  g: { h: 16, o: false, O: false, R: false },
  bitDepth: '16',
  f: { h: 16, R: false, O: true, o: false },
  G: {
    '4': 17,
    '8': 1,
    '16': 1,
    '24': 1,
    '32': 1,
    '64': 3,
    '8a': 6,
    '8m': 7,
    '32f': 3
  }
}
/**
  * Accept voice data
  *
  * accept and process new chunk of voice data
  *
  * @param {Buffer} data audio data in PCM 16-bit mono format
  * @returns {boolean} true if silence has occurred and you can retrieve a new utterance with the result method
  */
acceptWaveform(data: Buffer): boolean;
nshmyrev commented 5 months ago

The demo Recognizer is initialized with a 16 kHz sample rate, but your file is 48 kHz. You probably need to check the line where you create the recognizer and make sure it uses the proper sample rate.
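One way to confirm this kind of mismatch is to read the sample rate straight out of the wav header before creating the recognizer. A minimal sketch, assuming a canonical 44-byte PCM header (field offsets per the RIFF/WAVE format; `readWavHeader` is a hypothetical helper, not part of vosk):

```javascript
// Read key fields from a canonical 44-byte PCM wav header.
// Offsets per the RIFF/WAVE spec: numChannels at byte 22,
// sampleRate at byte 24, bitsPerSample at byte 34 (all little-endian).
function readWavHeader(buf) {
  if (buf.length < 44 || buf.toString('ascii', 0, 4) !== 'RIFF'
      || buf.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('not a canonical RIFF/WAVE header');
  }
  return {
    numChannels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitsPerSample: buf.readUInt16LE(34),
  };
}
```

The recognizer's `sampleRate` option should match the value reported here (48000 for the file above), or the audio has to be resampled first.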

jrichardsz commented 5 months ago

Thanks for your help.

I will review the sample rate and share the result.

jrichardsz commented 5 months ago

I changed the initial sample rate from 16000 to 48000

  var sampleRate = 48000
  var rec;

  initRecognizer=()=>{
    if(typeof rec !== 'undefined' ) return;
    vosk.setLogLevel(0);
    const model = new vosk.Model(this.configuration.vosk_model_path);
    rec = new vosk.Recognizer({model: model, sampleRate: sampleRate});  
  }

But now rec.acceptWaveform(message) returns false. Just to try, I went back to 16000; at least rec.acceptWaveform(message) returns true then, but with an empty result.
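If the model expects 16 kHz, another route is to downsample the 48 kHz audio before feeding it to the recognizer. A crude sketch that decimates mono 16-bit PCM by keeping every third sample (48000 / 16000 = 3); note a real resampler applies an anti-aliasing filter first, so wavefile's `toSampleRate()` or ffmpeg/sox are safer choices in practice:

```javascript
// Naive factor-N decimation of mono 16-bit little-endian PCM
// (e.g. factor 3 for 48 kHz -> 16 kHz). No anti-aliasing filter:
// acceptable as a debugging sketch, not for production audio.
function decimatePcm16(buf, factor) {
  const inSamples = Math.floor(buf.length / 2);
  const outSamples = Math.ceil(inSamples / factor);
  const out = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    out.writeInt16LE(buf.readInt16LE(i * factor * 2), i * 2);
  }
  return out;
}
```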


Thanks for your help

jrichardsz commented 5 months ago

Continuing with my attempts: if the wave buffer comes directly from the microphone, it works

var mic = require("mic");
var micInstance = mic({
    rate: String(SAMPLE_RATE),
    channels: '1',
    debug: false,
    device: 'default',    
});

var micInputStream = micInstance.getAudioStream();
micInputStream.on('data', async (buffer) => {    
    if (rec.acceptWaveform(buffer)){
    if (rec.acceptWaveform(buffer)){

But if it comes from the socket, it does not work.


Using this https://alanastorm.com/nodejs-inspecting-bytes-with-node-js-buffer-objects/ I'm trying to compare both buffers byte by byte to understand the difference.

I saved the comparison in this csv: compare_bytes.csv

At first sight, the bytes from the socket contain many runs of 0x00.

(screenshot omitted)


Note that the wav buffer from the socket can be stored as a valid wav file (PCM, 16-bit, etc.):

  @SocketIoEvent(eventName = "receive-audio")
  this.receiveAudio = async (message, currentSocket, globalSocket) => {

    var id = uuidv4();
    var wavLocation = `/tmp/${id}.wav`;
    await fs.promises.writeFile(wavLocation, message);

But the buffer from the microphone cannot be saved as a wav file. Also, if I read it using wavefile I get the error "Error: Not a supported format."


What part of the wav file is vosk (nodejs) expecting?

Thanks

nshmyrev commented 5 months ago

What part of the wav file is vosk (nodejs) expecting?

Only the body

jrichardsz commented 5 months ago

According to the wave format, the data field occupies byte positions 38 to 45:

(screenshot omitted)

I tried it, but rec.acceptWaveform(data) returns false:

var data = message.slice(38,45);
rec.acceptWaveform(data)

Could you point me to some reading material to understand how to extract the data from a wav file?

Thanks

nshmyrev commented 5 months ago

You should keep the message as-is; your slice doesn't make sense. The header is only the first 44 bytes of the first message, and you can even keep it.
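That advice can be sketched as a small guard in the socket handler: only the first chunk of a wav stream carries the header, and stripping it is optional. `pcmBody` here is a hypothetical helper, assuming the canonical 44-byte header layout:

```javascript
// Return the PCM body of an audio chunk. Only the first chunk of a wav
// stream starts with the 44-byte canonical RIFF header; later chunks are
// raw PCM. (Per the maintainer's comment, vosk also tolerates the header.)
function pcmBody(chunk) {
  const hasHeader = chunk.length >= 44
    && chunk.toString('ascii', 0, 4) === 'RIFF';
  return hasHeader ? chunk.slice(44) : chunk;
}
```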

jrichardsz commented 5 months ago

Ok, I will keep the full wav file.

Comparing the wav from the socket (does not work) vs. from the microphone (works in the sample), I found this:

From socket

@SocketIoEvent(eventName = "send-audio")
this.sendAudio = async (message, currentSocket, globalSocket) => {

    console.log("0  >  4 : "+message.slice(0,4).toString())
    console.log("8  > 12 : "+message.slice(8,12).toString())
    console.log("12 > 14 : "+message.slice(12,14).toString())
    console.log("36 > 40 : "+message.slice(36,40).toString())
    console.log("45 > end:", message.slice(45))

Output:

(screenshot omitted)

The output indicates that the buffer received in the socket event is a valid wav file.

From Microphone (vosk sample)

micInputStream.on('data', async (data) => {    
  if (rec.acceptWaveform(data)){

      console.log("0  >  4 : "+data.slice(0,4).toString())
      console.log("8  > 12 : "+data.slice(8,12).toString())
      console.log("12 > 14 : "+data.slice(12,14).toString())
      console.log("36 > 40 : "+data.slice(36,40).toString())
      console.log("45 > end:", data.slice(45))

      console.log("data:"+JSON.stringify(rec.result()));

Output

(screenshot omitted)

The buffer received from the microphone (https://www.npmjs.com/package/mic) is not a valid wav file, but it works with vosk.

jrichardsz commented 5 months ago

I don't know if it helps, but both the object returned by the microphone (vosk sample) and the one received from the socket are Uint8Array instances.
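As an aside, if the incoming data ever arrives as a plain Uint8Array rather than a Node Buffer, it can be wrapped without copying before being passed along, using the standard `Buffer.from(arrayBuffer, byteOffset, length)` overload (`toBuffer` is a hypothetical helper name):

```javascript
// Wrap a Uint8Array view in a Node Buffer without copying the bytes.
// The returned Buffer shares memory with the original typed array.
function toBuffer(u8) {
  return Buffer.isBuffer(u8)
    ? u8
    : Buffer.from(u8.buffer, u8.byteOffset, u8.byteLength);
}
```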

nshmyrev commented 5 months ago

Dump the data you receive from both the microphone and the socket to a file and share it here, please

jrichardsz commented 5 months ago

I will dump the data. In the meantime, I prepared a reproducible sample:

https://github.com/jrichardsz/nodejs-wav-vosk-transcription

As a summary:

Thank you very much for your kind help

jrichardsz commented 3 months ago

I tried the same thing with another library and it works:

https://github.com/solyarisoftware/voskJs