alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0

Valid wav file is returning an empty text as transcription #1504

Open jrichardsz opened 5 months ago

jrichardsz commented 5 months ago

Expected Behavior

Receive the wav file stream and get the transcription

Current Behavior

rec.result() is returning an empty text {"text":""}

if (rec.acceptWaveform(message)){      
  console.log(JSON.stringify(rec.result()));
}

Steps to Reproduce

  1. I create a valid wav file in the web browser using JavaScript, first obtaining a buffer and then creating a blob
  2. Send the wav file to the Node.js server (socket.io)
  3. Get the Buffer <Buffer@0x6260d30 52 49 46 46 24 da 0e 00 57 41 5 ... and pass it to the vosk recognizer instance
@SocketIoEvent(eventName = "send-audio")
this.sendAudio = async (message, currentSocket, globalSocket) => {
  var id = uuidv4();
  console.log(id);
  console.log(message);
  var wavLocation = `/tmp/${id}.wav`;
  await fs.promises.writeFile(wavLocation, message);
  console.log(wavLocation)

  //initRecognizer(); singleton
  if (rec.acceptWaveform(message)){      
    console.log(JSON.stringify(rec.result()));
  } else {
    console.log("not a wave format");
  }
}

Context (Environment)

Additional information

With the browser

(screenshot omitted)

With ffmpeg

ffprobe -loglevel error -show_streams -i /tmp/19e2f4ce-bd22-4af2-8eac-180fbdd1cfc8.wav

[STREAM]
index=0
codec_name=pcm_s16le
codec_long_name=PCM signed 16-bit little-endian
profile=unknown
codec_type=audio
codec_tag_string=[1][0][0][0]
codec_tag=0x0001
sample_fmt=s16
sample_rate=48000
channels=1
channel_layout=unknown
bits_per_sample=16
id=N/A
r_frame_rate=0/0
avg_frame_rate=0/0
time_base=1/48000
start_pts=N/A
start_time=N/A
duration_ts=173696
duration=3.618667
bit_rate=768000
max_bit_rate=N/A
bits_per_raw_sample=N/A
nb_frames=N/A
nb_read_frames=N/A
nb_read_packets=N/A
DISPOSITION:default=0
DISPOSITION:dub=0
DISPOSITION:original=0
DISPOSITION:comment=0
DISPOSITION:lyrics=0
DISPOSITION:karaoke=0
DISPOSITION:forced=0
DISPOSITION:hearing_impaired=0
DISPOSITION:visual_impaired=0
DISPOSITION:clean_effects=0
DISPOSITION:attached_pic=0
DISPOSITION:timed_thumbnails=0
[/STREAM]

With https://www.npmjs.com/package/wavefile

{
  container: 'RIFF',
  chunkSize: 347428,
  format: 'WAVE',
  Y: {
    chunkId: 'RIFF',
    chunkSize: 347428,
    format: 'WAVE',
    subChunks: [ [Object], [Object] ]
  },
  c: 36,
  a: { h: 32, o: false },
  Z: [ 'RIFF', 'RIFX', 'RF64' ],
  fmt: {
    chunkId: 'fmt ',
    chunkSize: 16,
    audioFormat: 1,
    numChannels: 1,
    sampleRate: 48000,
    byteRate: 96000,
    blockAlign: 2,
    bitsPerSample: 16,
    cbSize: 0,
    validBitsPerSample: 0,
    dwChannelMask: 0,
    subformat: []
  },
  fact: { chunkId: '', chunkSize: 0, dwSampleLength: 0 },
  cue: { chunkId: '', chunkSize: 0, dwCuePoints: 0, points: [] },
  smpl: {
    chunkId: '',
    chunkSize: 0,
    dwManufacturer: 0,
    dwProduct: 0,
    dwSamplePeriod: 0,
    dwMIDIUnityNote: 0,
    dwMIDIPitchFraction: 0,
    dwSMPTEFormat: 0,
    dwSMPTEOffset: 0,
    dwNumSampleLoops: 0,
    dwSamplerData: 0,
    loops: []
  },
  bext: {
    chunkId: '',
    chunkSize: 0,
    description: '',
    originator: '',
    originatorReference: '',
    originationDate: '',
    originationTime: '',
    timeReference: [ 0, 0 ],
    version: 0,
    UMID: '',
    loudnessValue: 0,
    loudnessRange: 0,
    maxTruePeakLevel: 0,
    maxMomentaryLoudness: 0,
    maxShortTermLoudness: 0,
    reserved: '',
    codingHistory: ''
  },
  iXML: { chunkId: '', chunkSize: 0, value: '' },
  ds64: {
    chunkId: '',
    chunkSize: 0,
    riffSizeHigh: 0,
    riffSizeLow: 0,
    dataSizeHigh: 0,
    dataSizeLow: 0,
    originationTime: 0,
    sampleCountHigh: 0,
    sampleCountLow: 0
  },
  data: {
    chunkId: 'data',
    chunkSize: 347392,
    samples: <Buffer@0x6da188c 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ... 347342 more bytes>
  },
  LIST: [],
  junk: { chunkId: '', chunkSize: 0, chunkData: [] },
  _PMX: { chunkId: '', chunkSize: 0, value: '' },
  g: { h: 16, o: false, O: false, R: false },
  bitDepth: '16',
  f: { h: 16, R: false, O: true, o: false },
  G: {
    '4': 17,
    '8': 1,
    '16': 1,
    '24': 1,
    '32': 1,
    '64': 3,
    '8a': 6,
    '8m': 7,
    '32f': 3
  }
}
/**
  * Accept voice data
  *
  * accept and process new chunk of voice data
  *
  * @param {Buffer} data audio data in PCM 16-bit mono format
  * @returns {boolean} true if silence has occurred and you can retrieve a new utterance with the result method
  */
acceptWaveform(data: Buffer): boolean;
nshmyrev commented 5 months ago

The demo Recognizer is initialized with a 16 kHz sample rate, but your file is 48 kHz. You probably need to check the line where you create the recognizer and make sure it uses the proper sample rate.
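One way to confirm this kind of mismatch is to read the sample rate straight out of the wav header before creating the recognizer. A minimal sketch, assuming a canonical 44-byte PCM header (field offsets per the RIFF/WAVE format; `readWavHeader` is a hypothetical helper, not part of vosk):

```javascript
// Read key fields from a canonical 44-byte PCM wav header.
// Offsets per the RIFF/WAVE spec: numChannels at byte 22,
// sampleRate at byte 24, bitsPerSample at byte 34 (all little-endian).
function readWavHeader(buf) {
  if (buf.length < 44 || buf.toString('ascii', 0, 4) !== 'RIFF'
      || buf.toString('ascii', 8, 12) !== 'WAVE') {
    throw new Error('not a canonical RIFF/WAVE header');
  }
  return {
    numChannels: buf.readUInt16LE(22),
    sampleRate: buf.readUInt32LE(24),
    bitsPerSample: buf.readUInt16LE(34),
  };
}
```

The recognizer's `sampleRate` option should match the value reported here (48000 for the file above), or the audio has to be resampled first.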

jrichardsz commented 5 months ago

Thanks for your help.

I will review the sample rate and share the result.

jrichardsz commented 5 months ago

I changed the initial sample rate from 16000 to 48000

  var sampleRate = 48000
  var rec;

  initRecognizer=()=>{
    if(typeof rec !== 'undefined' ) return;
    vosk.setLogLevel(0);
    const model = new vosk.Model(this.configuration.vosk_model_path);
    rec = new vosk.Recognizer({model: model, sampleRate: sampleRate});  
  }

But now rec.acceptWaveform(message) returns false. Just to try, I went back to 16000; at least rec.acceptWaveform(message) returns true then, but with an empty result.
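If the model expects 16 kHz, another route is to downsample the 48 kHz audio before feeding it to the recognizer. A crude sketch that decimates mono 16-bit PCM by keeping every third sample (48000 / 16000 = 3); note a real resampler applies an anti-aliasing filter first, so wavefile's `toSampleRate()` or ffmpeg/sox are safer choices in practice:

```javascript
// Naive factor-N decimation of mono 16-bit little-endian PCM
// (e.g. factor 3 for 48 kHz -> 16 kHz). No anti-aliasing filter:
// acceptable as a debugging sketch, not for production audio.
function decimatePcm16(buf, factor) {
  const inSamples = Math.floor(buf.length / 2);
  const outSamples = Math.ceil(inSamples / factor);
  const out = Buffer.alloc(outSamples * 2);
  for (let i = 0; i < outSamples; i++) {
    out.writeInt16LE(buf.readInt16LE(i * factor * 2), i * 2);
  }
  return out;
}
```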


Thanks for your help

jrichardsz commented 5 months ago

Continuing with my attempts: if the wave buffer comes directly from the microphone, it works

var mic = require("mic");
var micInstance = mic({
    rate: String(SAMPLE_RATE),
    channels: '1',
    debug: false,
    device: 'default',    
});

var micInputStream = micInstance.getAudioStream();
micInputStream.on('data', async (buffer) => {    
    if (rec.acceptWaveform(buffer)){
    if (rec.acceptWaveform(buffer)){

But if it comes from the socket, it does not work.


Using this https://alanastorm.com/nodejs-inspecting-bytes-with-node-js-buffer-objects/ I'm trying to compare both buffers byte by byte to understand the difference.

I saved the comparison in this csv: compare_bytes.csv

At first sight, the bytes from the socket contain many runs of 0x00.

(screenshot omitted)


Note that the wav buffer from the socket can be stored as a valid wav file (PCM, 16-bit, etc.):

  @SocketIoEvent(eventName = "receive-audio")
  this.receiveAudio = async (message, currentSocket, globalSocket) => {

    var id = uuidv4();
    var wavLocation = `/tmp/${id}.wav`;
    await fs.promises.writeFile(wavLocation, message);

But the buffer from the microphone cannot be saved as a wav file. Also, if I read it using wavefile I get the error "Error: Not a supported format."


What part of the wav file is vosk (nodejs) expecting?

Thanks

nshmyrev commented 5 months ago

What part of the wav file is vosk (nodejs) expecting?

Only the body

jrichardsz commented 5 months ago

According to the wave format, the data field occupies byte positions 38 to 45:

(screenshot omitted)

I tried it, but rec.acceptWaveform(data) returns false:

var data = message.slice(38,45);
rec.acceptWaveform(data)

Could you point me to some reading material to understand how to extract the data from a wav file?

Thanks

nshmyrev commented 5 months ago

You should keep the message as-is; your slice doesn't make sense. The header is only the first 44 bytes of the first message, and you can even keep it.
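That advice can be sketched as a small guard in the socket handler: only the first chunk of a wav stream carries the header, and stripping it is optional. `pcmBody` here is a hypothetical helper, assuming the canonical 44-byte header layout:

```javascript
// Return the PCM body of an audio chunk. Only the first chunk of a wav
// stream starts with the 44-byte canonical RIFF header; later chunks are
// raw PCM. (Per the maintainer's comment, vosk also tolerates the header.)
function pcmBody(chunk) {
  const hasHeader = chunk.length >= 44
    && chunk.toString('ascii', 0, 4) === 'RIFF';
  return hasHeader ? chunk.slice(44) : chunk;
}
```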

jrichardsz commented 5 months ago

Ok, I will keep the full wav file.

Comparing the wav from the socket (does not work) vs. from the microphone (works in the sample), I found this:

From socket

@SocketIoEvent(eventName = "send-audio")
this.sendAudio = async (message, currentSocket, globalSocket) => {

    console.log("0  >  4 : "+message.slice(0,4).toString())
    console.log("8  > 12 : "+message.slice(8,12).toString())
    console.log("12 > 14 : "+message.slice(12,14).toString())
    console.log("36 > 40 : "+message.slice(36,40).toString())
    console.log("45 > end:", message.slice(45))

Output:

(screenshot omitted)

The output indicates that the buffer received in the socket event is a valid wav file.

From Microphone (vosk sample)

micInputStream.on('data', async (data) => {    
  if (rec.acceptWaveform(data)){

      console.log("0  >  4 : "+data.slice(0,4).toString())
      console.log("8  > 12 : "+data.slice(8,12).toString())
      console.log("12 > 14 : "+data.slice(12,14).toString())
      console.log("36 > 40 : "+data.slice(36,40).toString())
      console.log("45 > end:", data.slice(45))

      console.log("data:"+JSON.stringify(rec.result()));

Output

(screenshot omitted)

The buffer received from the microphone (https://www.npmjs.com/package/mic) is not a valid wav file, but it works with vosk.

jrichardsz commented 5 months ago

I don't know if it helps, but both the object returned by the microphone (vosk sample) and the one received from the socket are Uint8Array instances.
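As an aside, if the incoming data ever arrives as a plain Uint8Array rather than a Node Buffer, it can be wrapped without copying before being passed along, using the standard `Buffer.from(arrayBuffer, byteOffset, length)` overload (`toBuffer` is a hypothetical helper name):

```javascript
// Wrap a Uint8Array view in a Node Buffer without copying the bytes.
// The returned Buffer shares memory with the original typed array.
function toBuffer(u8) {
  return Buffer.isBuffer(u8)
    ? u8
    : Buffer.from(u8.buffer, u8.byteOffset, u8.byteLength);
}
```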

nshmyrev commented 5 months ago

Dump the data you receive from both the microphone and the socket to a file and share it here, please

jrichardsz commented 5 months ago

I will dump the data. In the meantime, I prepared a reproducible sample:

https://github.com/jrichardsz/nodejs-wav-vosk-transcription

As a summary:

Thank you very much for your kind help

jrichardsz commented 3 months ago

I tried the same thing with another library and it works:

https://github.com/solyarisoftware/voskJs