node.js Recognizer with SpeakerModel no X vector

damirn commented 3 years ago

It seems that a Recognizer created with a SpeakerModel no longer produces an x-vector:

const fs = require('fs');
const wav = require('wav');
const { Model, Recognizer, SpeakerModel } = require('vosk');
const { Readable } = require('stream');

const model = new Model('model');
const spkModel = new SpeakerModel('model-spk');
const wfStream = fs.createReadStream('recording.wav', { highWaterMark: 4096 });
const wfReader = new wav.Reader();
const wfReadable = new Readable().wrap(wfReader);

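wfReader.on('format', async ({ audioFormat, sampleRate, channels }) => {
  // Only uncompressed (PCM, audioFormat 1) mono audio is accepted.
  if (audioFormat != 1 || channels != 1) {
    console.error('Audio file must be WAV format mono PCM.');
    process.exit(1);
  }
  // Passing speakerModel asks the recognizer to also extract a speaker vector.
  const rec = new Recognizer({ model, speakerModel: spkModel, sampleRate });
  for await (const data of wfReadable) {
    // Truthy once the recognizer detects the end of an utterance.
    const endOfSpeech = await rec.acceptWaveform(data);
    if (endOfSpeech) {
      console.log(await rec.finalResult());
    } else {
      console.log(await rec.partialResult());
    }
  }
  // Flush whatever audio is still buffered, then release native resources.
  console.log(await rec.finalResult());
  rec.free();
});
wfStream.pipe(wfReader);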

This returns the following as the final result:

{
  result: [ { conf: 0.403675, end: 0.66, start: 0.33, word: 'lol' } ],
  text: 'lol'
}

Am I missing something here?

sadrasabouri commented 3 years ago

Hi @damirn. I've tested your code after this fix by @nshmyrev and it worked well for me. I wonder if the problem is the duration of your input file, i.e. that it is not long enough. Check the following example:

sadra.wav, a 5-second recording of my voice :smile: (don't worry about the transcription, the recording is in Farsi):

LOG (VoskAPI:ReadDataFiles():model.cc:194) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:197) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.046181 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:221) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:251) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:273) Loading winfo model/graph/phones/word_boundary.int
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.relu is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.batchnorm is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.affine is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.log-softmax is never used to compute any output.
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 4 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 4 orphan components.
LOG (VoskAPI:Collapse():nnet-utils.cc:1488) Added 0 components, removed 4
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: 'hope' }
{ partial: 'hope' }
{ partial: 'hope solemnly' }
{ partial: 'hope solemnly' }
{ partial: 'hope salama' }
{ partial: 'hope salama' }
{ partial: 'hope solemnly affirming' }
{ partial: 'hope solemnly affirming' }
{ partial: 'hope salama' }
{ partial: 'hope salama' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for as you' }
{ partial: 'hope salama for as you' }
{ partial: 'hope salama for as you gonna' }
{ partial: 'hope salama for as you gonna' }
{ partial: "hope salama for as you're not hungry" }
{ partial: "hope salama for as you're not hungry" }
{ partial: 'hope salama for as you on a homegrown' }
{ partial: 'hope salama for as you on a homegrown' }
{ partial: 'hope salama for as your honor hundred and' }
{ partial: "hope salama for as you don't know how go on gonna go" }
{ partial: "hope salama for as you don't know how go on gonna go" }
{
  partial: "hope salama for as you don't know how go on the regards to"
}
{
  partial: "hope salama for as you don't know how go on the regards to"
}
{ partial: "hope salama for as you don't know how go on gonna go" }
{ partial: "hope salama for as you don't know how go on gonna go" }
{
  partial: "hope salama for as you don't know how go on the regards to"
}
LOG (VoskAPI:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.0399 seconds taken in nnet3 compilation total (breakdown: 0.0323 compilation, 0.000912 optimization, 0 shortcut expansion, 0.000226 checking, 9.54e-07 computing indexes, 0.00648 misc.) + 0 I/O.
{
  result: [
    { conf: 0.869462, end: 0.78, start: 0.48, word: 'hope' },
    { conf: 0.59985, end: 1.080237, start: 0.78, word: 'salon' },
    { conf: 0.28908, end: 1.537344, start: 1.080237, word: 'lover' },
    { conf: 0.5553, end: 1.92, start: 1.537344, word: 'for' },
    { conf: 0.997495, end: 2.85027, start: 2.67, word: 'as' },
    { conf: 0.692167, end: 2.958102, start: 2.85027, word: 'you' },
    { conf: 0.304366, end: 3.24, start: 3.077199, word: 'know' },
    { conf: 0.304366, end: 3.48, start: 3.24, word: 'how' },
    { conf: 0.215769, end: 3.75, start: 3.492269, word: 'go' },
    { conf: 0.711996, end: 3.981307, start: 3.801307, word: 'on' },
    { conf: 0.763168, end: 4.2, start: 3.99, word: 'gonna' },
    { conf: 0.763168, end: 4.35, start: 4.2, word: 'go' },
    { conf: 0.578562, end: 4.41, start: 4.35, word: 'a' },
    { conf: 0.649735, end: 4.598866, start: 4.41, word: 'little' },
    { conf: 0.614146, end: 5.1, start: 4.598866, word: 'purposes' }
  ],
  spk: [
    -0.068649, -0.940878,  0.711734,  0.249812, -0.910555, -0.320302,
      2.00664, -0.770135, -0.222062,  1.286207,  0.518486,  -0.13821,
    -0.643341, -0.138837,  0.585025,  1.975515,  0.063531, -0.396259,
     0.296326, -2.507573,  1.975111,   0.48592,  1.285568,   0.85761,
      0.92632,  2.671811,  2.010355, -0.769467, -0.142065, -0.457111,
     1.122457, -0.013446, -0.039318,  0.000845,  0.755134,  0.221725,
     1.019091,   0.63321, -0.472993, -0.010439, -0.382082,  0.474217,
     2.271531, -0.845891,  0.552241, -0.963823,  0.594588,  0.728455,
      1.68869, -0.616291, -0.779519,  1.027915, -1.244267, -1.863834,
     0.673637, -1.371598, -0.970923,  0.424377, -0.730949,  1.398707,
    -1.577528,  1.080564,  0.807472, -0.238997,  0.621041, -1.032762,
    -1.547587,  0.967719,  0.094075,  0.372321,  0.347297, -0.606582,
    -1.892523,  2.610352, -1.373739,   0.78393, -0.155127, -0.743056,
     1.586355,  1.908507,  0.802888,  0.212882, -0.192002, -0.142355,
     0.406887,  0.938945, -1.139643, -0.740027,  0.198139, -0.909051,
     0.828613, -0.377265, -0.188476, -0.522048, -0.824033, -1.551933,
     0.264458,  0.060576, -0.269225,  1.358158,
    ... 28 more items
  ],
  spk_frames: 381,
  text: 'hope salon lover for as you know how go on gonna go a little purposes'
}

You can see the extracted x-vector for this speaker in the spk field.

sadra_1s.wav, which is the first second of the same recording (sadra.wav):
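Once the spk field is present, a common way to compare two speaker vectors is cosine distance. The snippet below is only a minimal sketch of that idea: `cosineDistance` is a hypothetical helper, not part of the vosk API, and the toy vectors stand in for real spk arrays.

```javascript
// Cosine distance between two x-vectors given as plain arrays of numbers.
// 0 means the vectors point the same way; larger values mean less similar.
// NOTE: hypothetical helper for illustration, not a vosk function.
function cosineDistance(a, b) {
  if (a.length !== b.length) {
    throw new Error('x-vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors standing in for two `spk` arrays (assumed values).
const enrolled = [1, 0, 0];
const observed = [0, 1, 0];
console.log(cosineDistance(enrolled, enrolled)); // 0 (same direction)
console.log(cosineDistance(enrolled, observed)); // 1 (orthogonal)
```

Any threshold for deciding "same speaker" would have to be tuned on your own recordings; none is suggested in this thread.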

LOG (VoskAPI:ReadDataFiles():model.cc:194) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:197) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.048708 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:221) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:251) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:273) Loading winfo model/graph/phones/word_boundary.int
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.relu is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.batchnorm is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.affine is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.log-softmax is never used to compute any output.
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 4 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 4 orphan components.
LOG (VoskAPI:Collapse():nnet-utils.cc:1488) Added 0 components, removed 4
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{
result: [
{ conf: 0.856251, end: 0.78, start: 0.48, word: 'hope' },
{ conf: 0.856251, end: 0.93, start: 0.78, word: 'so' }
],
text: 'hope so'
}

I think this is what is happening in your case: the file is too short, so no x-vector is returned. Try a longer file and let me know the result.
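Since the two outputs above differ only in whether the spk field is present, calling code may want to guard against its absence. A minimal sketch, assuming the result is a plain object shaped like the ones printed above (`getSpeakerVector` is a hypothetical helper, and the sample objects are abbreviated copies of those outputs):

```javascript
// Return the x-vector from a recognizer result, or null if none was produced.
// NOTE: hypothetical helper for illustration, not part of the vosk API.
function getSpeakerVector(result) {
  if (result && Array.isArray(result.spk) && result.spk.length > 0) {
    return result.spk;
  }
  return null;
}

// Abbreviated copies of the two final results shown in this thread.
const shortClip = { text: 'hope so' };
const longClip = { spk: [-0.068649, -0.940878, 0.711734], spk_frames: 381 };

for (const [name, res] of [['sadra_1s.wav', shortClip], ['sadra.wav', longClip]]) {
  const vec = getSpeakerVector(res);
  if (vec === null) {
    console.log(`${name}: no x-vector, the audio may be too short`);
  } else {
    console.log(`${name}: x-vector with ${vec.length} values`);
  }
}
```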

sadrasabouri commented 3 years ago

@nshmyrev Do you mind if I add some examples (like the Python examples) to the nodejs section in a PR? My plan is to have the same examples on each platform.

nshmyrev commented 3 years ago

> My plan is to have the same examples on each platform.

Thank you, it would be great!

damirn commented 3 years ago

@sadrasabouri You're right, my wav file was truncated; once I replaced it with a longer one, it worked for me too. I don't mind if you add this as an example; I was thinking of doing it myself.
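For anyone who wants to spot a truncated file early, the duration of uncompressed PCM audio follows directly from its byte count and format. This is a rough sketch only: `pcmDurationSeconds` is a hypothetical helper, and the 16-bit depth in the usage line is an assumption about the recording, not something stated in this thread.

```javascript
// Duration in seconds of raw PCM audio, given its size and format.
// NOTE: hypothetical helper for illustration, not part of vosk or wav.
function pcmDurationSeconds(byteLength, sampleRate, channels, bitDepth) {
  const bytesPerFrame = channels * (bitDepth / 8); // one sample per channel
  return byteLength / (sampleRate * bytesPerFrame);
}

// 160000 bytes of 16 kHz, mono, 16-bit PCM (assumed format).
console.log(pcmDurationSeconds(160000, 16000, 1, 16)); // 5
```

Summing the lengths of the chunks passed to acceptWaveform would give the byte count to feed into this helper.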

sadrasabouri commented 3 years ago

> @sadrasabouri You're right, my wav file was truncated; once I replaced it with a longer one, it worked for me too. I don't mind if you add this as an example; I was thinking of doing it myself.

Thank you very much. Sure, I'll work on this PR in my fork's nodejs_example branch.

Feel free to send a pull request with this example there (I think test_speaker.js would be a suitable name) so we can work on it together. You can also compare it with the Python example for additional features.

alphacep / vosk-api

node.js Recognizer with SpeakerModel no X vector #510