Closed damirn closed 3 years ago
Hi @damirn. I've tested your code after this fix by @nshmyrev and it worked well for me. I wonder if the problem is your input file's duration, i.e. the file may not be long enough. Check the following example:
sadra.wav
which is a 5-second recording of my voice :smile: (Don't worry about the transcription; the speech is in Farsi.)
LOG (VoskAPI:ReadDataFiles():model.cc:194) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:197) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.046181 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:221) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:251) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:273) Loading winfo model/graph/phones/word_boundary.int
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.relu is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.batchnorm is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.affine is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.log-softmax is never used to compute any output.
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 4 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 4 orphan components.
LOG (VoskAPI:Collapse():nnet-utils.cc:1488) Added 0 components, removed 4
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: 'hope' }
{ partial: 'hope' }
{ partial: 'hope solemnly' }
{ partial: 'hope solemnly' }
{ partial: 'hope salama' }
{ partial: 'hope salama' }
{ partial: 'hope solemnly affirming' }
{ partial: 'hope solemnly affirming' }
{ partial: 'hope salama' }
{ partial: 'hope salama' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for' }
{ partial: 'hope salama for as you' }
{ partial: 'hope salama for as you' }
{ partial: 'hope salama for as you gonna' }
{ partial: 'hope salama for as you gonna' }
{ partial: "hope salama for as you're not hungry" }
{ partial: "hope salama for as you're not hungry" }
{ partial: 'hope salama for as you on a homegrown' }
{ partial: 'hope salama for as you on a homegrown' }
{ partial: 'hope salama for as your honor hundred and' }
{ partial: "hope salama for as you don't know how go on gonna go" }
{ partial: "hope salama for as you don't know how go on gonna go" }
{
partial: "hope salama for as you don't know how go on the regards to"
}
{
partial: "hope salama for as you don't know how go on the regards to"
}
{ partial: "hope salama for as you don't know how go on gonna go" }
{ partial: "hope salama for as you don't know how go on gonna go" }
{
partial: "hope salama for as you don't know how go on the regards to"
}
LOG (VoskAPI:~CachingOptimizingCompiler():nnet-optimize.cc:710) 0.0399 seconds taken in nnet3 compilation total (breakdown: 0.0323 compilation, 0.000912 optimization, 0 shortcut expansion, 0.000226 checking, 9.54e-07 computing indexes, 0.00648 misc.) + 0 I/O.
{
result: [
{ conf: 0.869462, end: 0.78, start: 0.48, word: 'hope' },
{ conf: 0.59985, end: 1.080237, start: 0.78, word: 'salon' },
{ conf: 0.28908, end: 1.537344, start: 1.080237, word: 'lover' },
{ conf: 0.5553, end: 1.92, start: 1.537344, word: 'for' },
{ conf: 0.997495, end: 2.85027, start: 2.67, word: 'as' },
{ conf: 0.692167, end: 2.958102, start: 2.85027, word: 'you' },
{ conf: 0.304366, end: 3.24, start: 3.077199, word: 'know' },
{ conf: 0.304366, end: 3.48, start: 3.24, word: 'how' },
{ conf: 0.215769, end: 3.75, start: 3.492269, word: 'go' },
{ conf: 0.711996, end: 3.981307, start: 3.801307, word: 'on' },
{ conf: 0.763168, end: 4.2, start: 3.99, word: 'gonna' },
{ conf: 0.763168, end: 4.35, start: 4.2, word: 'go' },
{ conf: 0.578562, end: 4.41, start: 4.35, word: 'a' },
{ conf: 0.649735, end: 4.598866, start: 4.41, word: 'little' },
{ conf: 0.614146, end: 5.1, start: 4.598866, word: 'purposes' }
],
spk: [
-0.068649, -0.940878, 0.711734, 0.249812, -0.910555, -0.320302,
2.00664, -0.770135, -0.222062, 1.286207, 0.518486, -0.13821,
-0.643341, -0.138837, 0.585025, 1.975515, 0.063531, -0.396259,
0.296326, -2.507573, 1.975111, 0.48592, 1.285568, 0.85761,
0.92632, 2.671811, 2.010355, -0.769467, -0.142065, -0.457111,
1.122457, -0.013446, -0.039318, 0.000845, 0.755134, 0.221725,
1.019091, 0.63321, -0.472993, -0.010439, -0.382082, 0.474217,
2.271531, -0.845891, 0.552241, -0.963823, 0.594588, 0.728455,
1.68869, -0.616291, -0.779519, 1.027915, -1.244267, -1.863834,
0.673637, -1.371598, -0.970923, 0.424377, -0.730949, 1.398707,
-1.577528, 1.080564, 0.807472, -0.238997, 0.621041, -1.032762,
-1.547587, 0.967719, 0.094075, 0.372321, 0.347297, -0.606582,
-1.892523, 2.610352, -1.373739, 0.78393, -0.155127, -0.743056,
1.586355, 1.908507, 0.802888, 0.212882, -0.192002, -0.142355,
0.406887, 0.938945, -1.139643, -0.740027, 0.198139, -0.909051,
0.828613, -0.377265, -0.188476, -0.522048, -0.824033, -1.551933,
0.264458, 0.060576, -0.269225, 1.358158,
... 28 more items
],
spk_frames: 381,
text: 'hope salon lover for as you know how go on gonna go a little purposes'
}
You can see the extracted x-vector for this speaker in the spk field.
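Once two final results each carry an spk array, the vectors can be compared. Below is a minimal sketch in plain Node.js, assuming cosine distance as the metric (as far as I know, that is what the Python speaker example uses, but treat it as an assumption). The names cosineDistance and speakerVector are hypothetical helpers, not part of the Vosk API.

```javascript
// Hypothetical helper: cosine distance between two speaker vectors
// (plain arrays of numbers, e.g. the spk field of two final results).
// 0 means same direction, 1 means orthogonal, 2 means opposite.
function cosineDistance(a, b) {
  if (a.length !== b.length || a.length === 0) {
    throw new Error('vectors must be non-empty and of equal length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error('cannot compare an all-zero vector');
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical helper: return the x-vector from a final result object,
// or null when the result has no spk field (as with a very short clip).
function speakerVector(result) {
  return Array.isArray(result.spk) ? result.spk : null;
}
```

A smaller distance suggests the two clips are more likely from the same speaker. Any decision threshold would have to be tuned on your own data; none is given in this thread.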
sadra_1s.wav
which is the first second of my recording (sadra.wav)
LOG (VoskAPI:ReadDataFiles():model.cc:194) Decoding params beam=10 max-active=3000 lattice-beam=2
LOG (VoskAPI:ReadDataFiles():model.cc:197) Silence phones 1:2:3:4:5:6:7:8:9:10
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 0 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 0 orphan components.
LOG (VoskAPI:CompileLooped():nnet-compile-looped.cc:345) Spent 0.048708 seconds in looped compilation.
LOG (VoskAPI:ReadDataFiles():model.cc:221) Loading i-vector extractor from model/ivector/final.ie
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:183) Computing derived variables for iVector extractor
LOG (VoskAPI:ComputeDerivedVars():ivector-extractor.cc:204) Done.
LOG (VoskAPI:ReadDataFiles():model.cc:251) Loading HCL and G from model/graph/HCLr.fst model/graph/Gr.fst
LOG (VoskAPI:ReadDataFiles():model.cc:273) Loading winfo model/graph/phones/word_boundary.int
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.relu is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node tdnn6.batchnorm is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.affine is never used to compute any output.
WARNING (VoskAPI:Check():nnet-nnet.cc:789) Node output.log-softmax is never used to compute any output.
LOG (VoskAPI:RemoveOrphanNodes():nnet-nnet.cc:948) Removed 4 orphan nodes.
LOG (VoskAPI:RemoveOrphanComponents():nnet-nnet.cc:847) Removing 4 orphan components.
LOG (VoskAPI:Collapse():nnet-utils.cc:1488) Added 0 components, removed 4
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{ partial: '' }
{
result: [
{ conf: 0.856251, end: 0.78, start: 0.48, word: 'hope' },
{ conf: 0.856251, end: 0.93, start: 0.78, word: 'so' }
],
text: 'hope so'
}
I think that is what is happening in your case. Try a longer file and let me know the result.
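To rule out a too-short or truncated file before decoding, the duration can be estimated from the WAV header. This is only a sketch, assuming an uncompressed PCM RIFF/WAVE file already read into a Buffer; wavDurationSeconds is a hypothetical helper name, not something provided by Vosk.

```javascript
// Hypothetical helper: estimate the duration of a PCM WAV file held in a
// Buffer by walking its RIFF chunks (assumes an uncompressed RIFF/WAVE file).
function wavDurationSeconds(buf) {
  if (
    buf.length < 12 ||
    buf.toString('ascii', 0, 4) !== 'RIFF' ||
    buf.toString('ascii', 8, 12) !== 'WAVE'
  ) {
    throw new Error('not a RIFF/WAVE buffer');
  }
  let byteRate = null;
  let offset = 12; // first chunk starts right after the 12-byte RIFF header
  while (offset + 8 <= buf.length) {
    const id = buf.toString('ascii', offset, offset + 4);
    const size = buf.readUInt32LE(offset + 4);
    const body = offset + 8;
    if (id === 'fmt ') {
      // byteRate (bytes of audio per second) sits 8 bytes into the fmt chunk
      byteRate = buf.readUInt32LE(body + 8);
    } else if (id === 'data') {
      if (!byteRate) throw new Error('fmt chunk missing before data chunk');
      // A truncated file may declare more data than it actually holds,
      // so count only the bytes that are really present.
      const available = Math.min(size, buf.length - body);
      return available / byteRate;
    }
    offset = body + size + (size % 2); // chunks are padded to an even length
  }
  throw new Error('data chunk not found');
}
```

For example, `wavDurationSeconds(require('fs').readFileSync('sadra_1s.wav'))` would give the length of the short clip. The thread does not state a minimum duration for the spk field to appear; the only data points here are that the 5-second file produced it and the 1-second file did not.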
@nshmyrev Do you mind if I add some examples (like the Python examples) to the nodejs section in a PR? My plan is to have the same examples on each platform.
Thank you, it would be great!
@sadrasabouri You're right, my wav file was truncated; once I replaced it with a longer one it worked for me too. I don't mind if you add this as an example; I was thinking of doing it myself.
Thank you very much. Sure, I'll work on this PR in my fork's nodejs_example branch.
Feel free to send a pull request with this example there (I think test_speaker.js would be a suitable name) so we can work on it together.
You can also compare it with the Python example for additional features.
It seems that a Recognizer with a SpeakerModel no longer produces an x-vector:
returns as final:
Am I missing something here?