Does speaker recognition work without an acoustic model?

Hi,

I've been testing the Vosk small English model together with speaker recognition, and I think the results are pretty solid: it always found the correct speaker in a set of 7. It gets a bit more complicated if you use it for true/false tests (is this speaker A?), but either way I think it is an interesting feature 👍.
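For context, here is a rough sketch of the two kinds of test I mean. It assumes each utterance has already been turned into a speaker vector (the `spk` list in the recognizer's JSON result) and compares vectors by cosine distance. The helper names, the toy 3-dimensional vectors and the `0.5` threshold are all made up for illustration; a real threshold would have to be tuned on actual recordings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def identify(xvector, enrolled):
    """Closed-set test: return (name, distance) of the nearest enrolled speaker."""
    scored = ((name, cosine_distance(xvector, ref)) for name, ref in enrolled.items())
    return min(scored, key=lambda pair: pair[1])


def verify(xvector, reference, threshold=0.5):
    """True/false test: is the vector close enough to one reference speaker?

    The default threshold is a hypothetical placeholder, not a recommended value.
    """
    return cosine_distance(xvector, reference) < threshold


# Toy vectors only; real speaker vectors are much longer.
enrolled = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}
probe = [0.9, 0.1, 0.0]

name, distance = identify(probe, enrolled)
print(name, round(distance, 3))
print(verify(probe, enrolled["bob"]))
```

The closed-set case only needs the nearest speaker, which is probably why it was so reliable for me, while the true/false case depends entirely on where the threshold is set.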

I'd like to use this for a variety of languages, but I noticed that it requires a proper acoustic model to work. I could probably just keep using the small English model (or any other) and simply discard the resulting text, but would that lead to accuracy issues in other languages? And is there maybe a more efficient, generic acoustic model we could use for speaker recognition? I'm assuming the speaker recognition relies on some VAD or feature extraction done by the acoustic model. At least it didn't work when I built a tiny grammar model with just tokens 😅.
