How to adapt SALMONN to a speaker verification task?

We concatenate the two speech clips into a single input, with a short segment of silence inserted between them. We use the prompt: "Do you only hear the same person talking? Answer yes or no." The model then has to decide whether both clips were spoken by the same speaker.
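As a rough illustration of the input construction described above, here is a minimal sketch in plain Python. The 16 kHz sample rate, the 0.5 s silence length, and the helper names `concat_with_silence` and `parse_answer` are assumptions made for this example and are not taken from the SALMONN code. It also assumes both clips are mono and already share the same sample rate.

```python
from typing import List, Optional, Sequence

# Prompt quoted from the answer above.
PROMPT = "Do you only hear the same person talking? Answer yes or no."


def concat_with_silence(
    clip_a: Sequence[float],
    clip_b: Sequence[float],
    sample_rate: int = 16000,   # assumed value, not from the source
    silence_sec: float = 0.5,   # assumed value, not from the source
) -> List[float]:
    """Join two mono clips with a run of zero samples (silence) in between."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if silence_sec < 0:
        raise ValueError("silence_sec must be non-negative")
    n_silence = int(round(sample_rate * silence_sec))
    return list(clip_a) + [0.0] * n_silence + list(clip_b)


def parse_answer(text: str) -> Optional[bool]:
    """Map a yes/no reply to True/False; return None if it is neither."""
    words = text.strip().lower().split()
    if not words:
        return None
    first = words[0].strip(".,!?")
    if first == "yes":
        return True
    if first == "no":
        return False
    return None
```

The combined samples would then be saved as a single audio file and passed to the model together with `PROMPT`. How the audio and prompt are actually fed to the model depends on the inference script you use, so that step is left out here.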

However, I have to admit that the current model doesn't generalise well enough on speaker-related tasks, even on the simplest one, speaker verification 😞. It seems to recognise only the speakers in VoxCeleb1 well, and not any other speakers.
