Walleclipse / Deep_Speaker-speaker_recognition_system

Keras implementation of "Deep Speaker: an End-to-End Neural Speaker Embedding System" (speaker recognition)

voice sample length #20

Closed mangushev closed 4 years ago

mangushev commented 5 years ago

Hi,

It seems that 1.6 seconds is quite short. I see that in papers they use 3 or 5 seconds or even longer. But increasing the length, say 2 times to 3.2 seconds, results in 320 frames. With the convolutional model, that means averaging 20 embeddings at the end instead of 10. It feels like this averaging is not the best thing. To avoid it, extending the network would double the embedding size from 512 to 1024, as far as I can see.
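For concreteness, here is a rough sketch of the arithmetic I have in mind (the 10 ms frame hop and the temporal downsampling factor of 16 are my assumptions about the front-end, not values read from this repo's code):

```python
# Sketch of how clip length maps to input frames and to the number of
# frame-level embeddings averaged by the final pooling layer.
FRAME_HOP_SEC = 0.01   # assumed filterbank hop (10 ms per frame)
DOWNSAMPLE = 16        # assumed total stride of the convolutional front-end

def frames_and_pooled_steps(clip_seconds):
    """Return (input frames, frame-level embeddings that get averaged)."""
    num_frames = int(round(clip_seconds / FRAME_HOP_SEC))
    pooled_steps = num_frames // DOWNSAMPLE
    return num_frames, pooled_steps

print(frames_and_pooled_steps(1.6))  # (160, 10)
print(frames_and_pooled_steps(3.2))  # (320, 20)
```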

Please let me know your views.

Walleclipse commented 5 years ago

Hi,

The voice sample length really depends on the data set. In some data sets the valid voice length (excluding silence) is less than 3 seconds. If the valid voice is long enough, I think your suggestions are valuable, but I am not sure which option is better.

On the one hand, in the speaker embedding task we first embed the utterances of a speaker and then "summarize" them (averaging, in this paper) into a speaker-level embedding. So it is reasonable to use more utterance embeddings and "summarize" them into the speaker level. That is more robust to some abnormal utterances and gives a more statistical result.

On the other hand, doubling the embedding size can store more information. It may represent richer utterances for speakers, which can avoid losing some information.

I apologize that I have not evaluated these two methods. Maybe you can design some experiments on them; it would be interesting. Thanks for your suggestion!
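For what it's worth, a minimal sketch of what I mean by "summarize into speaker level", assuming plain mean pooling over 512-dim utterance embeddings (the L2 normalization at the end is just for illustration, not necessarily what this repo does):

```python
import numpy as np

def speaker_embedding(utt_embeddings):
    """Average utterance-level embeddings into one speaker-level embedding,
    then L2-normalize so cosine scoring does not depend on the clip count."""
    emb = np.mean(np.asarray(utt_embeddings), axis=0)
    return emb / (np.linalg.norm(emb) + 1e-12)

# e.g. three 512-dim utterance embeddings for one speaker
utts = [np.random.randn(512) for _ in range(3)]
spk = speaker_embedding(utts)
print(spk.shape)  # (512,)
```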