RF5 / simple-speaker-embedding

A speaker embedding network in PyTorch that is quick to set up and use for any purpose.
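
For orientation, a minimal quick-start sketch. The torch.hub entrypoint name (`convgru_embedder`) and the 16 kHz raw-waveform input are assumptions here, so check the repo README for the exact entrypoint names and expected sample rate:

```python
import torch

# Minimal usage sketch; the entrypoint name is an assumption, not confirmed
# by this thread -- see the repo README for the exact names.
model = torch.hub.load('RF5/simple-speaker-embedding', 'convgru_embedder')
model.eval()

wav = torch.randn(1, 16000)  # stand-in for ~1 s of (assumed) 16 kHz audio
with torch.no_grad():
    emb = model(wav)         # -> (1, embedding_dim) speaker embedding
print(emb.shape)
```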

ConvGRU Design, Dataset Size #3

Closed kradonneoh closed 1 year ago

kradonneoh commented 1 year ago

Hey!

I had a few questions regarding the choices made when designing the ConvGRU network, and I wanted to get your thoughts on possible extensions to the dataset.

For the ConvGRU network, why did you decide to go with raw waveforms as opposed to log-scale mel spectrograms (which often seem to be the first choice for convolutional-style embedding networks)? Did you experiment with both and find raw waveforms to be better? Also, did you ever try a fully convolutional approach, or one with transformer / self-attention blocks?
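
For context, the log-mel front end mentioned above is typically computed along these lines with torchaudio; the parameter values below are common illustrative defaults, not taken from this repo:

```python
import torch
import torchaudio

# Illustrative log-scale mel spectrogram front end. These hyperparameters
# (n_fft, hop_length, n_mels) are typical choices, not this repo's settings.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,
    hop_length=256,
    n_mels=80,
)

wav = torch.randn(1, 16000)           # stand-in for 1 s of 16 kHz audio
log_mel = torch.log(mel(wav) + 1e-6)  # (1, n_mels, frames), log-compressed
```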

In terms of the dataset, did you ever try multilingual data in addition to English data? I'm wondering whether adding non-English content would help the model ignore linguistic content even more than it already does.

RF5 commented 1 year ago

Hi!

For your questions:

Hope that helps!

RF5 commented 1 year ago

Closing for now since this seems inactive; feel free to re-open if you have any more questions.

kradonneoh commented 1 year ago

Thanks for the response! I did have a few more questions about training and implementation:

  1. How long did you train the ConvGRU model for, and on what hardware? Do you think it could benefit from more iterations, or did the validation loss plateau?
  2. If a user provides multiple utterances at inference time, does averaging the predicted embeddings improve performance (by EER), or is the gain over one-shot negligible?

(I couldn't find a way to re-open the issue, so I'm hoping you'll still get a notification for this)

RF5 commented 1 year ago

Ahh sure thing:

  1. 700k updates (as in the checkpoint filename) on a single 2070 SUPER GPU over a couple of weeks. I found the validation loss did more or less plateau at this stage, but it was not increasing, so it might still get a slight benefit from training longer?
  2. Typically yes, the speaker embedding is more stable if you average it over several utterances. However, I did not find this to be the case for all speakers, and it depends on the nature of each utterance (e.g. if one utterance is mostly shouting while the others are normal speech, the mean speaker embedding might be wonky). I didn't do detailed measurements on one-shot vs. few-shot averaging, though; see the sketch below for what averaging looks like.
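
To make point 2 concrete, here is a minimal sketch of few-shot averaging. The `model` interface (one utterance in, a `(1, dim)` embedding out) is a hypothetical stand-in for whatever embedder you use:

```python
import torch
import torch.nn.functional as F

def average_embedding(model, wavs):
    """Embed each utterance and mean-pool into one speaker embedding.

    `wavs` is a list of 1-D waveform tensors; `model` is any embedder
    returning a (1, dim) embedding per utterance (hypothetical interface).
    """
    with torch.no_grad():
        embs = torch.cat([model(w.unsqueeze(0)) for w in wavs], dim=0)
    # L2-normalise per utterance before averaging so a single loud or
    # atypical utterance (e.g. shouting) doesn't dominate the mean direction,
    # then normalise the mean itself for cosine-similarity scoring.
    embs = F.normalize(embs, dim=-1)
    return F.normalize(embs.mean(dim=0), dim=-1)
```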

Hope that helps!

RF5 commented 1 year ago

Closing again since this seems inactive; feel free to re-open if you have any more questions.