CODEJIN / SPEECHSPLIT

An implementation of SpeechSplit
https://codejin.github.io/SpeechSplit_Demo/index.html
MIT License

One-hot code speaker embedding and GE2E speaker embedding. #4

Open rishabhjain16 opened 3 years ago

rishabhjain16 commented 3 years ago

Hi,

Can you point me to the step where you calculate the speaker embeddings? Is it possible to use this approach for my own dataset? I wanted to try SpeechSplit on my own voice dataset, but I am stuck on the embedding calculation. I am not sure whether I can use a one-hot embedding or whether I need to calculate GE2E embeddings.

How does a one-hot embedding contain the speaker information? Can you explain how the one-hot speaker embedding and the GE2E speaker embedding each carry speaker identity? I have read the GE2E paper, but I can't figure out how one-hot embeddings would have a similar impact to GE2E embeddings.

Also, as mentioned in the future work, were you able to implement the GE2E speaker embedding for SpeechSplit?

Thanks in advance.

CODEJIN commented 3 years ago

Hi, rishabhjain16,

Thank you for contacting me. Unfortunately, this project has been stopped due to issues with the conversion quality.

The currently implemented models do not use a trained embedding. Instead, a one-hot vector of the form 1000000... or 0100000... is fed in directly. Of course, there is no information in the one-hot vector itself, but the weights of the decoder's bidirectional LSTM that first process this vector effectively become the speaker embedding.
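To make the mechanism concrete, here is a minimal NumPy sketch (sizes are hypothetical, and `W` stands in for the decoder's first weight matrix, not the actual repo code): multiplying a weight matrix by a one-hot vector simply selects one column, so each column of that matrix is a learned per-speaker vector.

```python
import numpy as np

# Hypothetical sizes for illustration: 4 speakers, 8-dim hidden state.
num_speakers, hidden = 4, 8
rng = np.random.default_rng(0)

# Stand-in for the first weight matrix applied to the speaker input
# (e.g. the LSTM input weights); shape (hidden, num_speakers).
W = rng.standard_normal((hidden, num_speakers))

# One-hot "embedding" for speaker index 2, i.e. 0010.
one_hot = np.zeros(num_speakers)
one_hot[2] = 1.0

# The matrix-vector product picks out column 2 of W, so the columns
# of W play the role of trainable speaker embeddings.
out = W @ one_hot
assert np.allclose(out, W[:, 2])
```

In other words, the one-hot scheme is equivalent to a lookup table of per-speaker vectors that is trained jointly with the model, whereas GE2E supplies an externally trained d-vector and can therefore generalize to speakers unseen during training.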

Originally I planned to create a GE2E-based version, but the plan was discontinued due to the limited performance of the one-hot version. Of course, I think the SpeechSplit concept has important implications for voice conversion. The research team that published SpeechSplit said this model is 'in progress', so I am looking forward to more improved models.

Best regards,

Heejo

rishabhjain16 commented 3 years ago

Hi @CODEJIN,

Thank you for your response, and thank you for clarifying my doubts. The SpeechSplit approach seems quite interesting to me in the way it extracts the individual components of speech and recombines them later for voice conversion. I guess I will wait for the improved model as well. In the meantime, I am also trying to figure out how to use SpeechSplit with my own dataset. It is quite difficult to reproduce with my own data, as the information provided in their GitHub repository is limited and only covers validation. So I am looking forward to their improved model as well.

Thanks for your help. Really appreciate it.

Kind Regards, Rishabh Jain