CODEJIN / AutoVC

MIT License

Are you still working on this repo? Need some help with my research. #2

Closed rishabhjain16 closed 3 years ago

rishabhjain16 commented 3 years ago

I am working on different VC methods and was wondering whether you are still working on this project. I could use some help with my research.

I also wanted to ask about the speaker encoder you are using to generate speaker embeddings for the AutoVC approach. Is it similar to their original speaker encoder (which they have not provided in their repo)? My plan is to use this approach in a multi-speaker environment, so I might have to train the speaker encoder from scratch.

Also, can you point me to the part where you calculate the loss for F0, as mentioned in their paper (https://arxiv.org/pdf/2004.07370.pdf)? I was also looking into how the fundamental frequency is calculated and couldn't really figure out how they compute their loss (that is, what the inputs to the loss are).

Any help is appreciated. Thank you for your time. @CODEJIN

CODEJIN commented 3 years ago

Hi, rishabhjain16,

Thank you for contacting me. Unfortunately, this project has been stopped due to issues with the conversion quality. Since then, I also worked on SpeechSplit, AutoVC's follow-up model, but that work has likewise stopped after replication only.

Anyway, first of all, regarding the speaker encoder: I believe I used the same approach, which obtains a speaker embedding as a d-vector trained with the GE2E loss. In terms of performance, however, a plain lookup table (that is, a learned per-speaker embedding) rather than a d-vector might work better for both AutoVC and SpeechSplit.

The F0 calculation is confusing for me too. The current repository uses the YIN algorithm, but other papers say that it is better to extract F0 with RAPT. The recommended flow for obtaining F0 without speaker information is as follows.

  1. Extract F0 with YIN or RAPT.
  2. Take the log of F0. Frames where F0 is 0 become -inf.
  3. Repeat steps 1 and 2 for all utterances of a speaker.
  4. Calculate the mean and standard deviation of the log F0 values over all utterances of the speaker, excluding the -inf values.
  5. Normalize all log F0 values of the speaker using the mean and standard deviation from step 4.
  6. Replace -inf with a very low value (e.g. -10).

This method is used by Assem-VC (https://arxiv.org/abs/2104.00931), and in my opinion it is a relatively specific and convincing process.
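
As a rough illustration, steps 2 to 6 of the flow above could be sketched in plain Python as follows. This is only a sketch under assumptions, not the code from this repository or from Assem-VC: it assumes F0 has already been extracted (step 1) as one list of Hz values per utterance, with 0 marking frames where no pitch was found. The function names, the `UNVOICED_FILL` constant, and the use of the population standard deviation are all illustrative choices.

```python
import math
from statistics import mean, pstdev

# Assumed placeholder for frames whose log F0 is -inf (step 6).
UNVOICED_FILL = -10.0


def log_f0(f0_hz):
    """Step 2: take the log of F0; frames with F0 <= 0 become -inf."""
    return [math.log(f) if f > 0 else -math.inf for f in f0_hz]


def speaker_stats(utterances_log_f0):
    """Step 4: mean and std of log F0 over all utterances, excluding -inf."""
    voiced = [v for utt in utterances_log_f0 for v in utt if math.isfinite(v)]
    if not voiced:
        raise ValueError("no voiced frames found for this speaker")
    sigma = pstdev(voiced)  # population std; an assumption of this sketch
    if sigma == 0:
        raise ValueError("log F0 has zero variance; cannot normalize")
    return mean(voiced), sigma


def normalize(track, mu, sigma, fill=UNVOICED_FILL):
    """Steps 5 and 6: z-score finite values, replace -inf with `fill`."""
    return [(v - mu) / sigma if math.isfinite(v) else fill for v in track]


def normalize_speaker(utterances_f0_hz):
    """Steps 2 to 6 for every utterance of a single speaker."""
    logs = [log_f0(utt) for utt in utterances_f0_hz]  # steps 2 and 3
    mu, sigma = speaker_stats(logs)                   # step 4
    return [normalize(track, mu, sigma) for track in logs]


if __name__ == "__main__":
    # Two toy utterances of one speaker; 0.0 marks frames without pitch.
    print(normalize_speaker([[100.0, 0.0, 200.0], [0.0, 400.0]]))
```

Because the statistics are computed per speaker, the normalized contour mostly keeps the relative pitch movement and drops the speaker's absolute pitch range, which is the point of "F0 without speaker information". Whether to use the population or the sample standard deviation is not specified in the thread.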

I hope this answer is helpful.

Best regards,

Heejo.

rishabhjain16 commented 3 years ago

Hi @CODEJIN,

Thanks for your reply. Your response has been very helpful. Also, the paper you mentioned, Assem-VC, seems to be quite state of the art. Their demo looks quite interesting.

Anyway, I will give the method you mentioned above a try. Thanks again for your help.

Kind Regards, Rishabh Jain