auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

How to generate mel spectrogram #4

Open nkcdy opened 5 years ago

nkcdy commented 5 years ago

With the same WaveNet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel-spectrogram in the provided metadata.pkl is much better than that of the one generated from my own mel-spectrogram. Are there any tricks for generating a proper mel-spectrogram?

auspicious3000 commented 5 years ago

num_mels: 80 fmin: 90 fmax: 7600 fft_size: 1024 hop_size: 256 min_level_db: -100 ref_level_db: 16
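For reference, here is a rough librosa-based sketch of how those hyperparameters could be applied to get a log-mel spectrogram. This is not the repo's actual preprocessing code; the dB conversion follows the common Tacotron/r9y9-style recipe, and the file name is just an example:

```python
import librosa
import numpy as np

num_mels, fmin, fmax = 80, 90, 7600
fft_size, hop_size = 1024, 256
ref_level_db = 16  # min_level_db = -100 is used later for [0, 1] normalization

# Load at 16 kHz, the rate used in the paper.
wav, sr = librosa.load("p225_001.wav", sr=16000)

# Linear-frequency magnitude spectrogram.
stft = np.abs(librosa.stft(wav, n_fft=fft_size, hop_length=hop_size))

# Mel filterbank restricted to [fmin, fmax], then mel-scale magnitudes.
mel_basis = librosa.filters.mel(sr=sr, n_fft=fft_size, n_mels=num_mels,
                                fmin=fmin, fmax=fmax)
mel = np.dot(mel_basis, stft)

# Amplitude -> dB, referenced to ref_level_db; normalization/clipping to
# [0, 1] with min_level_db is discussed further down this thread.
mel_db = 20 * np.log10(np.maximum(1e-5, mel)) - ref_level_db
```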

nkcdy commented 5 years ago

num_mels: 80 fmin: 90 fmax: 7600 fft_size: 1024 hop_size: 256 min_level_db: -100 ref_level_db: 16

Thanks a lot. The quality improved with the above hyperparameters when I generate the mel spectrogram, even if I use the default parameters to generate the waveform.

nkcdy commented 5 years ago

Another question is about the speaker embeddings. The speaker embedding in metadata.pkl is a scalar with 256 dimensions, but I got a matrix of size N*256 when I used the GE2E method to generate the speaker embeddings. What is the relationship between the scalar and the matrix?

auspicious3000 commented 5 years ago

The embedding in metadata.pkl should be a vector of length 256. The N you got might be the number of speakers.

nkcdy commented 5 years ago

Yes, the embedding in metadata.pkl is a vector of length 256. But I got several d-vectors of length 256 even when I use a single wave file (p225_001.wav). I did some normalization according to the GE2E paper (section 3.2): "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average". The result looks quite different from the vector in metadata.pkl: all numbers in my vector were positive, while the vector in metadata.pkl has both positive and negative values. Should I just average all the d-vectors without normalization?

auspicious3000 commented 5 years ago

You can average all the d-vectors without normalization.
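In NumPy terms, that is just an element-wise mean over the window-wise outputs (a tiny sketch; `window_dvectors` is a hypothetical (N, 256) array produced by the GE2E speaker encoder):

```python
import numpy as np

# window_dvectors: hypothetical (N, 256) array of window-wise d-vectors
# extracted from one utterance by the GE2E speaker encoder.
window_dvectors = np.random.randn(10, 256).astype(np.float32)  # placeholder

# Plain element-wise average, no L2 normalization of the windows.
utt_embedding = window_dvectors.mean(axis=0)  # shape: (256,)
```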

nkcdy commented 5 years ago

It didn't work... :(

I noticed that the sampling rate of the TIMIT corpus used in https://github.com/HarryVolek/PyTorch_Speaker_Verification is 16 kHz, while the sampling rate of the VCTK corpus is 48 kHz.

Should I re-train the d-vector network at the 48 kHz sampling rate?

auspicious3000 commented 5 years ago

The details are described in the paper.

nkcdy commented 5 years ago

The details are described in the paper.

I still cannot reproduce your results as shown in the demo; what I got was babble. The sampling rate of all the wave files has been changed to 16 kHz as described in your paper.

The network I used to generate the speaker embeddings was Janghyun1230's version (https://github.com/Janghyun1230/Speaker_Verification).

I noticed that the method used to generate the mel-spectrogram for the WaveNet vocoder is different from the one used for speaker verification, so I modified the source code of the speaker verification to match the WaveNet vocoder's mel-spectrogram and retrained the speaker embedding network. But it still doesn't work for AutoVC conversion.

I guess the reason lies in the method used to generate the speaker embeddings.

Can you give me some advice on that?

auspicious3000 commented 5 years ago

You are right. In this case, you have to retrain the model using your speaker embeddings.

lhppom commented 5 years ago

num_mels: 80 fmin: 90 fmax: 7600 fft_size: 1024 hop_size: 256 min_level_db: -100 ref_level_db: 16

Thanks a lot. The quality improved with the above hyperparameters when I generate the mel spectrogram, even if I use the default parameters to generate the waveform.

Do you clip the mel spectrogram to a specific range, such as [-1, 1] or something else? Thanks!

auspicious3000 commented 5 years ago

Clip to [0,1]

liveroomand commented 5 years ago

Clip to [0,1]

How is the mel spectrogram clipped to [0,1]? What algorithm or method did you use?

xw1324832579 commented 5 years ago

@auspicious3000 Can you please release your code for generating the speaker embeddings? I have the same problem as @liveroomand: I can't reproduce your embedding results for p225, p228, p256, and p270. Retraining the model costs a lot of time. Or please release all the parameters you set when training the speaker embeddings. Thank you.

auspicious3000 commented 5 years ago

@xw1324832579 You can use one-hot embeddings if you are not doing zero-shot conversion. Retraining takes less than 12 hours on a single GPU.
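If one-hot embeddings are enough for your setup, they are trivial to build (a sketch; `num_speakers` and `spk_index` are placeholders, and the model's speaker-embedding size would have to be changed from 256 to `num_speakers`):

```python
import numpy as np

num_speakers = 4   # e.g. if you only train on p225, p228, p256, p270
spk_index = 2      # index of the current training speaker

# One-hot vector used in place of the learned 256-dim d-vector.
spk_emb = np.eye(num_speakers, dtype=np.float32)[spk_index]
```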

liveroomand commented 5 years ago

@auspicious3000 Are the features (80-mel) used for the speaker embedding and for content extraction (the encoder input) the same?

auspicious3000 commented 5 years ago

They don't have to be the same.

liveroomand commented 5 years ago

Would this be a reasonable way to generate the speaker-encoder mel spectrogram, e.g. num_mels: 40, fmin: 90, fmax: 7600, window length: 0.025 s, hop length: 0.01 s, without clipping to [0,1]?

auspicious3000 commented 5 years ago

@liveroomand Looks fine. You can refer to r9y9's wavenet vocoder for more details on spectrogram normalization and clipping.
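The normalization/clipping step in that style of preprocessing is roughly the following (a sketch of the commonly used recipe with `min_level_db = -100`, not a verbatim copy of r9y9's code):

```python
import numpy as np

def normalize_mel(mel_db, min_level_db=-100):
    # Map dB-scale mel values from [min_level_db, 0] onto [0, 1] and clip.
    return np.clip((mel_db - min_level_db) / -min_level_db, 0.0, 1.0)
```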

liveroomand commented 5 years ago

Do you mean that the mel spectrogram used to pre-train the speaker encoder also needs to be clipped to [0, 1]?

auspicious3000 commented 5 years ago

@liveroomand Yes, in our case. But you can design your own speaker encoder or just use a one-hot embedding.

smalissa commented 5 years ago

Hi all, can you help me please? I have my own dataset. How do I process this data, and how can I build my models to get my own wav audio? Thanks.

miaoYuanyuan commented 5 years ago

num_mels: 80 fmin: 90 fmax: 7600 fft_size: 1024 hop_size: 256 min_level_db: -100 ref_level_db: 16

Are these params suitable for other datasets? When I switch to my own dataset, the quality is not as good as on VCTK. Is the reason that the WaveNet vocoder was trained on VCTK and needs to be trained again? Could you give some advice on other datasets? Thanks.

auspicious3000 commented 5 years ago

@miaoYuanyuan For other datasets, you need to tune the parameters of the conversion model instead of the parameters of the features.

smalissa commented 5 years ago

@miaoYuanyuan Please, can you tell me what you did to get your result? Can you guide me through what you did? I would be thankful.

miaoYuanyuan commented 5 years ago

@miaoYuanyuan For other datasets, you need to tune the parameters of the conversion model instead of the parameters of the features.

Thanks. Do you mean the WaveNet vocoder or the AutoVC conversion model?

auspicious3000 commented 5 years ago

@miaoYuanyuan If you change the parameters of the features, you will need to retrain the WaveNet vocoder as well.

miaoYuanyuan commented 5 years ago

Thank you! I got it.

miaoYuanyuan commented 5 years ago

@miaoYuanyuan Please, can you tell me what you did to get your result? Can you guide me through what you did? I would be thankful.

From wavs to mel spectrogram: refer to how preprocess.py in the wavenet_vocoder folder processes the audio to get the mel spectrogram you want. I haven't done the voice conversion yet, so I can't give you advice on that.
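A quick sanity check on the generated features might look like this (the path is a placeholder; the expected shape and value range follow from the settings discussed earlier in this thread):

```python
import numpy as np

# Hypothetical check on one preprocessed utterance before feeding it to AutoVC.
mel = np.load("p225_001.npy")   # placeholder path to a saved mel spectrogram
print(mel.shape)                # one axis should be num_mels = 80, e.g. (T, 80)
print(mel.min(), mel.max())     # expect values within [0, 1] after clipping
```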

smalissa commented 5 years ago

@miaoYuanyuan Thanks for your reply, but could you tell me what you have done so far and what the steps are, so that I can understand? I am confused by the details.

smalissa commented 5 years ago

@miaoYuanyuan Can you tell me what the aim of the preprocess.py file is? Can you guide me to the starting point, i.e. where I should start? Thanks.

miaoYuanyuan commented 5 years ago

@miaoYuanyuan Can you tell me what the aim of the preprocess.py file is? Can you guide me to the starting point, i.e. where I should start? Thanks.

This is the code; I hope it can help you: https://github.com/miaoYuanyuan/gen_melSpec_from_wav

KnurpsBram commented 4 years ago

Thanks @miaoYuanyuan for making the preprocessing steps clear! I wanted to experiment with AutoVC and the WaveNet vocoder separately, and found this thread really useful. In the end I put my experiments in a notebook and made a git repo of it. It could be useful for those of you who are in the shoes of me-a-week-ago.

https://github.com/KnurpsBram/AutoVC_WavenetVocoder_GriffinLim_experiments