nkcdy opened this issue 5 years ago
num_mels: 80 fmin: 90 fmax: 7600 fft_size: 1024 hop_size: 256 min_level_db: -100 ref_level_db: 16
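For reference, here is a minimal sketch of how these hyperparameters might map onto a librosa-based log-mel extraction; the authoritative version is the preprocessing in r9y9's wavenet_vocoder, and the 16 kHz input and file name below are assumptions:

```python
import librosa
import numpy as np

# Assumed settings matching the hyperparameters above; 16 kHz input is an assumption.
sr, fft_size, hop_size = 16000, 1024, 256
num_mels, fmin, fmax = 80, 90, 7600
ref_level_db = 16

# Load audio at the assumed sampling rate (file name is illustrative).
y, _ = librosa.load("p225_001.wav", sr=sr)

# Magnitude STFT.
spec = np.abs(librosa.stft(y, n_fft=fft_size, hop_length=hop_size, win_length=fft_size))

# Project onto an 80-band mel filterbank restricted to [fmin, fmax].
mel_basis = librosa.filters.mel(sr=sr, n_fft=fft_size, n_mels=num_mels, fmin=fmin, fmax=fmax)
mel = np.dot(mel_basis, spec)

# Amplitude -> dB, offset by the reference level; the [0,1] normalization
# using min_level_db is discussed later in this thread.
mel_db = 20.0 * np.log10(np.maximum(1e-5, mel)) - ref_level_db
```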
Thanks a lot. The quality improved when I generated the mel spectrogram with the above hyperparameters, even though I used the default parameters to generate the waveform.
Another question is about the speaker embeddings. The speaker embedding in metadata.pkl is a single 256-dimensional entry, but I get a matrix of size N*256 when I use the GE2E method to generate the speaker embeddings. What is the relationship between the vector and the matrix?
The embedding in metadata.pkl should be a vector of length 256. The N you got might be the number of speakers.
Yes, the embedding in metadata.pkl is a vector of length 256. But I got several d-vectors of length 256 even when I used a single wave file (p225_001.wav). I did some normalization according to the GE2E paper (section 3.2): "the final utterance-wise d-vector is generated by L2 normalizing the window-wise d-vectors, then taking the element-wise average". The result looks quite different from the vector in metadata.pkl: all numbers in my vector were positive, while the vector in metadata.pkl has both positive and negative values. Should I just average all the d-vectors without normalization?
You can average all the d-vectors without normalization.
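A minimal sketch of the two averaging options discussed here, assuming `d_vectors` is the (N, 256) array of window-wise GE2E outputs (the array below is only a placeholder):

```python
import numpy as np

# Placeholder standing in for the window-wise GE2E outputs of one utterance.
d_vectors = np.random.randn(12, 256).astype(np.float32)  # shape (N, 256)

# Option suggested above: plain element-wise average, no per-window normalization.
utt_embedding = d_vectors.mean(axis=0)

# GE2E-paper variant (section 3.2): L2-normalize each window-wise d-vector, then average.
normed = d_vectors / np.linalg.norm(d_vectors, axis=1, keepdims=True)
utt_embedding_ge2e = normed.mean(axis=0)
```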
It didn't work... :(
I noticed that the sampling rate of the TIMIT corpus used in https://github.com/HarryVolek/PyTorch_Speaker_Verification is 16 kHz, while the sampling rate of the VCTK corpus is 48 kHz.
Should I re-train the d-vector network at the 48 kHz sampling rate?
The details are described in the paper.
I still cannot reproduce the results shown in the demo; what I got was babble. The sampling rate of all the wave files has been changed to 16 kHz as described in your paper.
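For anyone following along, a sketch of the 48 kHz to 16 kHz downsampling step, assuming librosa and soundfile (the VCTK path is illustrative):

```python
import librosa
import soundfile as sf

# Load at the native rate (48 kHz for VCTK), then resample to 16 kHz.
y, sr = librosa.load("VCTK-Corpus/wav48/p225/p225_001.wav", sr=None)
y_16k = librosa.resample(y, orig_sr=sr, target_sr=16000)
sf.write("p225_001_16k.wav", y_16k, 16000)
```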
The network I used to generate the speaker embeddings was Janghyun1230's version (https://github.com/Janghyun1230/Speaker_Verification).
I noticed that the method used to generate the mel spectrogram in the wavenet vocoder is different from the one in the speaker verification code. So I modified the source code of the speaker verification to match the mel spectrogram of the wavenet vocoder and retrained the speaker embedding network, but it still doesn't work for AutoVC conversion.
I guess the reason lies in the method used to generate the speaker embeddings.
Can you give me some advice on that?
You are right. In this case, you have to retrain the model using your speaker embeddings.
Do you clip the mel spectrogram to a specific range, such as [-1, 1] or something else? Thanks!
Clip to [0,1]
How is the mel spectrogram clipped to [0, 1]? What algorithm or method did you use?
@auspicious3000 Can you please release your code for generating the speaker embeddings? I have the same problem as @liveroomand: I can't reproduce your embedding results for p225, p228, p256 and p270, and retraining the model costs a lot of time. Alternatively, please release all the parameters you set when training the speaker embeddings. Thank you.
@xw1324832579 You can use a one-hot embedding if you are not doing zero-shot conversion. Retraining takes less than 12 hours on a single GPU.
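A minimal sketch of what a one-hot speaker embedding could look like for a fixed speaker set (the speaker list below is illustrative, not taken from the repo):

```python
import numpy as np

# Illustrative closed set of training speakers; any fixed ordering works.
speakers = ["p225", "p228", "p256", "p270"]
one_hot = {spk: np.eye(len(speakers), dtype=np.float32)[i] for i, spk in enumerate(speakers)}

# e.g. one_hot["p228"] -> array([0., 1., 0., 0.], dtype=float32)
# This can stand in for the 256-dim GE2E d-vector only when all target speakers are seen in training.
```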
@auspicious3000 Are the features (80-band mel) used for the speaker embedding and for the content/text extraction (the encoder input) the same?
They don't have to be the same.
How should the speaker mel spectrogram be generated? E.g. num_mels: 40, fmin: 90, fmax: 7600, window length: 0.025 s, hop length: 0.01 s, without clipping to [0, 1]?
@liveroomand Looks fine. You can refer to r9y9's wavenet vocoder for more details on spectrogram normalization and clipping.
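As a concrete reference, this is roughly what the Tacotron-style normalization in r9y9's wavenet_vocoder does; the sketch assumes min_level_db = -100 and ref_level_db = 16 from earlier in the thread:

```python
import numpy as np

min_level_db = -100
ref_level_db = 16

def amp_to_db(x):
    # Amplitude -> decibels with a small floor to avoid log(0).
    return 20.0 * np.log10(np.maximum(1e-5, x))

def normalize(S_db):
    # Map [min_level_db, 0] dB onto [0, 1] and clip anything outside that range.
    return np.clip((S_db - min_level_db) / -min_level_db, 0.0, 1.0)

# Placeholder mel magnitudes (num_mels x frames) to show the call order.
mel = np.abs(np.random.rand(80, 120))
mel_01 = normalize(amp_to_db(mel) - ref_level_db)  # values now lie in [0, 1]
```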
Do you mean that the mel spectrogram used by the pre-trained speaker encoder also needs to be clipped to [0, 1]?
@liveroomand Yes, in our case. But you can design your own speaker encoder or just use a one-hot embedding.
Hi all, can you help me please? I have my own dataset. How do I process this data, and how can I build the models to get my own wav audio? Thanks.
num_mels: 80 fmin: 90 fmax: 7600 fft_size: 1024 hop_size: 256 min_level_db: -100 ref_level_db: 16
Are these parameters suitable for other datasets? When I switch to my own dataset, the output quality is not as good as with VCTK. Is the reason that you retrained the wavenet vocoder on VCTK? Could you give some advice for other datasets? Thanks.
@miaoYuanyuan For other datasets, you need to tune the parameters of the conversion model rather than the feature parameters.
@miaoYuanyuan Please, can you tell me what you did to get your results? Can you guide me through your steps? Thank you.
Thanks. Do you mean the wavenet vocoder or the AutoVC conversion model?
@miaoYuanyuan If you change the feature parameters, you will need to retrain the wavenet vocoder as well.
Thank you! I got it.
From wavs to mel spectrogram: refer to how preprocess.py in the wavenet_vocoder folder processes the audio to get the mel spectrogram you want. I haven't done the voice conversion part yet, so I can't give you advice on that.
@miaoYuanyuan Thanks for your reply, but could you explain what you have done so far and what the steps are? I'm still confused by the details.
@miaoYuanyuan Can you tell me what the aim of the preprocess.py file is? Can you point me to where I should start? Thanks.
This is the code; I hope it can help you: https://github.com/miaoYuanyuan/gen_melSpec_from_wav
Thanks @miaoYuanyuan for making the preprocessing steps clear! I wanted to experiment with AutoVC and the wavenet vocoder separately, and found this thread really useful. In the end I put my experiments in a notebook and made a git repo of it. It could be useful for those of you who are in the shoes of me-a-week-ago.
https://github.com/KnurpsBram/AutoVC_WavenetVocoder_GriffinLim_experiments
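For a quick Griffin-Lim sanity check of a converted mel, independent of the wavenet vocoder, something like the following works; it assumes the [0, 1] dB normalization from earlier in this thread and uses librosa's mel inversion, with a placeholder standing in for the actual converted mel:

```python
import librosa
import numpy as np

min_level_db, ref_level_db = -100, 16
sr, fft_size, hop_size, fmin, fmax = 16000, 1024, 256, 90, 7600

def denormalize(mel_01):
    # Invert the [0, 1] normalization back to linear amplitudes.
    db = mel_01 * -min_level_db + min_level_db + ref_level_db
    return np.power(10.0, db / 20.0)

# Placeholder standing in for a converted mel spectrogram (num_mels x frames).
mel_01 = np.random.rand(80, 120).astype(np.float32)

# Griffin-Lim reconstruction via librosa's mel inversion; power=1.0 because the
# denormalized values are amplitudes, not a power spectrogram.
y_hat = librosa.feature.inverse.mel_to_audio(
    denormalize(mel_01), sr=sr, n_fft=fft_size, hop_length=hop_size,
    fmin=fmin, fmax=fmax, power=1.0)
```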
With the same wavenet model and the same utterance (p225_001.wav), I found that the quality of the waveform generated from the mel spectrogram in the provided metadata.pkl is much better than the one generated from my own mel spectrogram. Is there any trick to generating a proper mel spectrogram?