auspicious3000 / autovc

AutoVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss
https://arxiv.org/abs/1905.05879
MIT License

F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder #50

Open tebin opened 4 years ago

tebin commented 4 years ago

I'm trying to improve the model by implementing the pitch conditioning introduced in https://arxiv.org/abs/2004.07370. However, the process of producing the normalized quantized log-F0 is a bit confusing, as there is more than one way to compute the mean µ and standard deviation σ.

A sample's pitch vector is a 1-D array of size n, where n is the number of frames (which seems to be fixed at 128 according to https://github.com/auspicious3000/autovc/issues/6#issuecomment-509202251). So there are three ways of computing µ and σ:

Suppose f0 is extracted from a sample audio of speaker A.

  1. Compute µ and σ of each individual sample on the fly (f0_norm = (f0 - f0.mean()) / f0.std() / 4).
  2. Compute µ and σ per speaker (f0_norm = (f0 - f0s.mean()) / f0s.std() / 4, where f0s is an A x 128 array and A is the total number of samples from speaker A).
  3. Compute a universal µ and σ over all samples (f0_norm = (f0 - f0s.mean()) / f0s.std() / 4, where f0s is an N x 128 array and N is the total number of samples across all speakers, so A < N).

And assuming the answer is 2 or 3: for unseen-to-seen or unseen-to-unseen conversion, am I correct that µ and σ should be stored somewhere so I can reuse those values at inference? (I guess option 2 doesn't really make sense there, since you can't compute those statistics for unseen speakers.)

auspicious3000 commented 4 years ago

The answer is 2. You will need µ and σ for inference. However, for an unseen speaker, you can normalize using that speaker's own µ and σ, which is not a bad approximation.
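
Below is a minimal sketch of option 2, assuming F0 contours stored as NumPy arrays with 0 at unvoiced frames; the log comes from the "quantized log-F0" mentioned above, the /4 scaling from the formulas in the question, and `speaker_f0_stats` / `normalize_f0` are hypothetical helper names, not code from this repo:

```python
import numpy as np

def speaker_f0_stats(f0_list):
    # Per-speaker statistics (option 2), pooled over all of that speaker's
    # utterances; unvoiced frames (f0 == 0) are excluded from the stats.
    voiced = np.concatenate([f0[f0 > 0] for f0 in f0_list])
    log_f0 = np.log(voiced)
    return log_f0.mean(), log_f0.std()

def normalize_f0(f0, mu, sigma):
    # Normalize one utterance with the stored speaker-level stats,
    # keeping unvoiced frames at zero.
    out = np.zeros_like(f0, dtype=np.float64)
    v = f0 > 0
    out[v] = (np.log(f0[v]) - mu) / sigma / 4
    return out
```

At inference, the stored (µ, σ) of each seen speaker are reused; for an unseen speaker, stats computed from their own utterances stand in, as described above.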

tebin commented 4 years ago

@auspicious3000 Thanks for the response! As a followup question, could you confirm whether the following pipeline for data augmentation is correct?

Since we are now using randomly cropped audio segments, I suppose the previous requirement of 128 fixed-length frames no longer holds as long as the lengths are multiples of freq=32, so we instead zero-pad segments to match the length of the longest segment in the batch.

My concern is mostly about the order in which augmentation steps are performed.

  1. Each segment has different factors. For each audio in the batch:
     - Draw a random number L ~ U(1, 3).
     - Split the audio into (audio length / L) segments. <-- not sure what to do if the final segment is shorter than L?
     - For each segment:
       - Compress or stretch the segment using a factor of 0.7 - 1.35.
       - Change the signal power between 10% and 100%.

  2. Segments from the same audio share the same factors. For each audio in the batch:
     - Compress or stretch the audio using a factor of 0.7 - 1.35.
     - Change the signal power between 10% and 100%.
     - Draw a random number L ~ U(1, 3).
     - Split the audio into (audio length / L) segments. <-- not sure what to do if the final segment is shorter than L?

auspicious3000 commented 4 years ago

There is no need to split the audio. The post-processed length is the same within a batch. Just index from the spectrogram. For example, [0, 0.5, 1, 1.5] and [0, 2, 4, 6] are two instances in the same batch with length=4, where the former is stretched and the latter is compressed.
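
A minimal sketch of that indexing, assuming mel spectrograms as NumPy arrays of shape (frames, n_mels); linear interpolation is one way to handle fractional indices like 0.5, and `stretch_spectrogram` is a hypothetical helper name:

```python
import numpy as np

def stretch_spectrogram(spec, factor, out_len):
    # Warp the time axis by sampling frames at [0, factor, 2*factor, ...]:
    # factor=0.5 stretches ([0, 0.5, 1, 1.5]), factor=2 compresses ([0, 2, 4, 6]).
    idx = np.arange(out_len) * factor
    idx = idx[idx <= spec.shape[0] - 1]              # stay inside the source
    lo = np.floor(idx).astype(int)
    hi = np.minimum(lo + 1, spec.shape[0] - 1)
    frac = (idx - lo)[:, None]
    return (1 - frac) * spec[lo] + frac * spec[hi]   # interpolate between frames
```

With a shared `out_len` per batch, both instances in the example come out the same length, so no splitting or per-segment handling is needed.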

Miralan commented 4 years ago

Recently, I have been trying to improve the original AutoVC using F0 information. Using 256-dimensional one-hot vectors in the original AutoVC seems to perform well, but in the improved model I found that a 256-dimensional one-hot vector gives a very low MOS for the speech. I want to know whether a one-hot vector can still be used in the F0-based improvement, given that zero-shot conversion is not needed.

auspicious3000 commented 4 years ago

@Miralan Yes. If you have N speakers, just use N-dimensional one-hot embedding.
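
A minimal sketch of that swap, assuming the decoder is conditioned on a fixed-size speaker vector; `num_speakers` is an assumption you would set to your own speaker count:

```python
import torch
import torch.nn.functional as F

num_speakers = 20                # assumption: your training set's speaker count
speaker_id = torch.tensor([7])   # index of the target speaker
emb = F.one_hot(speaker_id, num_classes=num_speakers).float()
# Feed `emb` wherever the 256-dim d-vector speaker embedding went; the
# conditioning layers' input size changes from 256 to num_speakers.
```

Note this gives up zero-shot conversion, since an unseen speaker has no one-hot index.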

Miralan commented 4 years ago

So if I time-stretch or compress the mel spectrogram, should I apply the same time-stretching or compression to the fundamental frequency sequence?

auspicious3000 commented 4 years ago

Yes
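
A minimal sketch of warping both streams together, assuming mel and F0 share the same frame rate; nearest-frame indexing (rather than interpolation) keeps the unvoiced zeros in the F0 contour from being smeared, and `stretch_aligned` is a hypothetical helper name:

```python
import numpy as np

def stretch_aligned(mel, f0, factor):
    # One index array warps both streams, so mel and F0 stay time-aligned.
    out_len = int(mel.shape[0] / factor)
    idx = np.round(np.arange(out_len) * factor).astype(int)
    idx = np.clip(idx, 0, mel.shape[0] - 1)
    return mel[idx], f0[idx]
```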

Trebolium commented 3 years ago

Miralan, very impressive to hear you're running MOS experiments with F0 information applied. Is there anywhere I can listen to your generated samples? Would love to talk more about this!

Miralan commented 3 years ago

OK, I did that experiment a long time ago, so I can't find the results anymore. But I did try concatenating normalized F0s, and it didn't work well; for example, some of the content was missing from the output wav. Maybe you can try CREPE to extract the F0s.
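
For reference, a minimal sketch of F0 extraction with the crepe package, assuming mono audio readable by scipy; the 0.5 confidence threshold for zeroing likely-unvoiced frames is an assumption, not something from this thread:

```python
import crepe
import numpy as np
from scipy.io import wavfile

sr, audio = wavfile.read('sample.wav')  # placeholder path
# Returns per-frame timestamps, F0 in Hz, voicing confidence, and activations.
time, frequency, confidence, activation = crepe.predict(audio, sr, viterbi=True)
f0 = np.where(confidence > 0.5, frequency, 0.0)  # assumed unvoiced threshold
```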