Plachtaa / seed-vc

State-of-the-Art zero-shot voice conversion & singing voice conversion with in context learning
GNU General Public License v3.0

Pitch Shift Issues with Sample Rate and Threshold Adjustments #26

Closed Drakni01 closed 3 weeks ago

Drakni01 commented 1 month ago

Hi!

I wanted to share some issues I've run into with the pitch shifting feature when f0_condition is set to True and auto_f0_adjust is set to False. When I set the pitch shift to zero, the output comes out around two semitones lower than the original pitch.

I tried changing the sample rate from sr=22050 to sr=24000 just in this part of the code in app.py:

waves_16k = torchaudio.functional.resample(waves_24k, sr, 16000) 
converted_waves_16k = torchaudio.functional.resample(converted_waves_24k, sr, 16000) 

to:

waves_16k = torchaudio.functional.resample(waves_24k, 24000, 16000) 
converted_waves_16k = torchaudio.functional.resample(converted_waves_24k, 24000, 16000) 
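
As a rough check on why the labelled sample rate matters here, the pitch error from a rate mismatch can be estimated directly. This is only a sketch: it assumes the audio in waves_24k really is sampled at 24000 Hz and was being passed to the resampler labelled as 22050 Hz. The function name is hypothetical and is not part of app.py.

```python
import math


def semitone_error_from_rate_mismatch(true_sr: float, assumed_sr: float) -> float:
    """Pitch offset in semitones when audio sampled at true_sr is
    treated as if it had been sampled at assumed_sr."""
    # Labelling the samples with a lower rate stretches the waveform in
    # time, so every frequency is scaled by assumed_sr / true_sr.
    return 12.0 * math.log2(assumed_sr / true_sr)


# Hypothetical scenario from this thread: 24 kHz audio labelled as 22.05 kHz.
offset = semitone_error_from_rate_mismatch(true_sr=24000, assumed_sr=22050)
print(f"{offset:.2f} semitones")
```

Under that assumption the F0 extractor would read every frame about 1.47 semitones flat, which is in the same direction and of similar size to the offset described above.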

This made a noticeable difference. I also adjusted the threshold from 0.03 to 0.5 in these lines:

F0_ori = rmvpe.infer_from_audio(waves_16k[0], thred=0.03)
F0_alt = rmvpe.infer_from_audio(converted_waves_16k[0], thred=0.03)

to:

F0_ori = rmvpe.infer_from_audio(waves_16k[0], thred=0.5)
F0_alt = rmvpe.infer_from_audio(converted_waves_16k[0], thred=0.5)

With those changes, the pitch detection improved quite a bit, although it still wasn't perfect. I also tried using sr=25000 while keeping the threshold at 0.03, and it sounded much better than the first alternative.
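
One possible reason the threshold matters: if thred acts as a per-frame voicing confidence cutoff (an assumption here, not something confirmed in this thread), a very low value lets low-confidence frames into the F0 track, and those can drag a summary statistic such as the median. The sketch below uses synthetic numbers and hypothetical helper names; it is not RMVPE's actual code.

```python
import numpy as np


def apply_voicing_threshold(f0_hz, confidence, thred):
    """Set frames whose confidence does not exceed thred to 0 (unvoiced)."""
    f0_hz = np.asarray(f0_hz, dtype=float)
    confidence = np.asarray(confidence, dtype=float)
    return np.where(confidence > thred, f0_hz, 0.0)


def voiced_median(f0_hz):
    """Median F0 over voiced frames only; 0.0 if no frame is voiced."""
    voiced = f0_hz[f0_hz > 0]
    return float(np.median(voiced)) if voiced.size else 0.0


# Synthetic example: three confident frames near 220 Hz and three
# low-confidence frames near 110 Hz (for instance, octave errors in noise).
f0 = np.array([220.0, 221.0, 110.0, 109.0, 111.0, 219.0])
conf = np.array([0.90, 0.85, 0.10, 0.08, 0.12, 0.80])

loose = apply_voicing_threshold(f0, conf, thred=0.03)   # keeps every frame
strict = apply_voicing_threshold(f0, conf, thred=0.5)   # keeps confident frames

print(voiced_median(loose), voiced_median(strict))
```

With these made-up values the loose threshold gives a median of 165 Hz and the strict one 220 Hz, so any pitch adjustment derived from the median would differ noticeably between the two settings.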

However, I still encounter another issue. Even when I adjust the pitch to zero using either of the methods mentioned, there's a problem that I can't seem to compensate for when using pitch_shift values greater than +6 or less than -6. Some notes are transposed correctly, while others are not, and I'm not sure why. As a result, it becomes difficult to perform a complete scale transposition up or down, especially when using +12 or -12, as it sounds much more misaligned—some notes come through fine, but others do not.
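
To pin down which notes miss the target, one option is to compare the F0 track of the source with that of the converted audio frame by frame. The sketch below is a hypothetical diagnostic, not part of the repository. It assumes both tracks are already frame-aligned, have the same length, and use 0 for unvoiced frames.

```python
import numpy as np


def semitone_deviation(f0_source, f0_converted, requested_shift):
    """Per-frame difference between the realized and the requested shift,
    in semitones. Frames unvoiced in either track are returned as NaN."""
    f0_source = np.asarray(f0_source, dtype=float)
    f0_converted = np.asarray(f0_converted, dtype=float)
    both_voiced = (f0_source > 0) & (f0_converted > 0)

    deviation = np.full(f0_source.shape, np.nan)
    realized = 12.0 * np.log2(f0_converted[both_voiced] / f0_source[both_voiced])
    deviation[both_voiced] = realized - requested_shift
    return deviation


# Synthetic example: an octave up was requested, but the last frame
# stayed at the original pitch.
src = np.array([220.0, 0.0, 330.0, 440.0])
out = np.array([440.0, 0.0, 660.0, 440.0])
dev = semitone_deviation(src, out, requested_shift=12)
print(dev)
```

Frames with a deviation near 0 were transposed as requested; frames near -12 or +12 suggest the output stayed in, or jumped to, a different octave.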

I hope this feedback helps! Thanks for all your hard work on the project!

Plachtaa commented 1 month ago

Thanks for your feedback. I am facing some quality issues with the F0-conditioned model, and it is difficult to train a good one, but I will accept your changes as a temporary fix :)

Drakni01 commented 1 month ago

Hi! :] Thank you so much for considering my changes as a temporary fix.

I’ve been thinking about the issues with scaling up or down by more than 6 semitones after extracting the frequencies with RMVPE, and I’m planning to explore a few other possibilities. In theory, using a function like:

def adjust_f0_semitones(f0_sequence, n_semitones):
    factor = 2 ** (n_semitones / 12)
    return f0_sequence * factor

should work correctly since the mathematical basis behind it makes sense—it adjusts the frequency using a proper factor based on semitone shifts. Therefore, I can rule out that this function is the source of the problems during transposition, but I admit that I’m not entirely sure why the results don't align as expected when performing larger scale transpositions like +12 or -12 semitones.
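
A quick sanity check of the function on a small array supports this. The sketch assumes the F0 sequence is a NumPy array in Hz with 0 marking unvoiced frames; that convention is an assumption, not something stated in the thread.

```python
import numpy as np


def adjust_f0_semitones(f0_sequence, n_semitones):
    # Equal-temperament ratio: each semitone multiplies frequency by 2**(1/12).
    factor = 2 ** (n_semitones / 12)
    return f0_sequence * factor


f0 = np.array([220.0, 0.0, 440.0])  # 0.0 = unvoiced frame (assumed convention)

up = adjust_f0_semitones(f0, 12)     # one octave up doubles each frequency
down = adjust_f0_semitones(f0, -12)  # one octave down halves each frequency
round_trip = adjust_f0_semitones(adjust_f0_semitones(f0, 7), -7)

print(up, down, round_trip)
```

Unvoiced frames stay at 0 because the shift is multiplicative, and shifting up and back down returns the original values to within floating-point error.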

I also tried changing the pitch directly on the source voice before using RMVPE to see if the issue could be related to handling very low or very high frequencies during extraction. Although this helped confirm that the extraction itself wasn’t the issue (as it handled those frequencies correctly), the fact remains that changing the pitch of the source is not a viable solution because it significantly affects the prosody.

This leads me to think that the problem might lie in how the transposed frequencies are processed later in the pipeline, perhaps at this step:

cond, _, codes, commitment_loss, codebook_loss = inference_module.length_regulator(S_alt, ylens=target_lengths, n_quantizers=3, f0=shifted_f0_alt)

Thanks again for using what I proposed as a temporary solution for the pitch_shift = 0 case. I hope that my observations from the experiments I conducted can help in resolving this issue. I understand that the challenges may stem from the model itself, but I wanted to explore whether the problem might lie elsewhere. Nonetheless, the results when transposing within the ±6 semitone range are impressive! 🎉

Drakni01 commented 1 month ago

I apologize for closing the issue by accident. I pressed the wrong button; I'm still new to this. Thank you for your understanding!

Plachtaa commented 4 weeks ago

Please try the newly released F0-conditioned model; it should follow F0 more closely.

Drakni01 commented 3 weeks ago

Hi!

I tried the newly released F0-conditioned model, and it works great! The pitch accuracy is spot-on, and it handles both an octave up and an octave down very well. The quality of the new model is impressive—thank you for resolving the issue! I think the matter can now be considered closed.

Thanks again! :]