Drakni01 closed this issue 3 weeks ago
Thanks for your feedback. I am facing some quality issues with the F0-conditioned model and it is difficult to train a good one, but I will accept your changes as a temporary fix :)
Hi! :] Thank you so much for considering my changes as a temporary fix.
I’ve been thinking about the issues with scaling up or down by more than 6 semitones after extracting the frequencies with RMVPE, and I’m planning to explore a few other possibilities. In theory, using a function like:
def adjust_f0_semitones(f0_sequence, n_semitones):
    factor = 2 ** (n_semitones / 12)
    return f0_sequence * factor
should work correctly, since the math behind it is sound: each semitone multiplies the frequency by 2^(1/12), so a shift of n semitones multiplies it by 2^(n/12). I can therefore rule out this function as the source of the problems during transposition, but I admit that I'm not entirely sure why the results don't align as expected for larger transpositions such as +12 or -12 semitones.
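One way to see which notes go wrong is to compare the shifted F0 that was fed to the model against the F0 extracted again from the converted audio, frame by frame. The following is only a minimal sketch: the function names are hypothetical and not part of the project, it uses plain Python lists, and it assumes both sequences are already aligned and use 0 for unvoiced frames.

```python
import math

def semitone_deviation(target_f0, output_f0):
    """Per-frame deviation (in semitones) of output_f0 from target_f0.

    Frames where either value is <= 0 are treated as unvoiced (an assumption)
    and yield None instead of a number.
    """
    deviations = []
    for target, output in zip(target_f0, output_f0):
        if target > 0 and output > 0:
            deviations.append(12 * math.log2(output / target))
        else:
            deviations.append(None)
    return deviations

def count_off_frames(deviations, tolerance=0.5):
    """Count voiced frames whose deviation exceeds `tolerance` semitones."""
    return sum(1 for d in deviations if d is not None and abs(d) > tolerance)

# Example: the second frame came out one octave too low.
target = [220.0, 440.0, 0.0, 330.0]
output = [220.0, 220.0, 0.0, 330.0]
print(semitone_deviation(target, output))  # [0.0, -12.0, None, 0.0]
```

Deviations clustered near ±12 would point to octave errors, while scattered smaller deviations would suggest that the model is not following the conditioning closely at that pitch range.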
I also tried changing the pitch directly on the source voice before using RMVPE to see if the issue could be related to handling very low or very high frequencies during extraction. Although this helped confirm that the extraction itself wasn’t the issue (as it handled those frequencies correctly), the fact remains that changing the pitch of the source is not a viable solution because it significantly affects the prosody.
This leads me to think that the problem might lie in how the transposed frequencies are processed later in the pipeline, perhaps at this step:
cond, _, codes, commitment_loss, codebook_loss = inference_module.length_regulator(S_alt, ylens=target_lengths, n_quantizers=3, f0=shifted_f0_alt)
Thanks again for using what I proposed as a temporary solution for the pitch_shift = 0 case. I hope that my observations from the experiments I conducted can help in resolving this issue.
I understand that the challenges may stem from the model itself, but I wanted to explore whether the problem might lie elsewhere. Nonetheless, the results when transposing within the ±6 semitone range are impressive! 🎉
I apologize for accidentally closing the issue; I pressed the wrong button, as I'm still new to this. Thank you for your understanding!
Please try the newly released F0-conditioned model; it should follow F0 more closely.
Hi!
I tried the newly released F0-conditioned model, and it works great! The pitch accuracy is spot-on, and it handles both an octave up and an octave down very well. The quality of the new model is impressive—thank you for resolving the issue! I think the matter can now be considered closed.
Thanks again! :]
Hi!
I wanted to share some things I've been facing with the pitch shifting feature when f0_condition is set to True and auto_f0_adjust is set to False. When I set the pitch to zero, it seems to be around two semitones lower than the original pitch.
I tried changing the sample rate from sr=22050 to sr=24000 just in this part of the code in app.py [code snippets not preserved]. This made a noticeable difference. I also adjusted the threshold from 0.03 to 0.5 in these lines [code snippets not preserved].
With those changes, the pitch detection improved quite a bit, although it still wasn't perfect. I also tried using sr=25000 while keeping the threshold at 0.03, and it sounded much better than the first alternative.
However, I still encounter another issue. Even when I adjust the pitch to zero using either of the methods mentioned, there's a problem that I can't seem to compensate for when using pitch_shift values greater than +6 or less than -6. Some notes are transposed correctly, while others are not, and I'm not sure why. As a result, it becomes difficult to perform a complete scale transposition up or down, especially when using +12 or -12, as it sounds much more misaligned: some notes come through fine, but others do not.
I hope this feedback helps! Thanks for all your hard work on the project!
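For scale, the offset reported at pitch_shift = 0 can be compared with what a pure sample-rate mismatch would produce. This is only a sketch under an assumption the thread does not confirm, namely that audio sampled at one rate is somewhere treated as if it had another rate. The helper name is hypothetical and is not part of app.py.

```python
import math

def sample_rate_offset_semitones(sr_a, sr_b):
    """Pitch offset, in semitones, implied by treating audio sampled at
    sr_a as if it had been sampled at sr_b (frequencies scale by sr_b / sr_a)."""
    return 12 * math.log2(sr_b / sr_a)

# The sample rates tried in the report above.
for sr in (24000, 25000):
    offset = sample_rate_offset_semitones(22050, sr)
    print(f"22050 vs {sr}: {offset:.2f} semitones")
```

Under that assumption, 22050 vs 24000 corresponds to about 1.5 semitones and 22050 vs 25000 to about 2.2 semitones, which is in the neighborhood of the "around two semitones" described. That is consistent with a sample-rate mismatch but does not by itself confirm the cause.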