b04901014 / UUVC

Official implementation for the paper: A Unified One-Shot Prosody and Speaker Conversion System with Self-Supervised Discrete Speech Units.
MIT License

Is there a way to increase the pitch range? #4

Closed: dillfrescott closed this issue 1 year ago

b04901014 commented 1 year ago

If you mean the acceptable pitch range when training the model, you can try lowering --f0_var_min and raising --f0_max.

If you mean synthesizing an utterance with a higher pitch range, you can try manipulating the predicted pitch directly during inference (say, scaling it by 1.2). How to do this is not obvious because the model works with frequency bins, but you can convert the bins to a scalar, manipulate it, and map it back to bins.

To manipulate the predicted f0 variance, first recover the scalar f0 value with something like:

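# Predict per-bin f0 scores
f0_preds, f0_feats = self.f0p(rv, unit)
# Squash each bin score into (0, 1)
f0 = torch.sigmoid(f0_preds)
v_bins = self.f0_bins.unsqueeze(0).unsqueeze(0).expand(f0_preds.size(0), f0_preds.size(1), -1).to(f0_preds.device)
# Normalize over bins and take the weighted average of the bin values
f0 = (v_bins * (f0 / f0.sum(-1, keepdim=True))).sum(-1)  # (N, T), scalar f0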

Then scale it, or apply any other manipulation of f0 you want, with something like:

f0 = f0 * 1.2  # or other manipulations

Finally map it back to frequency bins:

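bins = self.f0_bins.unsqueeze(0).expand(f0.size(0), f0.size(1), -1).to(f0.device)
# Repeat the scalar f0 across the bin dimension
f0 = f0.unsqueeze(2).expand(-1, -1, bins.size(-1))
# Gaussian-blurred activation of each bin around the scalar f0
f0 = torch.exp(-(bins - f0) ** 2 / (2 * self.hp.f0_blur_sigma ** 2))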

The rest is the same (feed the result to the synthesizer). I haven't tested this, but hopefully it will work.
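
To make the round trip easier to check in isolation, here is a minimal single-frame sketch in plain Python (no torch, no model). The bin centers, the blur sigma, and the fake bin scores are all made-up values for illustration, and the helper names bins_to_scalar and scalar_to_bins are hypothetical, not functions from this repository. It only mirrors the arithmetic of the snippets above.

```python
import math

def bins_to_scalar(scores, bin_centers):
    """Sigmoid each bin score, normalize, and return the weighted mean of the bin centers."""
    probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
    total = sum(probs)
    return sum(c * p / total for c, p in zip(bin_centers, probs))

def scalar_to_bins(f0, bin_centers, sigma):
    """Gaussian-blurred activation of each bin around the scalar f0."""
    return [math.exp(-((c - f0) ** 2) / (2 * sigma ** 2)) for c in bin_centers]

# Assumed values for illustration only (not the repository's settings).
bin_centers = [100.0 + 10.0 * i for i in range(31)]  # 100, 110, ..., 400
sigma = 10.0

# Fake per-bin scores for one frame, peaked at the bin centered on 200.
scores = [5.0 - (c - 200.0) ** 2 / 200.0 for c in bin_centers]

f0 = bins_to_scalar(scores, bin_centers)            # recover the scalar
scaled = f0 * 1.2                                   # manipulate it
new_bins = scalar_to_bins(scaled, bin_centers, sigma)  # map back to bins

# Center of the bin with the strongest activation after scaling.
peak = bin_centers[max(range(len(new_bins)), key=new_bins.__getitem__)]
print(round(f0, 3), round(scaled, 3), peak)
```

One thing this makes visible: if the scaled value lands outside the range covered by the bins, every bin gets only a weak activation, so large scaling factors may need a model trained with a wider range.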

dillfrescott commented 1 year ago

Thank you! Hopefully it will indeed work!

dillfrescott commented 1 year ago

It seems to be working! I increased the f0 ranges! Thank you so much!