does the `cond` need to be projected to same channels in DiffSingerNet?

X-LANCE / VoiceFlow-TTS

[ICASSP 2024] This is the official code for "VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching"

https://cantabile-kwok.github.io/VoiceFlow/


does the `cond` need to be projected to same channels in DiffSingerNet? #6

Closed p0p4k closed 9 months ago

p0p4k commented 9 months ago

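This code does not work:

import torch

# DiffSingerNet is the denoiser class from this repository, built with its default arguments.
diffnet = DiffSingerNet()

x = torch.randn(2, 80, 10)       # input with 80 channels
x_mask = torch.ones(2, 1, 10)
t = torch.tensor([1])
mu = torch.randn(2, 80, 10)      # condition with 80 channels

diffnet(x, x_mask, mu, t)        # fails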

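The following code works:

import torch

# DiffSingerNet is the denoiser class from this repository, built with its default arguments.
diffnet = DiffSingerNet()

x = torch.randn(2, 80, 10)       # input with 80 channels
x_mask = torch.ones(2, 1, 10)
t = torch.tensor([1])
mu = torch.randn(2, 128, 10)     # condition with 128 channels

diffnet(x, x_mask, mu, t)        # runs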

p0p4k commented 9 months ago

Also, `cond = s + mu` means that `mu` and `s` must have the same dimensions. `s` is the output of an MLP with `encoder_hidden` output dimensions, but the inputs `mu` and `x` have `spec_dim` channels?

p0p4k commented 9 months ago

Ah, now I see: `encoder_hidden` and `n_feats` are both 80 in the main TTS model.
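
For readers following along, the shape constraint discussed above can be sketched in plain PyTorch. This is a minimal illustration, not the repository's code: the shape of `s`, the mismatched size of 128, and the `proj` layer are assumptions made up for the example. It shows that the element-wise sum works when the channel counts match, fails when they differ, and that a 1x1 convolution is one possible way to reconcile them.

```python
import torch
import torch.nn as nn

batch, frames = 2, 10
n_feats = 80  # the thread reports 80 channels in the main TTS model

mu = torch.randn(batch, n_feats, frames)

# Case 1: s has the same channel count as mu, so the sum is well defined.
# The (batch, channels, 1) shape of s is an assumption; it broadcasts over time.
s_same = torch.randn(batch, n_feats, 1)
cond = s_same + mu  # shape (2, 80, 10)

# Case 2: a hypothetical mismatch (128 vs 80 channels) cannot broadcast.
encoder_hidden = 128  # assumed value, for illustration only
s_wide = torch.randn(batch, encoder_hidden, 1)
try:
    _ = s_wide + mu
    mismatch_raises = False
except RuntimeError:
    mismatch_raises = True

# One possible remedy (an assumption, not what the repository does):
# project mu to encoder_hidden channels with a 1x1 convolution before adding.
proj = nn.Conv1d(n_feats, encoder_hidden, kernel_size=1)
cond_proj = s_wide + proj(mu)  # shape (2, 128, 10)
```

When both sizes are already equal, as in the main TTS model described above, no projection is needed.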

cantabile-kwok commented 9 months ago

Oh yes, sorry for any confusion! There might be some implicit requirements for the tensor dimensions. I will take a careful look later.

p0p4k commented 9 months ago

Your repo helped me implement P-Flow TTS, so thanks for that. Am I correct that DiffSinger is a modified WaveNet decoder?

cantabile-kwok commented 9 months ago

In the DiffSinger paper, they say "We adopt a non-causal WaveNet (Oord et al. 2016) architecture proposed by (Rethage, Pons, and Serra 2018; Kong et al. 2021) as our denoiser." So yes, it is a non-causal version of WaveNet rather than the original WaveNet. The reference here is "Rethage, D.; Pons, J.; and Serra, X. 2018. A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5069–5073. IEEE."
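
To make the causal versus non-causal distinction concrete, here is a small sketch that applies the same dilated `Conv1d` with two padding schemes. It is not the DiffSinger denoiser or this repository's code; the sizes, the helper names `non_causal` and `causal`, and the perturbation test are all assumptions chosen for illustration. With symmetric padding an output frame depends on a later input frame, while with left-only padding it does not.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
channels, frames, dilation = 4, 16, 2

# A kernel-size-3 dilated convolution with no built-in padding;
# padding is applied by hand so both schemes share the same weights.
conv = nn.Conv1d(channels, channels, kernel_size=3, dilation=dilation)

def non_causal(x):
    # Symmetric padding: frame t sees frames t - dilation, t, t + dilation.
    return conv(F.pad(x, (dilation, dilation)))

def causal(x):
    # Left-only padding: frame t sees frames t - 2*dilation, t - dilation, t.
    return conv(F.pad(x, (2 * dilation, 0)))

x = torch.randn(1, channels, frames)
x_future = x.clone()
x_future[:, :, 10] += 1.0  # perturb a single later frame

t = 8  # exactly `dilation` frames before the perturbed one
with torch.no_grad():
    non_causal_changed = not torch.allclose(
        non_causal(x)[:, :, t], non_causal(x_future)[:, :, t]
    )
    causal_changed = not torch.allclose(
        causal(x)[:, :, t], causal(x_future)[:, :, t]
    )
```

Here `non_causal_changed` should come out true and `causal_changed` false, which is the property that lets a non-causal denoiser use context on both sides of each frame.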