Closed by p0p4k 9 months ago
Also, `cond = s + mu` means that `mu` and `s` must have the same dimensions. But `s` is the output of an MLP with `encoder_hidden` output dimensions, while the inputs `mu` and `x` have `spec_dim` channels?
Ah, now I see: `encoder_hidden` and `n_feats` are both 80 dims in the main TTS model.
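To make the implicit requirement concrete, here is a minimal sketch in plain Python of the broadcasting rule that an element-wise sum like `cond = s + mu` has to satisfy. The helper `broadcast_shape`, the batch size, the frame count, and the assumed layout of `s` as `(batch, encoder_hidden, 1)` are all hypothetical and only for illustration; the repo's actual tensor layout may differ.

```python
def broadcast_shape(a, b):
    """Return the broadcast shape of two tensor shapes, or raise ValueError.

    Mimics the NumPy/PyTorch rule: shapes are aligned from the trailing
    dimension, and each pair of sizes must be equal or one of them must be 1.
    """
    n = max(len(a), len(b))
    a = (1,) * (n - len(a)) + tuple(a)
    b = (1,) * (n - len(b)) + tuple(b)
    result = []
    for da, db in zip(a, b):
        if da == db or da == 1 or db == 1:
            result.append(db if da == 1 else da)
        else:
            raise ValueError(f"cannot add shapes {a} and {b}")
    return tuple(result)


# Hypothetical sizes, chosen only for illustration.
batch, frames = 2, 100
n_feats = 80          # channels of mu and x (spec_dim)
encoder_hidden = 80   # output width of the MLP that produces s

mu_shape = (batch, n_feats, frames)
s_shape = (batch, encoder_hidden, 1)  # assumption: s is broadcast over time

# Works because encoder_hidden == n_feats.
cond_shape = broadcast_shape(s_shape, mu_shape)
print(cond_shape)

# With encoder_hidden = 256 the same sum would fail.
try:
    broadcast_shape((batch, 256, 1), mu_shape)
except ValueError as err:
    print("mismatch:", err)
```

So the sum only goes through when `encoder_hidden` equals `n_feats`, which is why the 80/80 setting in the main model hides the constraint.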
Oh yes, sorry for any confusion! There might be some implicit requirements for the tensor dimensions. I will take a careful look later.
Your repo helped me implement pflow TTS, so thanks for that. Am I correct that DiffSinger is a modified WaveNet decoder?
In the DiffSinger paper, they say "We adopt a non-causal WaveNet (Oord et al. 2016) architecture proposed by (Rethage, Pons, and Serra 2018; Kong et al. 2021) as our denoiser." So yes, it is a non-causal variant of WaveNet rather than the original WaveNet. The reference here is "Rethage, D.; Pons, J.; and Serra, X. 2018. A wavenet for speech denoising. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 5069–5073. IEEE."
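For anyone unsure what "non-causal" changes in practice, here is a rough sketch in plain Python of a single dilated 1-D convolution with causal and non-causal padding. It is not DiffSinger's code; the function `dilated_conv1d` and its arguments are hypothetical and only illustrate where the zero padding goes.

```python
def dilated_conv1d(x, weights, dilation=1, causal=False):
    """Apply one dilated 1-D convolution, keeping the output length equal to len(x).

    causal=True pads only on the left, so output[t] depends on x[<= t].
    causal=False splits the padding between both sides, so output[t] also
    depends on future samples (the non-causal case).
    """
    k = len(weights)
    span = (k - 1) * dilation           # total padding needed
    left = span if causal else span // 2
    right = span - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [
        sum(w * padded[t + i * dilation] for i, w in enumerate(weights))
        for t in range(len(x))
    ]


impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]

# Causal: the impulse at t=2 only affects outputs at t >= 2.
print(dilated_conv1d(impulse, kernel, causal=True))

# Non-causal: the impulse also affects the output at t=1.
print(dilated_conv1d(impulse, kernel, causal=False))
```

Since a denoiser sees the whole spectrogram at once, there is no need to hide future frames, which is presumably why the symmetric (non-causal) padding is used.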
This code does not work.
The following code works.