Closed yygg678 closed 1 year ago
Thanks for your attention. The linear projection is done on the time axis; that is, the F0 sequence compressed by the convolution layers is stretched back to its original length. Our implementation follows "SELF-SUPERVISED REPRESENTATIONS FOR SINGING VOICE CONVERSION". If you have any other questions about our paper, you can contact me by email or WeChat. Email: ningziqian@mail.nwpu.edu.cn WeChat: __NZQian
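A minimal sketch of what "linear projection on the time axis" could mean: instead of projecting the feature dimension, an `nn.Linear` is applied across the time dimension, mapping the stretched length `t'` back to a fixed target length `T`. The concrete sizes (`B`, `T`, `F`, `t_prime`) below are hypothetical, not from the papers.

```python
import torch
import torch.nn as nn

B, T, F = 2, 100, 64   # hypothetical batch size, target length, feature channels
t_prime = 205          # hypothetical stretched length after the transposed conv

# The Linear layer operates on the time axis, not the feature axis.
proj = nn.Linear(t_prime, T)

x = torch.randn(B, t_prime, F)   # [B, t', F] output of the transposed conv
y = proj(x.transpose(1, 2))      # [B, F, t'] -> [B, F, T]
y = y.transpose(1, 2)            # back to [B, T, F]
print(y.shape)                   # torch.Size([2, 100, 64])
```

Because `nn.Linear` needs a fixed input size, this interpretation implies `t'` (and hence the input length `t`) is fixed per branch, which is consistent with padding or truncating the input to a known maximum length.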
I was trying to reproduce a voice conversion model in PyTorch and had difficulty implementing a module called PBTC (Parallel Bank of Transposed Convolutions). As described in the paper, it consists of several parallel branches, each a transposed convolution followed by a linear projection.
The input to this module is a sequence of F0 (fundamental frequency) embeddings with shape [B, T, L]. Basically, we want to pass it through the PBTC module and get a new sequence with shape [B, T, F].
As shown in the figure, the input first passes through a transposed convolution, producing a sequence with shape [B, t', F]. It then passes through a linear projection, producing a sequence with shape [B, T, F].
But how should such a linear projection be defined? As in the figure, t' is computed from t, the dilation dil, and the kernel size k, where t should be the length of the input sequence, T. But we don't know t in advance, since the input length varies across data points. So should I fix the input length, truncating or padding the raw input sequence to fit? That doesn't seem to make sense...
My implementation of the PBTC module is as follows:
```python
class PBTC(nn.Module):
    """
    Parallel Bank of Transposed Convolutions
    Reference:
    https://www.isca-speech.org/archive/pdfs/interspeech_2020/webber20_interspeech.pdf
    https://arxiv.org/pdf/2303.12197.pdf
    """
```
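For reference, a self-contained sketch of one possible PBTC implementation, under the assumption (raised in the question, not confirmed by the papers) that the input is padded or truncated to a fixed `max_len`, so each branch's stretched length t' is known at construction time and a per-branch Linear can map the time axis back to `max_len`. The dilation schedule, channel sizes, and the choice to sum (rather than concatenate) the branches are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PBTC(nn.Module):
    """Hedged sketch of a Parallel Bank of Transposed Convolutions.

    Assumptions (not from the papers): input is padded/truncated to max_len;
    each branch uses dilation i+1; branch outputs are summed.
    """

    def __init__(self, in_dim=1, out_dim=64, max_len=100,
                 kernel_size=11, num_branches=10):
        super().__init__()
        self.max_len = max_len
        self.branches = nn.ModuleList()
        for i in range(num_branches):
            dil = i + 1  # hypothetical dilation schedule: 1, 2, ..., N
            # ConvTranspose1d output length with stride=1, padding=0:
            # t' = t + dilation * (kernel_size - 1)
            t_prime = max_len + dil * (kernel_size - 1)
            self.branches.append(nn.ModuleDict({
                "conv": nn.ConvTranspose1d(in_dim, out_dim,
                                           kernel_size, dilation=dil),
                # Linear over the time axis: stretch t' back to max_len
                "proj": nn.Linear(t_prime, max_len),
            }))

    def forward(self, x):
        # x: [B, T, in_dim]; pad or truncate the time axis to max_len
        B, T, C = x.shape
        if T < self.max_len:
            x = F.pad(x, (0, 0, 0, self.max_len - T))
        else:
            x = x[:, :self.max_len]
        x = x.transpose(1, 2)  # [B, in_dim, max_len], Conv1d layout
        out = 0
        for br in self.branches:
            h = br["conv"](x)   # [B, out_dim, t']
            h = br["proj"](h)   # time-axis projection -> [B, out_dim, max_len]
            out = out + h       # summing branches is an assumption
        return out.transpose(1, 2)  # [B, max_len, out_dim]


# Usage: shorter and longer inputs both map to [B, max_len, out_dim]
model = PBTC()
y = model(torch.randn(2, 80, 1))
print(y.shape)  # torch.Size([2, 100, 64])
```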
How did you implement the PBTC module?