NZqian / SVCC2023-t23-ASLP


I had difficulty implementing a module called PBTC (Parallel Bank of Transposed Convolutions) #1

Closed yygg678 closed 1 year ago

yygg678 commented 1 year ago

I was trying to reproduce a voice conversion model using PyTorch and had difficulty implementing a module called PBTC (Parallel Bank of Transposed Convolutions). As described in the paper, it consists of several parallel branches, each a transposed convolution followed by a linear projection.

The input to this module is a sequence of f0 (fundamental frequency) embeddings with shape [B, T, L]. We want to pass it through the PBTC module and get a new sequence with shape [B, T, F].

As shown in the figure, the input first passes through a transposed convolution, which gives a sequence with shape [B, t', F]. It then passes through a linear projection, which gives a sequence with shape [B, T, F].

But the question is how to define such a linear projection. As in the figure, t' is computed from t, the dilation dil, and the kernel size k, where t should be the length of the input sequence, T. But we don't know t exactly, since the length of the input sequence varies across data points. So should I fix the input length, truncating or padding the raw input sequence to fit it? But this doesn't seem to make sense...
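To make the length relationship concrete (these numbers are arbitrary, just plugged into the formula from the figure): with t = 100, k = 50, and dil = 3, we get t' = (t - 1) + dil * (k - 1) + 1 = 99 + 3 * 49 + 1 = 247. So each dilation produces a different t', and each branch needs its own linear layer sized to that t'.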

My implementation of the PBTC module is as follows:

```python
import torch
import torch.nn as nn


class PBTC(nn.Module):
    """
    Parallel Bank of Transposed Convolutions
    Reference:
    https://www.isca-speech.org/archive/pdfs/interspeech_2020/webber20_interspeech.pdf
    https://arxiv.org/pdf/2303.12197.pdf
    """

    def __init__(self,
                 in_channels,
                 out_channels,
                 input_length,
                 output_length,
                 kernel_size=50,
                 num_branches=10):
        super(PBTC, self).__init__()
        self.in_channels = in_channels
        self.out_channels = out_channels
        self.input_length = input_length
        self.output_length = output_length
        self.kernel_size = kernel_size
        self.num_branches = num_branches

        self.branches = nn.ModuleList()

        # odd dilations: 1, 3, 5, ..., 2 * num_branches - 1
        for dilation in range(1, 2 * num_branches, 2):
            # t' = (t - 1) + dil * (k - 1) + 1
            input_length_prime = (input_length - 1) + dilation * (self.kernel_size - 1) + 1
            self.branches.append(
                nn.Sequential(
                    # stretch the time axis from input_length to t'
                    nn.ConvTranspose1d(in_channels, out_channels,
                                       kernel_size=kernel_size, stride=1, dilation=dilation),
                    # project the stretched time axis t' back to output_length
                    nn.Linear(in_features=input_length_prime, out_features=output_length),
                    nn.ReLU()
                )
            )

    def forward(self, seq):
        # seq: [B, T, L]
        seq = seq.permute(0, 2, 1)  # [B, L, T]
        # sum the branch outputs; each branch yields [B, F, output_length]
        encoded_seq = sum(branch(seq) for branch in self.branches)
        encoded_seq = encoded_seq.permute(0, 2, 1)  # [B, output_length, F]
        return encoded_seq
```
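For reference, here is a quick shape check with dummy data against the class above (the sizes B = 4, T = 100, L = 80, F = 192 are arbitrary, chosen just for illustration):

```python
pbtc = PBTC(in_channels=80, out_channels=192, input_length=100, output_length=100)
x = torch.randn(4, 100, 80)  # [B, T, L]
y = pbtc(x)
print(y.shape)               # torch.Size([4, 100, 192]) -> [B, T, F]
```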

How did you implement the PBTC module?

NZqian commented 1 year ago

Thanks for your attention. The linear projection is done on the time axis, that is, the F0 features compressed by the convolution layers are stretched back to their original length. Our implementation follows "SELF-SUPERVISED REPRESENTATIONS FOR SINGING VOICE CONVERSION". If you have any other questions about our paper, you can contact me by email or WeChat.
email: ningziqian@mail.nwpu.edu.cn
wechat: __NZQian
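For anyone reading later, a minimal sketch of what "linear projection on the time axis" means here, assuming a single branch output of shape [B, F, t'] and a target length T (the names and sizes below are mine, not from the paper):

```python
import torch
import torch.nn as nn

B, n_filters, t_prime, T = 4, 192, 247, 100  # hypothetical sizes
branch_out = torch.randn(B, n_filters, t_prime)

# nn.Linear acts on the last dimension, so keeping time last
# maps the stretched length t' back to the target length T
time_proj = nn.Linear(t_prime, T)
restored = time_proj(branch_out)  # [B, F, T]
print(restored.shape)             # torch.Size([4, 192, 100])
```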