KimythAnly / AGAIN-VC

This is the official implementation of the paper AGAIN-VC: A One-shot Voice Conversion using Activation Guidance and Adaptive Instance Normalization.
https://kimythanly.github.io/AGAIN-VC-demo/index

about train #10

Closed: qianxixi908 closed this issue 1 year ago

qianxixi908 commented 3 years ago

Hello, I tried to run the training script, but it failed with the following error:

    raise ValueError("num_samples should be a positive integer "
ValueError: num_samples should be a positive integer value, but got num_samples=0

Could you give me some advice on how to solve it? Thanks.
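For context, this error is raised by PyTorch's RandomSampler when the Dataset it wraps reports a length of 0, i.e. the loader found no training samples at all. A minimal sketch reproducing it (assuming PyTorch 1.x; TensorDataset here is just a stand-in for the repo's own dataset class):

import torch
from torch.utils.data import DataLoader, TensorDataset

# A dataset with zero items, standing in for a feature folder
# that the preprocessing step never populated.
empty = TensorDataset(torch.empty(0, 80))

# shuffle=True makes DataLoader build a RandomSampler, whose
# constructor rejects num_samples=0 with the error shown above.
loader = DataLoader(empty, batch_size=4, shuffle=True)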

KimythAnly commented 3 years ago

Hi, what dataset did you use?

qianxixi908 commented 3 years ago

Thank you for your reply. I used the default dataset, VCTK.

KimythAnly commented 3 years ago

Did you preprocess the data correctly? The folder structure should look like this:

data/features/vctk/mel
├── p225_001.wav.npy
├── p225_002.wav.npy
├── p225_003.wav.npy
...
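A quick way to confirm the features are actually there (a minimal sketch; the path is the one shown above) is to count the .npy files:

import glob

# Count the preprocessed mel features; if this prints 0, the
# DataLoader will see an empty dataset and raise num_samples=0.
files = glob.glob('data/features/vctk/mel/*.npy')
print(f'found {len(files)} feature files')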
qianxixi908 commented 3 years ago

Thanks for your answer.

Yes, I followed those steps. I even used your original samples directly to check whether the code runs, but the same error was still reported. While debugging, I also found the following problem:

Inspecting the config's 'dataset' and 'feat_path' entries in the debugger raised: SyntaxError: invalid syntax

KimythAnly commented 3 years ago

Hi, how about the file data/indexes/vctk/indexes.pkl? It should be a dict with two keys, train and dev, whose values contain entries like p311_198.wav.npy.
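A minimal sketch for inspecting that file (the path and keys are the ones named above):

import pickle

# Load the index file and check its structure: a dict with
# 'train' and 'dev' keys whose values hold entries like
# 'p311_198.wav.npy'. Empty values would explain num_samples=0.
with open('data/indexes/vctk/indexes.pkl', 'rb') as f:
    indexes = pickle.load(f)

print(type(indexes), list(indexes.keys()))
print('train size:', len(indexes['train']))
print('dev size:', len(indexes['dev']))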

qianxixi908 commented 3 years ago

Hello, thank you for your previous answer. Can you provide some details or code for the speaker classifier mentioned in the experimental section of the paper?

KimythAnly commented 3 years ago

Hi, here is the speaker classifier. It might be slightly different from the one we used in the article due to some code refactoring, but the results should be similar.

- c_in: the dimension of the content/speaker embedding.
- c_h: the hidden dimension; we set it to 128.
- c_out: the number of speakers, which is 80 in this paper.

import torch.nn as nn
from einops import rearrange

class Classifier(nn.Module):
    def __init__(self, c_in, c_h, c_out):
        super(Classifier, self).__init__()
        # Project the input embedding to the hidden dimension.
        self.in_layer = nn.Linear(c_in, c_h)
        # Three temporal convolutions (kernel size 3, no padding),
        # each followed by a ReLU.
        self.conv_relu_block = nn.Sequential(
            nn.Conv1d(c_h, c_h, 3),
            nn.ReLU(),
            nn.Conv1d(c_h, c_h, 3),
            nn.ReLU(),
            nn.Conv1d(c_h, c_h, 3),
            nn.ReLU(),
        )
        # Map the pooled hidden vector to per-speaker logits.
        self.out_layer = nn.Linear(c_h, c_out)

    def forward(self, x):
        """
        x: (n, c, t)
        """
        # Linear layers act on the last axis, so move channels there.
        x = rearrange(x, 'n c t -> n t c')
        y = self.in_layer(x)
        # Conv1d expects (n, c, t), so move channels back.
        y = rearrange(y, 'n t c -> n c t')
        y = self.conv_relu_block(y)
        # Average-pool over time, then classify.
        y = y.mean(-1)
        y = self.out_layer(y)
        return y
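
A minimal usage sketch with the settings quoted above (c_h=128, c_out=80); the embedding dimension, batch size, and sequence length here are arbitrary placeholders:

import torch

# c_in=4 is a placeholder for the content/speaker embedding dim.
model = Classifier(c_in=4, c_h=128, c_out=80)
x = torch.randn(8, 4, 128)   # (n, c, t): batch of 8, 128 frames
logits = model(x)            # -> (8, 80), one logit per speaker
print(logits.shape)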