auspicious3000 / SpeechSplit

Unsupervised Speech Decomposition Via Triple Information Bottleneck
http://arxiv.org/abs/2004.11284
MIT License

How to build the validation data? #62

Open AShoydokova opened 2 years ago

AShoydokova commented 2 years ago

Hello,

thank you so much for the code and paper! I'm trying to train the model on speech command data. I built the training and validation sets with two scripts, make_spect_f0.py and make_metadata.py, but the model fails at the validation step, on this line: x_identic_val = self.G(x_f0, x_real_pad, emb_org_val)

The error is: RuntimeError: The expanded size of the tensor (192) must match the existing size (1085) at non-singleton dimension 1. Target sizes: [-1, 192, -1]. Tensor sizes: [1085, 1].

I'm not sure why there is a mismatch, since self.G worked during training. There is, however, the "G identity mapping loss" step, which preprocesses the input before feeding it to self.G. Do I need to do the same with the validation data? Also, 192 is max_len_pad = 192, while 1085 is the number of speakers (dim_spk_emb = 1085). Do I need to change max_len_pad?

I'd appreciate any help or direction!

My hparams.py is below

from tfcompat.hparam import HParams  # import as in the repo's hparams.py

hparams = HParams(
    # model   
    freq = 8,
    dim_neck = 8,
    freq_2 = 8,
    dim_neck_2 = 1,
    freq_3 = 8,
    dim_neck_3 = 32,
    out_channels = 10 * 3,
    layers = 24,
    stacks = 4,
    residual_channels = 512,
    gate_channels = 512,  # split into 2 groups internally for gated activation
    skip_out_channels = 256,
    cin_channels = 80,
    gin_channels = -1,  # i.e., speaker embedding dim
    weight_normalization = True,
    n_speakers = -1,
    dropout = 1 - 0.95,
    kernel_size = 3,
    upsample_conditional_features = True,
    upsample_scales = [4, 4, 4, 4],
    freq_axis_kernel_size = 3,
    legacy = True,

    dim_enc = 512,
    dim_enc_2 = 128,
    dim_enc_3 = 256,

    dim_freq = 80,
    dim_spk_emb = 1085,
    dim_f0 = 257,
    dim_dec = 512,
    len_raw = 128,
    chs_grp = 16,

    # interp
    min_len_seg = 19,
    max_len_seg = 32,
    # min_len_seq = 64,
    min_len_seq = 0,
    # max_len_seq = 128,
    max_len_seq = 10,
    max_len_pad = 192,

    # data loader
    root_dir = 'assets/spmel',
    feat_dir = 'assets/raptf0',
    batch_size = 16,
    mode = 'train',
    shuffle = True,
    num_workers = 0,
    samplier = 8,  # sic: spelling matches the repo's hparams

    # Convenient model builder
    builder = "wavenet",

    hop_size = 256,
    log_scale_min = float(-32.23619130191664),

)
auspicious3000 commented 2 years ago

What is the "G identity mapping loss" step? I guess one of the tensors needs to be transposed because dim and length mean different things.
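
For illustration, a minimal sketch of that kind of transpose, assuming the offending tensor is the [1085, 1] one from the error message (the variable name is hypothetical):

import torch

emb = torch.zeros(1085, 1)   # shape reported in the error message: [1085, 1]
emb = emb.transpose(0, 1)    # -> [1, 1085], i.e. [batch, dim_spk_emb]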

AShoydokova commented 2 years ago

Thank you so much for the quick response! Let me play around with it. The training part works, but the validation part fails.

The "G identity mapping loss" step is this part of the code in the Solver.train method, which preprocesses the training data:

# G identity mapping loss
x_f0 = torch.cat((x_real_org, f0_org), dim=-1)             # concatenate spectrogram and F0 along the feature axis
x_f0_intrp = self.Interp(x_f0, len_org)                    # random resampling of segments
f0_org_intrp = quantize_f0_torch(x_f0_intrp[:,:,-1])[0]    # quantize the resampled F0 channel into one-hot bins
x_f0_intrp_org = torch.cat((x_f0_intrp[:,:,:-1], f0_org_intrp), dim=-1)  # recombine features with quantized F0
AShoydokova commented 2 years ago

I've fixed my issue. The problem was that I was creating the speaker embeddings as 1-dimensional arrays, while the model expected 2-dimensional ones. I have 1085 speakers, and for each speaker I created a one-hot encoding vector of size [1085], while the model expected a vector of size [1, 1085].
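
In case it helps anyone else, a minimal sketch of that fix (the helper name is mine, not from the repo):

import numpy as np

def make_spk_emb(spk_idx, n_speakers=1085):
    # Build a one-hot speaker embedding shaped [1, n_speakers], not [n_speakers]
    emb = np.zeros((1, n_speakers), dtype=np.float32)
    emb[0, spk_idx] = 1.0
    return emb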

Thank you again for your help!

9527950 commented 1 year ago

I have trained the network with extremely poor results. May I ask how your validation set is set up? I used the demo.pkl file from the code directly and found that the loss goes up. Also, the hyperparameters given in the source code don't seem to match yours; for example, dim_spk_emb = 82.