OlaWod / FreeVC

FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion

Inference or train with WavLM-Base or WavLM-Base+? #69

Open khacanh opened 1 year ago

khacanh commented 1 year ago

Hi team,

Thank you very much for releasing this model!

I'm curious about training/inference with WavLM-Base or WavLM-Base+ to improve performance.

Running inference with WavLM-Base throws this error:

% python convert.py --hpfile configs/freevc.json --ptfile checkpoints/freevc.pt --txtpath convert.txt --outdir outputs/freevc
Loading model...
Loading checkpoint...
INFO:root:Loaded checkpoint 'checkpoints/freevc.pt' (iteration 1372)
Loading WavLM for content...
{} <wavlm.WavLM.WavLMConfig object at 0x140b2ea50>
INFO:wavlm.WavLM:WavLM Config: {'extractor_mode': 'default', 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': 'gelu', 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 0.1, 'normalize': False, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True, 'expand_attention_head_size': -1}
Loading speaker encoder...
Loaded the voice encoder model on cpu in 0.01 seconds.
Processing text...
Synthesizing...
0it [00:00, ?it/s]/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/librosa/effects.py:490: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return y[full_index], np.asarray([start, end])
0it [00:06, ?it/s]
Traceback (most recent call last):
  File "convert.py", line 83, in <module>
    audio = net_g.infer(c, g=g_tgt)
  File "[...]/freevc/FreeVC/models.py", line 347, in infer
    z_p, m_p, logs_p, c_mask = self.enc_p(c, c_lengths)
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/freevc/FreeVC/models.py", line 72, in forward
    x = self.pre(x) * x_mask
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 298, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 295, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [192, 1024, 1], expected input[1, 768, 1030] to have 1024 channels, but got 768 channels instead
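
For reference, the WavLM config dump above shows 'encoder_embed_dim': 768, so this looks like a channel-count mismatch rather than a sequence-length problem: the checkpoint's first content-encoder layer was built for 1024-dim WavLM-Large features, while WavLM-Base emits 768-dim features (the 1030 in [1, 768, 1030] is just the number of frames). A minimal PyTorch sketch, not FreeVC code, with the Conv1d sizes read off the error message, reproduces the same failure:

import torch
import torch.nn as nn

# The checkpoint's "weight of size [192, 1024, 1]" corresponds to a
# Conv1d(in_channels=1024, out_channels=192, kernel_size=1).
pre = nn.Conv1d(in_channels=1024, out_channels=192, kernel_size=1)

# WavLM-Large features, shape (batch, 1024, frames): this works.
print(pre(torch.randn(1, 1024, 1030)).shape)  # torch.Size([1, 192, 1030])

# WavLM-Base features, shape (batch, 768, frames): raises the same
# RuntimeError, "expected input[1, 768, 1030] to have 1024 channels".
pre(torch.randn(1, 768, 1030))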

So my guess is that we need to train a model with WavLM-Base first, and then change the config file ./configs/freevc.json to run inference?
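
If I'm reading the code right, the knob that would have to change is the model's ssl_dim. A quick sanity check, assuming the key sits under "model" in configs/freevc.json as in the released config:

import json

# Compare the config's SSL feature size against the WavLM variant in use.
with open("configs/freevc.json") as f:
    hps = json.load(f)

print(hps["model"]["ssl_dim"])  # 1024 in the released config (WavLM-Large)
# WavLM-Base / Base+ produce 768-dim features, so using them with the
# released checkpoint fails; a model trained with "ssl_dim": 768 would
# be needed.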

Thank you in advance.

Regards, KA

MuruganR96 commented 1 year ago

@khacanh I think the ssl_dim is the issue: WavLM-Base features are 768-dimensional, while the checkpoint expects 1024 (in the error, [1, 768, 1030] is [batch, feature_dim, frames], so 1030 is the frame count, not the feature size). You would need to reshape or project the 768-dim features into the expected [1, 1024, T] input, or retrain with ssl_dim = 768.
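
One caveat: a plain reshape cannot fix a channel-count mismatch, since 768 * 1030 elements cannot be viewed as 1024 channels of the same length. A learned projection could map the features into the 1024-dim space, but its weights would need to be trained; with random initialization the converted audio would be noise. A shape-level sketch only, using a hypothetical nn.Linear projection:

import torch
import torch.nn as nn

# Hypothetical, untrained projection from WavLM-Base features (768-dim)
# to the 1024-dim feature space the FreeVC checkpoint expects.
proj = nn.Linear(768, 1024)

c_base = torch.randn(1, 768, 1030)                     # (batch, channels, frames)
c_proj = proj(c_base.transpose(1, 2)).transpose(1, 2)  # (batch, 1024, frames)
print(c_proj.shape)                                    # torch.Size([1, 1024, 1030])

The proper fix is still what @khacanh guessed above: train a model with ssl_dim = 768 and point the config at WavLM-Base.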