[Open] khacanh opened this issue 1 year ago
Hi team,
Thank you very much for releasing this model!
I'm curious about training/inference with WavLM to improve performance.
Running inference with WavLM-Base throws this error:
```
% python convert.py --hpfile configs/freevc.json --ptfile checkpoints/freevc.pt --txtpath convert.txt --outdir outputs/freevc
Loading model...
Loading checkpoint...
INFO:root:Loaded checkpoint 'checkpoints/freevc.pt' (iteration 1372)
Loading WavLM for content...
{} <wavlm.WavLM.WavLMConfig object at 0x140b2ea50>
INFO:wavlm.WavLM:WavLM Config: {'extractor_mode': 'default', 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': 'gelu', 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'feature_grad_mult': 0.1, 'normalize': False, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': 'static', 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': 'static', 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'relative_position_embedding': True, 'num_buckets': 320, 'max_distance': 800, 'gru_rel_pos': True, 'expand_attention_head_size': -1}
Loading speaker encoder...
Loaded the voice encoder model on cpu in 0.01 seconds.
Processing text...
Synthesizing...
0it [00:00, ?it/s]/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/librosa/effects.py:490: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return y[full_index], np.asarray([start, end])
0it [00:06, ?it/s]
Traceback (most recent call last):
  File "convert.py", line 83, in <module>
    audio = net_g.infer(c, g=g_tgt)
  File "[...]/freevc/FreeVC/models.py", line 347, in infer
    z_p, m_p, logs_p, c_mask = self.enc_p(c, c_lengths)
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "[...]/freevc/FreeVC/models.py", line 72, in forward
    x = self.pre(x) * x_mask
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 298, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/Users/macos/.pyenv/versions/3.7.10/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 295, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [192, 1024, 1], expected input[1, 768, 1030] to have 1024 channels, but got 768 channels instead
```
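The shape mismatch in the traceback can be checked in miniature without running the full model. This is just a sketch of the Conv1d channel rule, not FreeVC code; the numbers are copied from the RuntimeError above:

```python
# Shapes copied verbatim from the RuntimeError above.
weight_shape = (192, 1024, 1)   # enc_p.pre Conv1d weight: [out_channels, in_channels, kernel]
input_shape = (1, 768, 1030)    # WavLM-Base content features: [batch, channels, frames]

def conv1d_channels_match(weight, x):
    """A Conv1d with groups=1 requires the input's channel dim (x[1])
    to equal the weight's in_channels (weight[1])."""
    return weight[1] == x[1]

print(conv1d_channels_match(weight_shape, input_shape))  # prints False: 768 != 1024
```

In other words, the released checkpoint's content encoder was built for 1024-dim features (WavLM-Large), while WavLM-Base emits 768-dim features, so the very first Conv1d rejects the input.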
So my guess is that we need to train with WavLM-Base first, and then change the config file `./configs/freevc.json` to run inference?
Thank you in advance.
Regards, KA
@khacanh I think `ssl_dim` should be 768 for WavLM-Base. In the failing input `[1, 768, 1030]`, the dimensions are `[batch, channels, frames]`, so the content feature size is 768, while the released checkpoint's encoder expects 1024 channels (WavLM-Large).
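For reference, the change being suggested would look roughly like this. This is only a sketch: the exact key layout of `configs/freevc.json` may differ, and the `ssl_dim` name is taken from the comment above rather than verified against the repo.

```json
{
  "model": {
    "ssl_dim": 768
  }
}
```

Note that editing the config alone would not make the released `freevc.pt` checkpoint work, since its weights were trained against 1024-dim features; a model trained with WavLM-Base features would be needed as well.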