RVC-Project / Retrieval-based-Voice-Conversion

in preparation...
MIT License

Some errors thrown during infer, and the generated output file is bad quality #14

Closed ybwai closed 5 months ago

ybwai commented 6 months ago

MacBook Pro Intel i9 8-Core / AMD Radeon Pro 5300M / 32GB DDR4 RAM / macOS Sonoma 14.2, Python 3.10.13, Poetry 1.7.1. CLI command:

PYTORCH_ENABLE_MPS_FALLBACK=1  rvc infer -rmr 1 -p 0 -ir 0.75  -m weights/Peter/model.pth -if weights/Peter/index.index -i input.mp3 -o output1.mp3

command output:

INFO:rvc.configs.config:No supported Nvidia GPU found
INFO:rvc.configs.config:overwrite configs.json
INFO:rvc.configs.config:Use mps instead
INFO:rvc.configs.config:is_half:False, device:mps
UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
DEBUG:rvc.lib.infer_pack.models:gin_channels: 256, self.spk_embed_dim: 109
INFO:rvc.modules.vc.modules:Select index: 
INFO:fairseq.tasks.hubert_pretraining:current directory is /Retrieval-based-Voice-Conversion
INFO:fairseq.tasks.hubert_pretraining:HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': 'metadata', 'fine_tuning': False, 'labels': ['km'], 'label_dir': 'label', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': False, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
INFO:fairseq.models.hubert.hubert:HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': default, 'encoder_layers': 12, 'encoder_embed_dim': 768, 'encoder_ffn_embed_dim': 3072, 'encoder_attention_heads': 12, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.1, 'attention_dropout': 0.1, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.05, 'dropout_input': 0.1, 'dropout_features': 0.1, 'final_dim': 256, 'untie_final_proj': True, 'layer_norm_first': False, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 0.1, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'conv_pos_batch_norm': False, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': False, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
Traceback (most recent call last):
  File "/Retrieval-based-Voice-Conversion/rvc/modules/vc/pipeline.py", line 307, in pipeline
    index = faiss.read_index(file_index)
  File "/Retrieval-based-Voice-Conversion/.venv/lib/python3.10/site-packages/faiss/swigfaiss_avx2.py", line 9924, in read_index
    return _swigfaiss_avx2.read_index(*args)
TypeError: Wrong number or type of arguments for overloaded function 'read_index'.
  Possible C/C++ prototypes are:
    faiss::read_index(char const *,int)
    faiss::read_index(char const *)
    faiss::read_index(FILE *,int)
    faiss::read_index(FILE *)
    faiss::read_index(faiss::IOReader *,int)
    faiss::read_index(faiss::IOReader *)

    INFO:rvc.modules.vc.pipeline:Loading rmvpe model,assets/rmvpe/rmvpe.pt
/Retrieval-based-Voice-Conversion/.venv/lib/python3.10/site-packages/torch/functional.py:650: UserWarning: The operator 'aten::_fft_r2c' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/mps/MPSFallback.mm:13.)
  return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore[attr-defined]
/Retrieval-based-Voice-Conversion/rvc/lib/infer_pack/attentions.py:334: UserWarning: MPS: The constant padding of more than 3 dimensions is not currently supported natively. It uses View Ops default implementation to run. This may have performance implications. (Triggered internally at /Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/operations/Pad.mm:474.)
  x = F.pad(
{'npy': 6.011045217514038, 'f0': 135.6644949913025, 'infer': 27.060052633285522}
Finish inference. Check output1.mp3

Although I get an output file, the sound has lots of artefacts/noise and is not smooth at all. I see some warnings and errors in the console output; are they the cause, or is it the models I am using?

Also, how do I combine the output with the instrumental when using music audio?
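(For anyone hitting the same question: rvc itself only converts the vocal track, so the instrumental has to be separated beforehand with a source-separation tool and mixed back in afterwards. A minimal stdlib-only mixing sketch, assuming both files are already 16-bit mono WAVs at the same sample rate — `mix_wavs` is a hypothetical helper, not part of rvc:)

```python
import struct
import wave


def mix_wavs(vocals_path: str, instrumental_path: str, out_path: str) -> None:
    """Naively sum two 16-bit mono WAV files sample-by-sample, with clipping.

    Assumes both inputs share the same sample rate and are already time-aligned;
    a real pipeline would resample and align them first.
    """
    with wave.open(vocals_path, "rb") as v, wave.open(instrumental_path, "rb") as i:
        assert v.getframerate() == i.getframerate(), "sample rates must match"
        assert v.getnchannels() == i.getnchannels() == 1, "sketch handles mono only"
        assert v.getsampwidth() == i.getsampwidth() == 2, "sketch handles 16-bit only"
        n = min(v.getnframes(), i.getnframes())
        vocals = struct.unpack(f"<{n}h", v.readframes(n))
        backing = struct.unpack(f"<{n}h", i.readframes(n))
        # Sum and clamp to the signed 16-bit range to avoid wraparound distortion.
        mixed = [max(-32768, min(32767, a + b)) for a, b in zip(vocals, backing)]
        framerate = v.getframerate()
    with wave.open(out_path, "wb") as out:
        out.setnchannels(1)
        out.setsampwidth(2)
        out.setframerate(framerate)
        out.writeframes(struct.pack(f"<{n}h", *mixed))
```

In practice you would also want to attenuate each track before summing (or use ffmpeg's `amix` filter) rather than rely on hard clipping.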

Thanks,

Tps-F commented 6 months ago

The index file does not seem to be loaded. Check the path, or try using the full path.
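(The SWIG `TypeError: Wrong number or type of arguments` from `faiss.read_index` typically means it was handed something other than a plain string path, which fits the empty `Select index:` line in the log. A minimal sketch of a fail-fast check before calling faiss; `resolve_index_path` is a hypothetical helper, not part of rvc:)

```python
from pathlib import Path


def resolve_index_path(file_index: str) -> str:
    """Return an absolute path to the .index file, failing early if it is missing.

    faiss.read_index gives an opaque SWIG TypeError when the argument is not a
    valid string, so validating and absolutizing the path up front produces a
    much clearer error message.
    """
    p = Path(file_index).expanduser().resolve()
    if not p.is_file():
        raise FileNotFoundError(f"FAISS index file not found: {p}")
    return str(p)


# Usage (hypothetical), mirroring the failing call in rvc/modules/vc/pipeline.py:
# index = faiss.read_index(resolve_index_path("weights/Peter/index.index"))
```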