RVC-Project / Retrieval-based-Voice-Conversion-WebUI

Easily train a good VC model with voice data <= 10 mins!
MIT License
24.76k stars 3.63k forks

How to reduce WER (Word Error Rate) of RVC output #2364

Open asr-aditya opened 3 weeks ago

asr-aditya commented 3 weeks ago

I am currently using the Retrieval-based Voice Conversion (RVC) model for voice conversion on English audio, and I have observed a higher-than-expected Word Error Rate (WER) in the converted output. While the timbre and overall speech characteristics are well preserved, there are noticeable discrepancies in pronunciation and word clarity that reduce the intelligibility of the converted speech.

I would like to understand what steps can be taken to reduce the WER in RVC-generated audio. Specifically, I am looking for:

  1. Model adjustments: Are there any tweaks in model architecture or hyperparameters that could improve the word accuracy in the output?
  2. Preprocessing/Postprocessing: What preprocessing techniques (e.g., noise reduction, normalization) or postprocessing steps could help reduce WER?
  3. Training dataset: How crucial is the quality and quantity of the training data? Should I prioritize more phonetically diverse datasets, or can any enhancements be made at the dataset level to improve WER?
  4. Fine-tuning: Are there recommendations for fine-tuning the model, particularly for improving articulation and reducing mispronunciations in the converted audio?
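
For context on the metric itself: WER is conventionally the word-level edit distance (substitutions + deletions + insertions) between a reference transcript and an ASR transcript of the audio, divided by the number of reference words. Below is a minimal sketch of that calculation, assuming both transcripts are already available as plain strings. The function name `wer` and the sample sentences are illustrative only and are not part of the RVC codebase; real evaluations usually also strip punctuation and normalize numbers before scoring.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    # Assumption: lowercasing and whitespace splitting is enough normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: ground truth text vs. ASR output on converted audio.
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"
    print(f"WER = {wer(reference, hypothesis):.3f}")
```

Scoring the source audio and the converted audio with the same ASR system and comparing the two WER values helps separate errors introduced by the conversion from errors the recognizer would make anyway.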