I am currently using the Retrieval-based Voice Conversion (RVC) model for voice conversion on English audio, and I have observed a higher-than-expected Word Error Rate (WER) in the converted output. While the timbre and overall speech characteristics are well preserved, there are noticeable pronunciation and word-clarity problems that reduce the intelligibility of the converted speech.
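For context, this is roughly how I am measuring WER: transcribe the converted audio with an off-the-shelf ASR model and score it against the known reference transcript. A minimal sketch follows (it assumes the openai-whisper and jiwer packages; the file path and reference text are placeholders):

```python
# Sketch: transcribe RVC output with an ASR model and score it against
# the reference transcript.
# Assumes: pip install openai-whisper jiwer
import string

import jiwer
import whisper

# Placeholder inputs -- substitute your own converted file and transcript.
converted_audio = "converted/sample_001.wav"
reference_text = "the quick brown fox jumps over the lazy dog"

# Transcribe the converted clip with a small English-only Whisper model.
asr = whisper.load_model("base.en")
hypothesis = asr.transcribe(converted_audio)["text"]

def normalize(text: str) -> str:
    # Lowercase and strip punctuation so formatting differences
    # don't inflate the error rate.
    return text.lower().translate(str.maketrans("", "", string.punctuation)).strip()

error_rate = jiwer.wer(normalize(reference_text), normalize(hypothesis))
print(f"WER: {error_rate:.3f}")
```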
I would like to understand what steps can be taken to reduce the WER in RVC-generated audio. Specifically, I am looking for:
Model adjustments: Are there any tweaks to the model architecture or hyperparameters that could improve word accuracy in the output?
Preprocessing/Postprocessing: What preprocessing techniques (e.g., noise reduction, loudness normalization) or postprocessing steps could help reduce WER? (For the kind of preprocessing I mean, see the first sketch after this list.)
Training dataset: How crucial are the quality and quantity of the training data? Should I prioritize more phonetically diverse datasets, or are there other dataset-level enhancements that would improve WER? (The second sketch after this list shows the kind of phoneme-coverage check I have in mind.)
Fine-tuning: Are there recommendations for fine-tuning the model, particularly for improving articulation and reducing mispronunciations in the converted audio?
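To make the preprocessing question concrete, here is a minimal sketch of the kind of input cleanup I have in mind (it assumes the noisereduce, pyloudnorm, and soundfile packages, a mono clip, and placeholder file paths):

```python
# Sketch: denoise and loudness-normalize a clip before feeding it to RVC.
# Assumes: pip install noisereduce pyloudnorm soundfile
import noisereduce as nr
import pyloudnorm as pyln
import soundfile as sf

in_path = "raw/sample_001.wav"     # placeholder input path
out_path = "clean/sample_001.wav"  # placeholder output path

# Assumes a mono clip; multi-channel audio would need per-channel handling.
audio, sr = sf.read(in_path)

# Spectral-gating noise reduction (estimates the noise profile from the clip itself).
audio = nr.reduce_noise(y=audio, sr=sr)

# Normalize integrated loudness to a fixed target (-23 LUFS here) so all
# clips sit at a consistent level across the dataset.
meter = pyln.Meter(sr)
loudness = meter.integrated_loudness(audio)
audio = pyln.normalize.loudness(audio, loudness, -23.0)

sf.write(out_path, audio, sr)
```

And for the dataset question, this is the kind of phonetic-coverage check I could run over the training transcripts to spot gaps in phonetic diversity (a sketch assuming the g2p_en package; the transcript list is a placeholder):

```python
# Sketch: count how many distinct ARPAbet phonemes the training
# transcripts cover.
# Assumes: pip install g2p_en
from collections import Counter

from g2p_en import G2p

# Placeholder transcripts -- in practice, load these from the dataset metadata.
transcripts = [
    "she had your dark suit in greasy wash water all year",
    "don't ask me to carry an oily rag like that",
]

g2p = G2p()
counts = Counter()
for text in transcripts:
    # G2p returns a list of ARPAbet phonemes plus spaces and punctuation.
    for token in g2p(text):
        if token.strip() and token[0].isalpha():
            # Strip stress digits (e.g., "AH0" -> "AH") before counting.
            counts[token.rstrip("012")] += 1

print(f"distinct phonemes covered: {len(counts)} of 39 ARPAbet phonemes")
for phoneme, n in counts.most_common():
    print(f"{phoneme}: {n}")
```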
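If coverage comes up well short of the full 39-phoneme ARPAbet set, that would suggest adding material targeting the missing phonemes rather than simply adding more hours of similar speech.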