Vidalnt / Applio

A simple, high-quality voice conversion tool focused on ease of use and performance.
https://applio.org
MIT License

[Feature]: Add RMSEnergyExtractor [audio feature extraction]? #698

Open Mixomo opened 1 week ago

Mixomo commented 1 week ago

Description

What Does RMSEnergyExtractor Do?

Calculates RMS Energy:

RMS (root mean square) energy measures the average level of an audio signal and is closely related to its power. It is computed as the square root of the average of the squared amplitudes of the signal over a specific time interval. This metric is useful because it provides an estimate of the "strength" or "intensity" of the sound, which can help in tasks such as volume normalization, sound event detection, or as a feature for speech synthesis and recognition models.
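To make the definition concrete, here is a minimal NumPy sketch of frame-wise RMS. The function name `frame_rms` and the default `frame_length` and `hop_length` values are assumptions chosen for illustration; this is not code from Applio, RVC, or the SVC program mentioned in this issue.

```python
import numpy as np


def frame_rms(signal, frame_length=2048, hop_length=512):
    """Return one RMS value per frame of a mono signal (hypothetical helper)."""
    signal = np.asarray(signal, dtype=np.float64)
    if len(signal) < frame_length:
        # Zero-pad short inputs so that at least one full frame exists.
        signal = np.pad(signal, (0, frame_length - len(signal)))
    n_frames = 1 + (len(signal) - frame_length) // hop_length
    rms = np.empty(n_frames)
    for i in range(n_frames):
        start = i * hop_length
        frame = signal[start:start + frame_length]
        # Square root of the mean of the squared amplitudes.
        rms[i] = np.sqrt(np.mean(frame ** 2))
    return rms


# Example: one second of a 440 Hz sine at 16 kHz (assumed sample rate).
t = np.arange(16000) / 16000
energy = frame_rms(np.sin(2 * np.pi * 440 * t))
```

For a full-scale sine wave each frame's RMS comes out close to 1/sqrt(2), roughly 0.707, which is a quick sanity check for any implementation.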

Usage in Model Training Context:

The RMSEnergyExtractor is used as part of the model's preprocessing pipeline. It specifically extracts energy features from the audio signal that are later used as input or auxiliary features for training purposes.

Problem

Why is RMS Energy Useful in Voice-Related Model Training?

Volume Normalization: RMS energy allows the model to differentiate between audio segments with different energy or volume levels, which is crucial for generating natural and accurate voice synthesis.

Speech Feature Detection: It helps identify parts of the audio with vocal activity or silence, providing an additional signal that can improve model quality.
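As a rough illustration of the silence/vocal-activity idea, the sketch below marks frames as active when their RMS is within a chosen number of decibels of the loudest frame. The function name `active_mask` and the -40 dB default are assumptions for this example, not values taken from Applio or RVC, and real voice activity detection is usually more involved.

```python
import numpy as np


def active_mask(rms, threshold_db=-40.0):
    """Return a boolean mask of frames louder than threshold_db relative to the peak frame."""
    rms = np.asarray(rms, dtype=np.float64)
    peak = rms.max() if rms.size else 0.0
    if peak <= 0.0:
        # Entirely silent (or empty) input: no frame is active.
        return np.zeros(rms.shape, dtype=bool)
    # Convert to dB relative to the loudest frame; clamp to avoid log10(0).
    db = 20.0 * np.log10(np.maximum(rms, 1e-10) / peak)
    return db > threshold_db


# Example with hand-written per-frame RMS values (made up for illustration).
mask = active_mask([0.5, 0.001, 0.4, 0.0])
```

Measuring relative to the peak frame keeps the threshold independent of the recording's overall gain, at the cost of being sensitive to a single very loud frame.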

Proposed Solution

About a year ago, in the early days of RVC, I suggested implementing this feature in the original repository. I had been experimenting with earlier SVC programs, and one of them integrated this function. It made the output much more expressive and natural, effectively capturing the differences in timbre between soft and loud passages present in the training dataset. Some time later, they replied that they had added something similar, but as a post-processing effect (what we know today in the interface as the input/output slider). That obviously did not have the same effect, since it does not draw on the timbre variation and the singer's expressiveness learned from the dataset.

Here I mention the details (among other things that are no longer important):

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/169#issue-1683467423

I don't know if it will be feasible, or if the code needs a lot of tweaking, but I'm just leaving it written here. Maybe it can be implemented in future versions of Applio, not necessarily in the short term.

Thanks for reading :)

Alternatives Considered

I don't know of any other alternatives at the moment.

embis0126 commented 6 days ago

This sounds like a great idea.