Hello! I'm copying this from a different RVC fork, as it sounds like a really useful feature for improving the realism of models!
Description
What Does RMSEnergyExtractor Do?
Calculates RMS Energy:
RMS energy is a measure of the power of an audio signal. It is computed as the square root of the average of the squared amplitudes of the signal over a specific time interval. This metric is useful because it provides an estimate of the "strength" or "intensity" of the sound, which can help in tasks such as volume normalization, sound event detection, or as a feature for speech synthesis and recognition models.
Usage in Model Training Context:
The RMSEnergyExtractor is used as part of the model's preprocessing pipeline. It specifically extracts energy features from the audio signal that are later used as input or auxiliary features for training purposes.
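To make the computation concrete, here is a minimal sketch of what such an extractor might do, computing frame-wise RMS energy as described above. The function name and framing parameters are illustrative, not taken from RVC's codebase; a real extractor (e.g. `librosa.feature.rms`) would also handle padding and windowing:

```python
import math

def rms_energy(samples, frame_length=2048, hop_length=512):
    """Frame-wise RMS energy: sqrt(mean(x^2)) over each frame.

    Sketch only -- frame_length/hop_length defaults are assumptions,
    and edge padding is omitted for brevity.
    """
    energies = []
    for start in range(0, max(len(samples) - frame_length + 1, 1), hop_length):
        frame = samples[start:start + frame_length]
        # Mean of squared amplitudes, then square root
        mean_sq = sum(s * s for s in frame) / len(frame)
        energies.append(math.sqrt(mean_sq))
    return energies
```

The resulting per-frame energy curve is what would be stored alongside pitch features during preprocessing and fed to the model as an auxiliary input.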
Problem
Why is RMS Energy Useful in Voice-Related Model Training?
Volume Normalization: RMS energy allows the model to differentiate between audio segments with different energy or volume levels, which is crucial for generating natural and accurate voice synthesis.
Speech Feature Detection: It helps identify parts of the audio with vocal activity or silence, providing an additional signal that can improve model quality.
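As a sketch of the silence-detection use mentioned above, frames can be labeled voiced or silent by thresholding their RMS energy (the helper name and threshold value are hypothetical, not from the source):

```python
def detect_activity(energies, threshold=0.01):
    """Label each frame as voiced (True) when its RMS energy
    exceeds a threshold. The default threshold is an assumed
    value for illustration; real systems tune it per dataset."""
    return [e > threshold for e in energies]
```

Such a mask gives the model an explicit signal for where vocal activity occurs, complementing the raw energy curve.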
Proposed Solution
In the early days of RVC, about a year ago, I suggested implementing this feature in the original repository. I had previously experimented with other SVC programs, and one of them integrated this function, which made the output much more expressive and natural, effectively capturing the differences in timbre between the soft and loud passages present in the trained dataset.
However, some time later they replied that they had added something similar, but as a post-processing effect (what we know today in the interface as the input/output slider). This obviously did not have the same effect, since it did not use data trained on the timbre variation and the singer's expressiveness present in the dataset.
Here I mention the details (among other things that are no longer important):
https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/issues/169#issue-1683467423