IAHispano / Applio

A simple, high-quality voice conversion tool focused on ease of use and performance
https://applio.org
MIT License
1.82k stars, 293 forks

added lookahead and falloff to the split-audio option #797

Closed Chilluminati91 closed 1 month ago

Chilluminati91 commented 1 month ago

The "Split Audio" option on inference tends to cut into the audio when the threshold is reached on attack and release. This adds a little lookahead and falloff so there are no more drastic cuts.
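A minimal sketch of what lookahead and falloff could mean for the split intervals, assuming intervals are (start, end) sample pairs like those in the log later in this thread. The function name `pad_intervals` and the parameters `lookahead`, `falloff`, and `total_length` are hypothetical and are not taken from the PR's code.

```python
def pad_intervals(intervals, lookahead, falloff, total_length):
    """Start each interval `lookahead` samples earlier and end it
    `falloff` samples later, clamped to the bounds of the audio."""
    padded = []
    for start, end in intervals:
        new_start = max(0, start - lookahead)
        new_end = min(total_length, end + falloff)
        padded.append((new_start, new_end))
    return padded


# Two non-silent regions separated by a 600-sample gap.
intervals = [(1000, 5000), (5600, 9000)]
padded = pad_intervals(intervals, lookahead=400, falloff=400, total_length=10000)
print(padded)
```

Note that with 400 samples of padding on each side, the two padded intervals overlap (the first now ends after the second begins), so any code that later measures the gap between neighbours has to handle a negative value.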

blaisewf commented 1 month ago

@AznamirWoW you're the expert on this

AznamirWoW commented 1 month ago

I don't see much improvement from this method. It just makes non-silent chunks a bit larger and includes some silence.

Current method at the top, the proposed method is at the bottom.

image

AznamirWoW commented 1 month ago

Ran a test on a larger 45 min audio file. Both old and new methods produce the same number (61) of chunks, so no improvement here either.

AznamirWoW commented 1 month ago

X:\ApplioV3.2.6>split.py
intervals: [(0, 236000), (238800, 1268000), (1266800, 3400000), (3400800, 4516000), (4516800, 4706000), (4704800, 4896000), (4894800, 5330000), (5328800, 5364000), (5370800, 5928000), (5924800, 10084000), (10080800, 10258000), (10256800, 13770000), (13766800, 13988000), (13984800, 15528000), (15524800, 15764000), (15760800, 16228000), (16224800, 16992000), (16988800, 17506000), (17504800, 17940000), (17936800, 18262000), (18260800, 18830000), (18826800, 19454000), (19452800, 20368000), (20366800, 20926000), (20922800, 21380000), (21380800, 22820000), (22816800, 23532000), (23528800, 23562000), (23558800, 23566000), (23564800, 23648000), (23644800, 23720000), (23716800, 24096000), (24094800, 25894000), (25890800, 25924000), (25920800, 26092000), (26088800, 26188000), (26188800, 26530000), (26528800, 26542000), (26542800, 27740000), (27744800, 27922000), (27918800, 27928000), (27924800, 27986000), (27982800, 28996000), (28996800, 30042000), (30038800, 30130000), (30126800, 30560000), (30556800, 31096000), (31092800, 31128000), (31126800, 31168000), (31168800, 31212000), (31210800, 32040000), (32038800, 32402000), (32400800, 32704000), (32702800, 34668000), (34668800, 34984000), (34980800, 35600000), (35604800, 36846000), (36842800, 37176000), (37172800, 38908000), (38906800, 39608000), (39606800, 40078000), (40074800, 42793634)]

merging: (238800, 1268000) (0, 236000) duration: 2800

merging: (1266800, 3400000) (238800, 1268000) duration: -1200
Traceback (most recent call last):
  File "X:\ApplioV3.2.6\split.py", line 15, in <module>
    audio_opt = merge_audio(chunks, intervals, 16000, 16000)
  File "X:\ApplioV3.2.6\rvc\lib\tools\split_audio2.py", line 85, in merge_audio
    silence = np.zeros(silence_duration, dtype=audio_segments[0].dtype)
ValueError: negative dimensions are not allowed
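The two logged durations are consistent with the gap being computed as the start of the current interval minus the end of the previous one (238800 - 236000 = 2800, then 1266800 - 1268000 = -1200). The sketch below reproduces that arithmetic and shows why a negative gap makes `np.zeros` raise. The helpers `gap_lengths` and `silence_for_gap` are hypothetical and are not the actual `merge_audio` implementation, which is not shown in this thread.

```python
import numpy as np


def gap_lengths(intervals):
    """Gap in samples between each interval and the one before it.
    A negative value means the two intervals overlap."""
    return [
        cur_start - prev_end
        for (_, prev_end), (cur_start, _) in zip(intervals, intervals[1:])
    ]


def silence_for_gap(gap, dtype=np.float32):
    """Zero-filled padding for a gap; overlapping intervals get none.
    Clamping is an assumption here, one possible way to avoid the error."""
    return np.zeros(max(0, gap), dtype=dtype)


# First three intervals from the log above.
intervals = [(0, 236000), (238800, 1268000), (1266800, 3400000)]
gaps = gap_lengths(intervals)
print(gaps)

try:
    np.zeros(gaps[1])  # negative length, as in the traceback
except ValueError as exc:
    print("ValueError:", exc)

print(len(silence_for_gap(gaps[0])), len(silence_for_gap(gaps[1])))
```

Clamping only avoids the exception. The overlapping samples would still be present in both chunks, so a real fix would also need to trim or crossfade the overlap.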

Chilluminati91 commented 1 month ago

I cannot showcase it with the change right now, but with no lookahead you sometimes get audible clicks in the transformed waveform. If you cut into an audio signal and the cut is not at a 0 point in the waveform, a click is bound to happen. This could be addressed either by introducing lookahead and falloff or by a slight fade at each cut, but that would result in loss of data.

Check the attached files, one transformed with a slight lookahead and one transformed with the lookahead removed (original file does not click, it only occurs after inference).

test.zip

To illustrate: input in yellow, output in purple.

Screenshot 2024-10-09 204457

AznamirWoW commented 1 month ago

I don't see the current split method splitting in the middle of the waveform. image

Chilluminati91 commented 1 month ago

It does not have to. It could split exactly at a 0 point and we would still have popping issues, as the converted file might not start at a 0 point. Once it's assembled back together, the signal pops.
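A small synthetic illustration of this argument, not a measurement on Applio output: one chunk ends close to zero, while the next chunk (standing in for a converted segment) starts at a non-zero value. The 220 Hz tone, the 16 kHz sample rate, and the helper names are all assumptions chosen for the example.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed for this example
t = np.arange(800) / SAMPLE_RATE

# Chunk `a` is a sine that starts at 0 and ends near zero; chunk `b`
# stands in for a converted chunk that starts at a non-zero sample.
a = 0.5 * np.sin(2 * np.pi * 220 * t)
b = 0.5 * np.cos(2 * np.pi * 220 * t)


def boundary_jump(left, right):
    """Absolute sample difference across the join between two chunks."""
    return float(abs(right[0] - left[-1]))


def max_step(x):
    """Largest sample-to-sample change inside a single chunk."""
    return float(np.max(np.abs(np.diff(x))))


jump = boundary_jump(a, b)
inner = max(max_step(a), max_step(b))
print(f"jump at join: {jump:.3f}, largest step inside chunks: {inner:.3f}")
```

The jump at the join is many times larger than any step inside either chunk, which is the kind of discontinuity that can be heard as a click.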

AznamirWoW commented 1 month ago

Anyway, your method fails on my test file and results in a negative overlap.

Chilluminati91 commented 1 month ago

After some further tests, you are right: this overcomplicates things just for the sake of it and adds potential errors with clip and silence lengths. A simple minimal fade before conversion, and again before merging the clips back together, is enough to get rid of potential clicks.

AznamirWoW commented 1 month ago

So you say a converted file may not start at a 0 point and result in a click, yet you fade the audio before the conversion?

Chilluminati91 commented 1 month ago

We have no influence on whether the converted audio starts at a 0 point. If we fade the audio BEFORE conversion, we minimize the probability of a click in the original audio going through conversion. Then we fade the converted audio because we cannot say for certain whether it starts with a click or not.
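A minimal sketch of the kind of fade described here, assuming mono chunks held as NumPy arrays and a linear ramp. The function name `apply_fades` and the `fade_len` parameter are hypothetical, and the PR's actual fade code is not shown in this thread.

```python
import numpy as np


def apply_fades(chunk, fade_len):
    """Return a float32 copy of `chunk` with a linear fade-in and
    fade-out of up to `fade_len` samples at each end."""
    out = np.asarray(chunk).astype(np.float32, copy=True)
    n = min(fade_len, len(out) // 2)  # never let the two fades overlap
    if n > 0:
        ramp = np.linspace(0.0, 1.0, n, endpoint=False, dtype=np.float32)
        out[:n] *= ramp         # first sample is scaled to 0
        out[-n:] *= ramp[::-1]  # last sample is scaled to 0
    return out


chunk = np.ones(1000, dtype=np.float32)
faded = apply_fades(chunk, fade_len=160)  # 160 samples = 10 ms at 16 kHz
print(faded[0], faded[500], faded[-1])
```

Applied once to each chunk before conversion and once to each converted chunk before merging, this forces every chunk to start and end at zero, at the cost of attenuating the first and last `fade_len` samples.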

AznamirWoW commented 1 month ago

Once again, I'd like to see a file where splitting happens at a loud enough volume to produce a click. In all my tests the split happens at a very low point, and after merging the converted audio back there is no audible difference at the split.