Closed Chilluminati91 closed 1 month ago
@AznamirWoW you're the expert on this
I don't see much improvement from this method. It just makes non-silent chunks a bit larger and includes some silence.
Current method at the top, the proposed method is at the bottom.
Ran a test on a larger 45 min audio file. Both old and new methods produce the same number (61) of chunks, so no improvement here either.
X:\ApplioV3.2.6>split.py intervals: [(0, 236000), (238800, 1268000), (1266800, 3400000), (3400800, 4516000), (4516800, 4706000), (4704800, 4896000), (4894800, 5330000), (5328800, 5364000), (5370800, 5928000), (5924800, 10084000), (10080800, 10258000), (10256800, 13770000), (13766800, 13988000), (13984800, 15528000), (15524800, 15764000), (15760800, 16228000), (16224800, 16992000), (16988800, 17506000), (17504800, 17940000), (17936800, 18262000), (18260800, 18830000), (18826800, 19454000), (19452800, 20368000), (20366800, 20926000), (20922800, 21380000), (21380800, 22820000), (22816800, 23532000), (23528800, 23562000), (23558800, 23566000), (23564800, 23648000), (23644800, 23720000), (23716800, 24096000), (24094800, 25894000), (25890800, 25924000), (25920800, 26092000), (26088800, 26188000), (26188800, 26530000), (26528800, 26542000), (26542800, 27740000), (27744800, 27922000), (27918800, 27928000), (27924800, 27986000), (27982800, 28996000), (28996800, 30042000), (30038800, 30130000), (30126800, 30560000), (30556800, 31096000), (31092800, 31128000), (31126800, 31168000), (31168800, 31212000), (31210800, 32040000), (32038800, 32402000), (32400800, 32704000), (32702800, 34668000), (34668800, 34984000), (34980800, 35600000), (35604800, 36846000), (36842800, 37176000), (37172800, 38908000), (38906800, 39608000), (39606800, 40078000), (40074800, 42793634)]
merging: (238800, 1268000) (0, 236000) duration: 2800
merging: (1266800, 3400000) (238800, 1268000) duration: -1200
Traceback (most recent call last):
  File "X:\ApplioV3.2.6\split.py", line 15, in
I can't showcase it with the change right now, but without lookahead you sometimes get audible clicks in the transformed waveform. If you cut into an audio signal and the cut is not at a 0 point in the waveform, a click is bound to happen. This could be addressed either by introducing lookahead and falloff or by a slight fade at each cut, but that would result in loss of data.
Check the attached files: one transformed with a slight lookahead and one transformed with the lookahead removed (the original file does not click; the click only occurs after inference).
To illustrate, input yellow - output purple.
I don't see the current split method splitting in the middle of the waveform.
It does not have to. It could split exactly at a 0 point and we would still have popping issues, as the converted file might not start at a 0 point. Once it's assembled back together, the signal pops.
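A toy sketch of that point, not taken from the Applio code: the 16 kHz sample rate, the 100 Hz sine, and the phase shift standing in for "conversion" are all assumptions made for illustration. It measures the sample-to-sample step at the seam when two chunks are joined.

```python
import math

SR = 16000  # assumed sample rate, for illustration only


def sine(freq, n, phase=0.0):
    """Generate n samples of a sine wave at `freq` Hz."""
    return [math.sin(2 * math.pi * freq * i / SR + phase) for i in range(n)]


def boundary_jump(left, right):
    """Absolute step between the last sample of `left` and the first of `right`."""
    return abs(right[0] - left[-1])


# One full 100 Hz period is 160 samples, so `left` ends just before a zero crossing.
left = sine(100, 160)
original_right = sine(100, 160)                       # starts exactly at 0
converted_right = sine(100, 160, phase=math.pi / 2)   # hypothetical output that starts at 1.0

print(boundary_jump(left, original_right))   # small step: no audible click expected
print(boundary_jump(left, converted_right))  # large step: likely heard as a pop
```

The split itself is clean in both cases; only the second chunk's starting value differs, which is enough to create the discontinuity.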
Anyway, your method fails on my test file and results in a negative overlap.
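For reference, a minimal sketch of how the negative value in the log above can arise and one way to guard against it. The real `split.py` is not shown in this thread, so the helper names `gaps` and `merge_overlapping` are hypothetical; only the three intervals are copied from the log.

```python
def gaps(intervals):
    """Gap between each interval's start and the previous interval's end."""
    return [cur[0] - prev[1] for prev, cur in zip(intervals, intervals[1:])]


def merge_overlapping(intervals):
    """Fuse intervals that touch or overlap, so no gap is ever negative."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it instead of appending.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# First three intervals from the log above.
intervals = [(0, 236000), (238800, 1268000), (1266800, 3400000)]

print(gaps(intervals))               # [2800, -1200]
print(merge_overlapping(intervals))  # [(0, 236000), (238800, 3400000)]
```

The second gap is negative because the third interval starts before the second one ends, which matches the `duration: -1200` line printed just before the traceback.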
After some further tests, you are right: this overcomplicates things just for the sake of it and adds potential errors with clip and silence lengths. A simple minimal fade before conversion, and another before merging the clips back together, is enough to get rid of potential clicks.
So you say a converted file may not start at a 0 point and result in a click, yet you fade the audio before the conversion?
We have no influence on whether the converted audio starts at a 0 point. If we fade the audio BEFORE conversion, we minimize the probability of the original audio going through conversion with a click. Then we fade the converted audio because we cannot say for certain whether it starts with a click or not.
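A rough sketch of the two-stage fade described here, under stated assumptions: `apply_fade` and `convert` are hypothetical names, the fade is linear, and `convert` is an identity placeholder for the actual inference step. Plain lists keep it dependency-free.

```python
def apply_fade(samples, fade_len):
    """Return a copy with a linear fade-in and fade-out of up to fade_len samples."""
    out = list(samples)
    n = min(fade_len, len(out) // 2)  # never let the two ramps overlap
    for i in range(n):
        gain = i / n                  # 0.0 at the very edge, rising toward 1.0
        out[i] *= gain
        out[-1 - i] *= gain
    return out


def convert(samples):
    """Placeholder for the voice conversion step (assumption: identity here)."""
    return list(samples)


chunk = [1.0] * 10
pre_faded = apply_fade(chunk, 4)              # fade BEFORE conversion
post_faded = apply_fade(convert(pre_faded), 4)  # fade again BEFORE merging back

print(pre_faded)  # [0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 0.75, 0.5, 0.25, 0.0]
```

Because the gain is 0 at both edges, every chunk starts and ends at a 0 point regardless of what the conversion produced, at the cost of attenuating `fade_len` samples on each side.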
Once again, I'd like to see a file where splitting happens at a loud enough volume to produce a click. In all my tests the split happens at a very low point, and after merging the converted audio back there is no audible difference at the split.
The "Split Audio" option on inference tends to cut into the audio when the threshold is reached on attack and release. This adds a little lookahead and falloff so there are no more drastic cuts.
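As an illustration of the idea only (the function name `pad_intervals`, its parameters, and the numbers are assumptions, not the code in this PR), lookahead and falloff can be sketched as widening each non-silent interval and clamping it to the file bounds:

```python
def pad_intervals(intervals, lookahead, falloff, total_len):
    """Widen each (start, end) interval by `lookahead` before and `falloff` after."""
    padded = []
    for start, end in intervals:
        padded.append((max(0, start - lookahead), min(total_len, end + falloff)))
    return padded


# Made-up example: two non-silent regions separated by 600 units of silence.
intervals = [(1000, 5000), (5600, 9000)]
padded = pad_intervals(intervals, lookahead=400, falloff=400, total_len=9200)

print(padded)  # [(600, 5400), (5200, 9200)]
```

Note that when the silence between two regions is shorter than `lookahead + falloff`, the padded intervals overlap, so any code that consumes them has to merge or trim neighbours.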