brentspell / hifi-gan-bwe

Unofficial implementation of HiFi-GAN+ from the paper "Bandwidth Extension is All You Need" by Su et al.
MIT License

Inference bug in overlap-add of segments and total inference time #9

Open manosplitsis opened 1 year ago

manosplitsis commented 1 year ago

After training the network on a music dataset, I ran the inference script with my trained model and compared the results with the original audio files. I noticed that after the 30-second mark, the generated audio appears to be time-shifted relative to the original.

[image]

To check, I inverted the phase of the upsampled recording and mixed it with the original audio in a third track. Before the 30-second mark, mostly high frequencies remain, which is to be expected. After that point, a phase-shifted copy of the full audio remains, which makes me suspect that the overlap-add of segments slightly shifts the new segments in time. If not, perhaps this is an inherent limitation of the method?

Another issue with the inference script is that generation is always truncated to a multiple of the 30-second segment length, so the ending of the audio is dropped.

I'll try to look into it; I'm just wondering whether anyone else has dealt with these issues before me. Also, does anyone have insight into whether the overlapping-segments method works reliably for generating long audio, or whether there are more elegant solutions for non-autoregressive architectures?
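For reference, here is a minimal numpy sketch of constant overlap-add streaming (this is not the repo's `_stream`; the frame/overlap sizes and the linear crossfade are my assumptions). When the slicing hop and the write-back hop agree and the crossfade windows sum to one, an identity "model" reconstructs the input exactly; any mismatch between the two hops would land every later frame at the wrong offset, which is exactly the time-shift symptom above.

```python
import numpy as np

def stream_overlap_add(x, frame=8, overlap=2, process=lambda seg: seg):
    """Process `x` in overlapping frames and crossfade the seams.

    Hypothetical sketch (not the repo's _stream). Each frame overlaps
    the previous one by `overlap` samples; a linear fade-out/fade-in
    pair that sums to 1 joins them, so an identity `process` must
    reproduce the input. If the hop used to slice frames differed
    from the hop used to write them back, every later frame would
    land time-shifted.
    """
    hop = frame - overlap
    n_frames = int(np.ceil((len(x) - overlap) / hop))
    padded = np.pad(x.astype(float), (0, n_frames * hop + overlap - len(x)))
    fade_in = np.linspace(0.0, 1.0, overlap, endpoint=False)
    fade_out = 1.0 - fade_in            # the two fades sum to exactly 1
    out = np.zeros_like(padded)
    for i in range(n_frames):
        seg = np.array(process(padded[i * hop : i * hop + frame]))
        if i > 0:
            seg[:overlap] *= fade_in    # fade in over the left seam
        if i < n_frames - 1:
            seg[-overlap:] *= fade_out  # fade out into the next frame
        out[i * hop : i * hop + frame] += seg
    return out[: len(x)]
```

With an identity `process`, the output should be bit-for-bit the input; that makes it a handy harness for isolating whether a shift comes from the stitching or from the model.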

AmorJNYH commented 1 year ago

Maybe this can work: https://github.com/PlayVoice/so-vits-svc-5.0/blob/main/bandex/inference.py

manosplitsis commented 1 year ago

OK, I fixed the issue with audio ending at 30-second intervals by adding some padding before unfolding the audio tensor (I will make a PR soon). The time-shift behaviour persists, though, as seen in the image below:
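For anyone hitting the same truncation, this is the general shape of the padding fix described above, as a hedged numpy sketch (not the actual PR): `FRAME` and `HOP` are illustrative constants, and `sliding_window_view(...)[::HOP]` stands in for torch's `x.unfold(-1, FRAME, HOP)`.

```python
import numpy as np

# Illustrative constants -- not the repo's actual frame/hop sizes.
FRAME, HOP = 8, 6

def frames_with_tail(x):
    """Right-pad `x` before unfolding so the final partial frame
    (the audio past the last full hop) is kept instead of being
    silently dropped, which is what causes the output to stop at
    a multiple of the segment length."""
    overlap = FRAME - HOP
    n_frames = max(int(np.ceil((len(x) - overlap) / HOP)), 1)
    pad = max(n_frames * HOP + overlap - len(x), 0)
    padded = np.pad(x, (0, pad))
    return np.lib.stride_tricks.sliding_window_view(padded, FRAME)[::HOP]
```

Without the pad, a 21-sample signal with these constants yields only 3 frames and sample 20 never reaches the model; with it, a 4th zero-padded frame carries the tail through.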

Screenshot from 2023-05-13 22-06-43

I tested passing a file through the _stream function without using the model, just taking each frame as given by x.unfold and fading in and out of the last frame. The output file is identical to the input, so the _stream function should be correct. At the same time, the first frame is (almost) identical to the original in the non-reconstructed frequencies, so it is not the case that the model always shifts those frequencies. Any ideas?