Open manosplitsis opened 1 year ago
maybe this can work: https://github.com/PlayVoice/so-vits-svc-5.0/blob/main/bandex/inference.py
Ok, I fixed the issue with the audio being cut at 30-second intervals by adding some padding before unfolding the audio tensor (I will make a PR soon). The time-shift behaviour persists, though, as seen in the image below:
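For anyone hitting the same truncation: a minimal sketch of the padding idea, assuming the repo unfolds the raw waveform into fixed-length frames (the frame/hop sizes below are illustrative, not the project's actual config):

```python
import torch
import torch.nn.functional as F

# Pad the waveform up to the next full frame boundary before unfold(),
# so the trailing partial frame is kept instead of silently dropped.
sample_rate = 16000
frame_len = 30 * sample_rate   # 30-second frames (assumed)
hop_len = frame_len            # non-overlapping here, for simplicity

audio = torch.randn(1, 70 * sample_rate)  # 70 s: two full frames + 10 s remainder

# Without padding, unfold drops everything after the last full frame.
frames_unpadded = audio.unfold(-1, frame_len, hop_len)

# Pad the end so the remainder becomes one more (zero-padded) frame.
remainder = (audio.shape[-1] - frame_len) % hop_len
pad = (hop_len - remainder) % hop_len
audio_padded = F.pad(audio, (0, pad))
frames_padded = audio_padded.unfold(-1, frame_len, hop_len)

print(frames_unpadded.shape[-2])  # 2 frames: the 10 s tail is lost
print(frames_padded.shape[-2])    # 3 frames: the tail survives
```

After inference you would trim the final output back to the original length to discard the synthesized padding.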
I tested passing a file through the `_stream` function without using the model, just taking each frame as given by `x.unfold` and fading in and out of the last frame. The output file is identical to the input, so the `_stream` function should be correct. At the same time, the first frame is (almost) identical to the original in the non-reconstructed frequencies, so it is not that the model always shifts those frequencies. Any ideas?
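For reference, this is roughly the round-trip sanity check I mean (a sketch, not the repo's actual `_stream` code; the frame and overlap sizes are made up): split the audio into overlapping frames, then rebuild it with linear crossfades in the overlap regions. If the bookkeeping is right, the reconstruction matches the input sample-for-sample, so any shift must come from the model:

```python
import torch

frame_len, overlap = 1000, 200
hop = frame_len - overlap

audio = torch.randn(hop * 5 + overlap)        # length chosen so frames tile exactly
frames = audio.unfold(0, frame_len, hop)      # (num_frames, frame_len)

# Complementary linear fades: fade_in + fade_out == 1 across the overlap.
fade_in = torch.linspace(0.0, 1.0, overlap)
fade_out = 1.0 - fade_in

out = torch.zeros_like(audio)
for i, frame in enumerate(frames):
    f = frame.clone()
    if i > 0:
        f[:overlap] *= fade_in                # fade in over the previous tail
    if i < len(frames) - 1:
        f[-overlap:] *= fade_out              # fade out under the next head
    out[i * hop : i * hop + frame_len] += f

print(torch.allclose(out, audio, atol=1e-5))  # True: overlap-add is lossless
```

The key property is that the two fades sum to one everywhere in the overlap, so identical frames cancel exactly; if the model's output frames were placed at even slightly wrong offsets, this identity would break and the output would drift in time.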
After training the network on a music dataset and running the inference script with my trained model, I compared the results with the original audio files and noticed that after the 30-second mark the audio seems to be time-shifted relative to the original.
I flipped the phase of the upsampled recording and mixed it with the original audio in the third track. Before the 30-second mark, mostly high frequencies remain, which is expected. After that, a phase-shifted version of the audio remains, which makes me suspect that the method for overlapping segments slightly shifts the new segments in time. If not, maybe this is an inherent problem with the method?
Another issue with the inference script is that generation is always cut off at a 30-second interval, so the end of the audio is dropped.
I'll try to look into it; just wondering if anyone else has dealt with these issues before. Also, does anyone have insight into whether the overlapping-segments method works well in general for long-form audio generation, or whether there are more elegant solutions for non-autoregressive architectures?