RetroCirce / Zero_Shot_Audio_Source_Separation

The official code repo for "Zero-shot Audio Source Separation through Query-based Learning from Weakly-labeled Data", in AAAI 2022
https://arxiv.org/abs/2112.07891
MIT License

Different length of input and output #11

Closed ZFTurbo closed 2 years ago

ZFTurbo commented 2 years ago

Hello. After applying the model, the output length differs slightly from the input length. For example, an input of 9265664 samples becomes 9216000. This causes problems for validation on the MUSDB dataset and when creating the inverse of an extracted stem. What is the best way to align the output with the input? Right now I do the following:

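    # Trim the mixture to the (shorter) length of the separated vocals
    if audio.shape[-1] > vocals.shape[-1]:
        audio = audio[..., :vocals.shape[-1]]
    # The inverse stem is the mixture minus the vocals
    invert = audio - vocals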

RetroCirce commented 2 years ago

Yes, what you are doing is correct. The model cuts the track into short fixed-duration pieces, separates each one, and concatenates the results, so a very small final piece is dropped or cut (usually silence or the end of the track). You can align the input to the vocal shape; the vocal output is usually 1-2 seconds shorter than the original.
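
To illustrate why the output ends up shorter, here is a minimal sketch of chunked processing, assuming the incomplete final chunk is simply dropped. It is not the code from this repo: `separate_in_chunks`, `chunk_len`, and `separate_fn` are hypothetical names, and the lambda stands in for the real model.

```python
import numpy as np

def separate_in_chunks(audio, chunk_len, separate_fn):
    """Split `audio` (..., samples) into fixed-length chunks, separate each
    chunk, and concatenate the results along the sample axis.

    Hypothetical helper: the trailing samples that do not fill a whole
    chunk are dropped, so the output can be shorter than the input.
    """
    n_chunks = audio.shape[-1] // chunk_len  # integer division drops the tail
    if n_chunks == 0:
        # Input shorter than one chunk: nothing is processed
        return np.zeros(audio.shape[:-1] + (0,), dtype=audio.dtype)
    pieces = [
        separate_fn(audio[..., i * chunk_len:(i + 1) * chunk_len])
        for i in range(n_chunks)
    ]
    return np.concatenate(pieces, axis=-1)

# Toy example: 1050 samples with chunks of 100 samples
mix = np.random.randn(2, 1050)
out = separate_in_chunks(mix, 100, lambda x: 0.5 * x)  # stand-in "model"
print(mix.shape[-1], out.shape[-1])
```

In this toy case the last 50 samples never reach the stand-in model, which mirrors the length mismatch described in the question.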

You can see this in the processing code in the model, for example lines 771-802 of asp_model.py.
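
If you would rather keep the full length of the original mixture (for example, so that evaluation references do not have to be trimmed), an alternative to trimming the input is to zero-pad the separated stem. The sketch below assumes the dropped tail contains no vocals, which matches the note above that it is usually silence but is not guaranteed; `pad_to_length` is a hypothetical helper, not part of this repo.

```python
import numpy as np

def pad_to_length(x, target_len):
    """Trim or zero-pad the last axis of `x` to exactly `target_len` samples.

    Hypothetical helper for aligning a separated stem with its mixture.
    """
    cur = x.shape[-1]
    if cur >= target_len:
        return x[..., :target_len]
    # Pad only the end of the last (sample) axis with zeros
    pad_width = [(0, 0)] * (x.ndim - 1) + [(0, target_len - cur)]
    return np.pad(x, pad_width, mode="constant")

# Toy example: the stem is 50 samples shorter than the mixture
audio = np.random.randn(2, 1050)
vocals = 0.5 * audio[..., :1000]

vocals_full = pad_to_length(vocals, audio.shape[-1])
invert = audio - vocals_full  # same length as the original mixture
```

With this approach the tail of `invert` equals the tail of the mixture unchanged, so any vocal energy in those last samples would remain in the inverse stem.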