Closed ZFTurbo closed 2 years ago
Hello. After applying the model, the output size is slightly different from the input: for example, 9265664 samples became 9216000. This causes problems both for validation on the MUSDB dataset and for creating the inverse of an extracted stem. What is the best way to align the output to the input? Right now I do the following:

Yes, what you do is a correct way. Since the model cuts the track into short chunks, separates each chunk, and concatenates the results back together, a very small final piece can be dropped or cut off (usually a silent or trailing frame). You can align the output by its shape: typically the separated vocal is 1-2 seconds shorter than the original. This can be seen in our processing code in the model, e.g. lines 771-802 of asp_model.py.
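A minimal sketch of one way to do this alignment, assuming the missing samples were dropped from the end of the track (as the discussion above suggests); `align_to_input` is a hypothetical helper, not part of the repository's code:

```python
import numpy as np

def align_to_input(output: np.ndarray, input_len: int) -> np.ndarray:
    """Pad or trim a separated stem (samples along the last axis)
    so its length matches the original mixture."""
    out_len = output.shape[-1]
    if out_len < input_len:
        # Assume the missing samples were cut from the end of the
        # track, so zero-pad the tail (the dropped piece is usually
        # silence anyway).
        pad = [(0, 0)] * (output.ndim - 1) + [(0, input_len - out_len)]
        return np.pad(output, pad)
    # If the model produced extra samples, trim the tail instead.
    return output[..., :input_len]

# Example with the sizes from the question: a stereo stem of
# 9216000 samples is padded back to the mixture's 9265664 samples.
vocals = align_to_input(np.zeros((2, 9216000)), 9265664)
```

Once the lengths match, the inverse stem can be computed directly as `mixture - vocals`, and MUSDB validation no longer fails on the shape mismatch.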