Vocal Stem Quality Issue - Trained on Lossy?

Dyslexicon commented 1 year ago

Hello - this new model beats htdemucs_ft hands down, except that the vocal stem appears to be trained on lossy source material.

Drum, Bass, and "Other" stems are a gargantuan improvement over htdemucs_ft, I congratulate and thank you.

The vocal stem, however, has a band of misidentified/misassigned frequencies from 15-18 kHz, and then the response goes completely dead above 18 kHz.

You can examine this issue with Spek, an audio spectrum analyzer (https://www.spek.cc). Please compare vocal stem outputs from regular htdemucs_ft against vocal stem outputs from this new model. Side by side, you can see that the new model has spectral band issues, most likely from being trained on lossy/compressed source material. To be clear: there is a band of "garbage frequencies" that don't belong in the vocal stem, from 15-18 kHz, and then the vocal stem goes totally dead above 18 kHz, whereas regular htdemucs_ft retains astonishing quality and accuracy all the way up to 22 kHz.
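For a numeric version of the same check, here is a minimal sketch that measures how much of a stem's energy sits above a given frequency. It assumes NumPy is available; the function name `band_energy_db` and the two synthetic signals are illustrative stand-ins, not output from either model. Real stems would need to be loaded as mono float arrays first.

```python
import numpy as np

def band_energy_db(signal, sample_rate, f_lo, f_hi):
    """Energy between f_lo and f_hi (Hz), in dB relative to the total energy."""
    spectrum = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    band = (freqs >= f_lo) & (freqs < f_hi)
    total = spectrum.sum() + 1e-20  # small floor avoids division by zero
    return 10.0 * np.log10(spectrum[band].sum() / total + 1e-20)

# Hypothetical test signals: one with content at 20 kHz, one with none.
sr = 44100
t = np.arange(sr) / sr
full_band = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.sin(2 * np.pi * 20000 * t)
cut_off = np.sin(2 * np.pi * 1000 * t)

for name, stem in [("full_band", full_band), ("cut_off", cut_off)]:
    level = band_energy_db(stem, sr, 18000, sr / 2)
    print(f"{name}: {level:.1f} dB of total energy above 18 kHz")
```

A stem that goes dead above 18 kHz should report a very low figure in that band, while a full-bandwidth stem of the same song should not.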

Not complaining - this model is a massive improvement over htdemucs_ft on the other stems - but if you would please consider re-training the vocal stem so that it at least matches regular htdemucs_ft, or is slightly better, that would be fantastic. I want this model to be the new gold standard! Just a bit of tweaking and it can be.

Is there an alternate version of this model that IS trained on full-bandwidth vocal material?

This model is so close to beating htdemucs_ft hands down - but the vocal stem needs to be fixed. Thank you for your efforts!

Dyslexicon commented 1 year ago

Maybe the easiest thing to do would be to recompile the model using only the vocal stem training set from htdemucs_ft, and keep the significant improvements attained in this model for drums/bass/other.

jarredou commented 1 year ago

Band splitting, keeping only demucs_ft above 15 kHz for vocals?
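One possible reading of this suggestion, as a rough sketch: take everything below the crossover from the new model's vocal stem and everything above it from the htdemucs_ft vocal stem. The helper `crossover_merge`, the hard FFT mask, and the synthetic tones are all assumptions made for illustration; nothing here comes from the repository's code.

```python
import numpy as np

def crossover_merge(low_source, high_source, sample_rate, crossover_hz):
    """Use bins below crossover_hz from low_source and the rest from high_source."""
    if len(low_source) != len(high_source):
        raise ValueError("stems must have the same length")
    low_spec = np.fft.rfft(low_source)
    high_spec = np.fft.rfft(high_source)
    freqs = np.fft.rfftfreq(len(low_source), d=1.0 / sample_rate)
    merged = np.where(freqs < crossover_hz, low_spec, high_spec)
    return np.fft.irfft(merged, n=len(low_source))

# Hypothetical stems: the 16 kHz tone stands in for the unwanted band.
sr = 44100
t = np.arange(sr) / sr
new_model_vocals = np.sin(2 * np.pi * 1000 * t) + 0.3 * np.sin(2 * np.pi * 16000 * t)
reference_vocals = 0.9 * np.sin(2 * np.pi * 1000 * t) + 0.2 * np.sin(2 * np.pi * 17000 * t)

merged = crossover_merge(new_model_vocals, reference_vocals, sr, 15000)
expected = np.sin(2 * np.pi * 1000 * t) + 0.2 * np.sin(2 * np.pi * 17000 * t)
print("max deviation from expected:", np.max(np.abs(merged - expected)))
```

This only splices two spectra together. It does nothing to keep the vocal stem and the instrumental stems summing back to the original mixture.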

Dyslexicon commented 1 year ago

> Band splitting, keeping only demucs_ft above 15 kHz for vocals?

These two models are different enough that their outputs are not interchangeable, so if you manually attempt to combine different segments, portions will be missing from the instrumental stems and/or some samples will be additive, causing artifacts and/or phasing. Tried it already - not good.

In essence: rebuild the model incorporating the significant improvements in the 3 instrumental stems, but don't use whatever lossy/MP3/web material was used to further train the vocal stem. Instead, keep the htdemucs_ft vocal part of the training set, then recompile the full model.

jarredou commented 1 year ago

I think there is an issue when the phase cancellation occurs between the mixture and the separated stems. Some models have a frequency cutoff, so when the full-band mixture is merged with the inverted, filtered separated stem, there is no phase cancellation above the cutoff, and everything in the mixture above that point leaks into the secondary stem.
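The leak described here can be reproduced with synthetic tones. The sketch below assumes a perfect instrumental estimate that is band-limited at 15 kHz, then derives vocals by subtracting it from the mixture. The brick-wall filter, the cutoff value, and the choice of which stem is the primary one are illustrative assumptions, not details confirmed for this model.

```python
import numpy as np

def brickwall_lowpass(signal, sample_rate, cutoff_hz):
    """Zero every FFT bin above cutoff_hz (idealized stand-in for a model cutoff)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

sr = 44100
t = np.arange(sr) / sr
vocals = np.sin(2 * np.pi * 440 * t)  # true vocals: one low tone
instrumental = np.sin(2 * np.pi * 2000 * t) + 0.5 * np.sin(2 * np.pi * 16000 * t)
mixture = vocals + instrumental

# Assumed primary output: correct below 15 kHz, empty above it.
instrumental_est = brickwall_lowpass(instrumental, sr, 15000)

# Secondary stem by inversion: mixture minus the primary estimate.
vocals_by_inversion = mixture - instrumental_est

# Anything left after removing the true vocals is leakage.
leak = vocals_by_inversion - vocals
leak_rms = np.sqrt(np.mean(leak ** 2))
print("leak RMS:", leak_rms)
```

In this setup the 16 kHz instrumental tone survives in the inverted "vocals" untouched, because nothing above the cutoff was present in the estimate to cancel it.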

This also happens with the MDX-Colab from the Audio Separation Discord, but not in UVR! There must be some lowpass filtering going on somewhere.

Dyslexicon commented 1 year ago

Will wait patiently for a resolution to this, so I can go crazy on projects I've held off on for the past 2 years waiting for a stem separation model this good

Dyslexicon commented 1 year ago

Visual aid for what I'm describing here - the same song's vocals separated with htdemucs_ft and MVSEP-MDX2023:

[Spectrogram images: Vocals htdemucs_ft | Vocals MVSep-MDX2023]

False frequencies are generated in bands from about 15-18 kHz; everything above 15 kHz is useless. Overall cleanliness of the stem is poorer than plain htdemucs_ft, but MVSep does a slightly better job of avoiding misidentifying guitar sounds as vocal sounds.

If the lossy vocal model could be swapped out for a full-spectrum model (even if it is simply unaltered htdemucs_ft) and the scripting likewise fixed in the inference file (and whatever other steps are necessary), you'll have the best model available.

The 1st and 2nd place winners will not share their models so this is in fact the best model!!

Willing to make a donation if this Vocal issue is fixed such that it's as good or better than plain htdemucs_ft on vocals - it's worth a lot to me!

Thanks

kubinka0505 commented 1 year ago

>sdr above 9 >trained on lossy

Jokes aside, how is that possible?

turian commented 7 months ago

> This also happens with the MDX-Colab from the Audio Separation Discord, but not in UVR!

@jarredou which UVR vocals model is currently giving you the best results?

jarredou commented 7 months ago

@turian I'm mostly using my fork of this script (https://github.com/jarredou/MVSEP-MDX23-Colab_v2/). To separate vocals, it relies mostly on the InstVocHQ model (which is available in UVR) and a bit on the VitLarge one released by ZFTurbo with his universal training code (which is not available in UVR). I haven't changed the 4-stem part involving the multiple Demucs models, which is still one of the best freely available ensembles for drums/bass/other stems.

Repository: ZFTurbo / MVSEP-MDX23-music-separation-model (issue #2)