deezer / spleeter

Deezer source separation library including pretrained models.
https://research.deezer.com/projects/spleeter.html
MIT License

[Discussion] Custom Model does not separate at all. #411

Open JavaShipped opened 4 years ago

JavaShipped commented 4 years ago

Hi all,

I've posted here a few times; this spleeter thing isn't a walk in the park! BUT, I have managed to almost get there in the end. I just need help training a model: none of my models seem to work.

A little backstory about my use case: I'm trying to adapt spleeter to separate dialogue from films. In the fan edit community, this is basically the hardest part of editing release material. Some films and shows have a 'clean' centre channel with only dialogue, which makes life super easy; most do not. If I can figure this out, it will basically revolutionise fan edits, making edits that were previously impossible due to music bleed possible.

I have compiled a collection of shows and movies I could get my hands on with 100% clean centre channels for dialogue. No music bleed. This was extremely time consuming! The idea was to train a model that outputs "vocals" and "other".

I ended up with 160-odd episodes and 3 films: roughly 90 hours of raw data.

Steps so far:

  1. Using ffmpeg, I took the 5.1 Audio from the films/episodes. This gave me 6 channels: [FL][FR][FC][LFE][BL][BR] where the centre channel was entirely dialogue and no music. I verified this for every data point.

  2. Using ffmpeg, I downmixed the entire 5.1 audio to a single mono track containing dialogue, music and effects.

  3. I organised the config files as instructed, with a .csv file in which "mix_path" contains the mono full-film downmix, "vocal_path" contains the centre-channel audio, and "other_path" contains a channel with the music mix, and edited config.json with the new paths and my data directories.

  4. I ran spleeter train -p ~/Configs/filmModel.json -d ~/filmDataSet. This gave me:

    INFO:spleeter:Start model training
    INFO:spleeter:Model training done

    And this output the model into the directory "filmModel" (as the config file told it to). This took a suspiciously small amount of time for 90 hours of audio...

  5. I then used spleeter separate -i "~\Testfilm.mp4" -o Test -p "~\Configs\filmModel.json"

  6. It did its thing and output:

    INFO:spleeter:File ~/vocals.wav written succesfully
    INFO:spleeter:File ~/other.wav written succesfully

    This also took a very short amount of time, but this was less suspicious as the test file I was using is only 1m50s.

  7. I dragged my two output files into Audacity and they are practically identical.

  8. Tested with spleeter:2stems on the same ~\Testfilm.mp4 and it does output music and vocals (but it's not that good).
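For reference, the first three steps above can be sketched as follows. This is only a rough sketch, not my exact commands: the file names are placeholders, the channelsplit filter assumes a standard 5.1 layout, and the CSV layout mirrors the musdb config bundled with spleeter (one "{instrument}_path" column per instrument in the config's instrument_list, plus a duration column in seconds).

```python
import csv

def extraction_commands(source, centre_out, mix_out):
    """Build the two ffmpeg commands for one film (run them with subprocess.run)."""
    # Step 1: pull the FC (dialogue-only) channel out of the 5.1 track.
    extract = [
        "ffmpeg", "-i", source,
        "-filter_complex", "channelsplit=channel_layout=5.1:channels=FC[FC]",
        "-map", "[FC]", centre_out,
    ]
    # Step 2: downmix the whole 5.1 track to a single mono channel.
    downmix = ["ffmpeg", "-i", source, "-ac", "1", mix_out]
    return extract, downmix

def write_dataset_csv(csv_path, rows):
    # Step 3: one row per film; the column names must match the instruments
    # declared in the config's instrument_list ("vocals" -> "vocals_path").
    fields = ["mix_path", "vocals_path", "other_path", "duration"]
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```

One thing worth double-checking: if the config's instrument_list says "vocals", the CSV column should be "vocals_path"; it's worth ruling out any mismatch between the CSV headers and the instrument names.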

Am I training correctly? I can't seem to figure out why my model doesn't work! Anyone have any advice, tips or things to try?

One thing that did come to mind is that the length of these files might be too much for spleeter. I ran into issues trying to use spleeter separate on lengthy files; spleeter threw tons of errors. I currently use a 1m50s sample of a film not in my training data, which throws no errors and works fine with spleeter:2stems. No errors are thrown while training my custom model, but could file length be a possible issue?

junh1024 commented 4 years ago

I think the included 2stems model should do an ok job on dialogue. Have you tried?

JavaShipped commented 4 years ago

I think the included 2stems model should do an ok job on dialogue. Have you tried?

It actually does a pretty good job considering it's a tangential type of data to music tracks, but I really thought I'd be able to refine this further with an extensive model dedicated to film.

junh1024 commented 4 years ago
  1. I think it would be hard to improve it by a lot, but you're welcome to prove me wrong & impress me.
  2. C (the centre channel) sometimes has SFX and/or music, so it can spoil your model. L/R also sometimes have dialogue.

If you want an alternative, there's https://github.com/facebookresearch/demucs which is FULL bandwidth, but it has the same flaw as spleeter, i.e., crashing on long files.

Also, this would probably be tough, but definitely useful: a model that separates SFX from music. But I don't know how you'd get training data.

JavaShipped commented 4 years ago
  1. I think it would be hard to improve it by a lot, but you're welcome to prove me wrong & impress me.

I would be surprised if it could be improved by a significant amount, but I was hoping that one trained for the purpose of film would be a little more specific and possibly better for that specific application. When I used the spleeter:2stems model, there are weird audio artifacts in the silence (where there is no dialogue or FX).

  1. C (the centre channel) sometimes has SFX and/or music, so it can spoil your model. L/R also sometimes have dialogue.

SFX messing up training is an interesting one, as the channel I'm using for music also includes the SFX. Though if the SFX are identical in both "fullmix_path" and "other_path", this really shouldn't make a difference? This could totally be the reason I'm having trouble; it would be great to have a second opinion on this too.

In terms of L/R having dialogue: I was very particular about the sources I chose, and screened the other channels (obviously only a random sample of the L/R and rear channels), and they don't have any dialogue bleed. Note: films and shows without a clean centre channel often do put vocals in the surround channels as well; I'm not sure why, as they are so quiet it barely makes a difference.

Right now, for step one I'd settle on figuring out why my model doesn't separate at all. Beyond that, I'd ask that someone suggest a way to split a whole film file (~2h20m) using spleeter:2stems, spleeter seems to have issues with this, and it doesn't seem to be memory related as other posts have indicated.

junh1024 commented 4 years ago

I was hoping that one trained for the purpose of film would be a little more specific and possibly better for that specific application.

Voice & programme is film; voice & instruments is music. I think it's more a difference in purpose. There are some (small) differences, and it depends who you ask, but there's a lot that's the same.

When I used the spleeter:2stems model, there are weird audio artifacts in the silence (where there is no dialogue or FX).

Have you tried demucs as above? It's sometimes better. Also, you can (manually) gate afterwards.

suggest a way to split a whole film file (~2h20m) using spleeter:2stems, spleeter seems to have issues with this, and it doesn't seem to be memory related as other posts have indicated.

If you have unlimited RAM maybe it would work, but here's a splitting script https://github.com/deezer/spleeter/issues/391#issuecomment-633575082
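In the same spirit as that script, a long file can be pre-chunked with ffmpeg's segment muxer and each chunk separated on its own. A sketch, with arbitrary chunk length and file names (these are my choices, not spleeter requirements):

```python
def segment_command(source, chunk_seconds=600, pattern="chunk_%03d.wav"):
    """Build an ffmpeg command that cuts a long track into fixed-length chunks."""
    return [
        "ffmpeg", "-i", source,
        "-f", "segment",                      # use the segment muxer
        "-segment_time", str(chunk_seconds),  # chunk length in seconds
        "-c", "copy",                         # stream-copy, no re-encoding
        pattern,
    ]
```

Each chunk then goes through spleeter separate individually, and the resulting stems can be re-joined afterwards (e.g. with ffmpeg's concat demuxer).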

PS: I'm on OT but I go there only occasionally.

aidv commented 4 years ago

I've noticed that training successful models seems to be extremely hard.

tomjmanuel commented 4 years ago

I have attempted training a 2-stem model for saxophone separation. I have 60 separated sax tracks to work with. I figured there's no reason a 2-stem model can't be trained to isolate sax instead of vocals, but so far my results are terrible. Is there a reason why this shouldn't work? Also, is there any documentation on the configuration options specified in the config.json?

jonahkaplan1 commented 4 years ago

My model takes under a minute to train when there is 90 hours of data. This seems very suspicious.

That's very... suspicious. To put it lightly, no ML model training on lots of data will take less than a minute. I assume 90 hours of data is a good amount (>5 GB?), but what is the data size in GB?

I'm currently training a spleeter model, but updates have stopped after:

    INFO:spleeter:Audio data loaded successfully
    INFO:spleeter:Audio data loaded successfully

I'm on hour ~20, training on CPU, with about 2 GB of input data.

I don't know if what I'm doing is "working", but it hasn't errored out or completed yet. Just trying to help you with some benchmarking.

Unfortunately, I've found Spleeter isn't very user friendly for training custom models.

tvielott commented 3 years ago

Having the same problem here: everything for the custom model seems to run, but it doesn't actually split; each "split" output is just an identical copy of the input. I don't know how to diagnose this problem at all.

xinmingliunicky commented 3 years ago

Same trouble here: the 4-stems model I trained just gave me 4 outputs identical to the input when I ran a test. Perhaps this problem is not related to our training data. In any case, this is definitely a big issue with Spleeter.

patcon commented 3 years ago

I'm trying to adapt spleeter to separate dialogue from films. [...] If I can figure this out, it will basically revolutionise fan edits

Another use-case beyond fan edits: The best way to learn a language has always been to listen/watch/expose yourself to that language as much as possible. But some languages don't have much media -- no movies, or film or recorded content. [1] This is especially true of endangered languages.

Splitting out audio could allow people to support overdubs in little-known languages recorded by those who still speak it, which could make the effort of preserving languages that much more enjoyable and likely to succeed. Folks are perhaps more likely to watch an overdub of Sorry to Bother You in Ojibwe than to learn it on a lark from dry materials or difficult-to-find conversation partners.

[1]: h/t @erin-rtfm in this podcast interview

junh1024 commented 3 years ago

Another use-case beyond fan edits: The best way to learn a language has always been to listen/watch/expose yourself to that language as much as possible. But some languages don't have much media -- no movies, or film or recorded content. [1] This is especially true of endangered languages. Splitting out audio could allow people to support overdubs in little-known languages

This feature already works without any special tuning for specific scenarios. Please try the default models. See https://github.com/deezer/spleeter/issues/411#issuecomment-638096967

Mar-Pfa commented 3 years ago

Short question: I have the same problem here. My model training runs way too fast, and when I use the model it separates nothing. I'm creating a 2-stem approach; I had ~1400 × 3 wave files of 1 minute each for training, and the last attempt used mp3. Neither training set worked. Is there a logfile somewhere?


OK, obsolete: my configuration was not correct. Now training works.
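For anyone else hitting this: the fields I'd double-check are the CSV paths, the instrument list, and the cache/step settings. A minimal 2-stem fragment (field names as in the configs bundled with spleeter; the values here are only illustrative, not a working setup) looks roughly like:

```json
{
    "train_csv": "configs/train.csv",
    "validation_csv": "configs/validation.csv",
    "model_dir": "my_model",
    "mix_name": "mix",
    "instrument_list": ["vocals", "other"],
    "sample_rate": 44100,
    "training_cache": "cache/training",
    "validation_cache": "cache/validation",
    "train_max_steps": 100000
}
```

Note that the instrument_list names must match the "{instrument}_path" columns in the CSVs, and that if model_dir already contains a checkpoint at or past train_max_steps, training exits almost immediately, which may explain the "way too fast" symptom.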

johndpope commented 2 years ago

If your training seems to be skipping, that's on purpose, by design of the cache:

    INFO:spleeter:Start model training
    INFO:spleeter:End model training

As a sanity check, to redo training, perform some clean-up, specifically removing the cache folder!

clean.sh (make it executable with chmod +x):

    rm -rf cache
    rm -rf jp_model
    rm -rf Test
    cd pretrained_models
    rm -rf jp_model
    cd ..
    spleeter train -p ./configs/jp_config.json -d ./train

For some unknown reason, I have to move the trained model jp_model from the root folder into pretrained_models; otherwise, it attempts to download the model files from the internet.