facebookresearch / demucs

Code for the paper Hybrid Spectrogram and Waveform Source Separation
MIT License

Lighter version of the model #81

Open 0xBEEEF opened 4 years ago

0xBEEEF commented 4 years ago

In the readme file in this repository you will find the following sentence:

I'll soon publish a lighter version of the model that should run with less RAM.

That announcement was made quite a while ago now, so I just wanted to ask about its status. It would be really great if this lightweight model were released as announced.

adefossez commented 4 years ago

Hey @0xBEEEF , I'm currently under a heavy work load due to the ICML deadline coming up. I will release the lighter models after that. Sorry for the wait...

adefossez commented 4 years ago

@0xBEEEF , took some time today to upload the lighter version ;) You can use -n light instead of -n demucs and it will use the light version. The light version is a 1 GB download and then about 4x faster than the normal one.
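For reference, the invocation would then look something like the sketch below. The `demucs.separate` entry point and `-n` flag are quoted from the comment above; the track filename is just a placeholder.

```shell
# Separate a track using the light model instead of the default one.
# "my_track.mp3" is a placeholder for your own audio file.
python3 -m demucs.separate -n light my_track.mp3
```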

0xBEEEF commented 4 years ago

That's just great! You're really a fast guy. I also think it's great that you have made a few more optimizations.

But now I still have a few (maybe stupid) questions:

  1. If I don't want to separate 4 sources as suggested, but only 2, will the load be reduced so that I need less memory? I would only be interested in separating voice/vocals from the accompaniment, like Spleeter does.

  2. A question about training duration. Unfortunately, I can't afford the high-end graphics cards described in the readme to reach the same training speed as you. But do you have a rough guideline for how much time training takes on a consumer high-end card such as the GeForce RTX 2080? It doesn't have to be exact to the second, just an approximate value.

  3. Is further development of the model planned? I have tried various models so far, and this one delivers outstanding results, especially with strongly overlapping signals. I have only noticed that, for example, very deep male voices are recognized as "bass", and some speech sounds are recognized as "drums". But all in all this model is ingenious! And the loss of quality is not as strong as with the spectrogram-based ones.

All in all, congratulations to you and your whole team for developing this great model.

adefossez commented 4 years ago

Thanks for the feedback :)

  1. That would require training a new model. It might reduce the model size or speed things up, but not by much, I think, particularly compared with the light model. The bulk of the model is shared across all sources; only the last layer has weights specific to each output source.

  2. You can train on a single GPU, but it might indeed take a while. You will have to use a small number of channels (48 or 64) and a batch size of 4; in that case I estimate the training time to be on the order of a week. You can also train for slightly fewer epochs than I did (60 to 80 epochs will get you most of the performance, maybe even fewer, since a smaller batch size means more iterations per epoch). Use --split_valid to limit the amount of memory used at evaluation time. If you start training but interrupt it before completion, just re-run with the exact same command-line flags plus --save_model: this loads the checkpoint and saves the current best model to a corresponding file under the models folder, which you can then use with demucs.separate.
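Putting those hints together, a single-GPU run might look roughly like the sketch below. Only --split_valid and --save_model are quoted from the comment above; the entry point and the spellings of the channels and batch-size flags are assumptions, so check the repository's training documentation for the exact names.

```shell
# Hypothetical single-GPU training sketch. The "python3 -m demucs" entry
# point and the --channels/--batch_size flag spellings are assumptions;
# only --split_valid and --save_model come from the maintainer's comment.
python3 -m demucs --channels=64 --batch_size=4 --split_valid

# If training was interrupted, re-run with the exact same flags plus
# --save_model to export the current best checkpoint under models/:
python3 -m demucs --channels=64 --batch_size=4 --split_valid --save_model
```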

  3. I'm getting close to the end of my PhD. I have other projects to complete, but I might still work a bit on this topic. Sadly, I don't have the bandwidth to turn this into a complete, polished package with docs etc.

junh1024 commented 4 years ago

RE: the lighter model, is this about reducing the numerical precision of the tensors?

Also, a voice + others model would be nice.