lucellent opened this issue 4 years ago
Hey @lucellent and thank you for your interest in Demucs. While I would be super happy to train a higher quality model, the limiting resource is not computational power or time but tracks with the individual stems broken down. This is the most limiting resource, and if you have ideas on where to get more, I'd be happy to train new versions of Demucs with it. For electronic music it is now possible to buy those, but it's fairly expensive, and probably illegal to use for machine learning due to the terms and conditions. The only other alternative is squeezing better models out of the tracks we already have, and that's an ongoing research effort with no clear solution for now.
Thanks for the insight, adefossez. Fortunately I know a place with tons of multitracks from modern and maybe older songs; let me know how I can help. Stems get posted daily, so there's lots to pick from. I know everyone would be very appreciative if you could train Demucs with more data. There is an unreleased piece of machine-learning vocal separation software that currently does the best job for metal/rock songs and sounds as close to the official stems as it gets right now (but so far it can only output the instrumental, not the other tracks), so I feel like there should be competition.
That would be very helpful indeed! Feel free to email me so we can discuss this place (defossez at fb.com). One last point, as I said, is the legal rights: even if the stems are posted there, whether I can use them is a gray area unless they are clearly released under Creative Commons or I collect written consent from the artists. It is quite important to verify this if I'm going to train and release the model using Facebook resources.
@lucellent I think it's a great idea that you want to feed the network more data and thus improve the overall quality. But I think the following scenarios (at least in my tests) should also be considered:
For one thing, the separation quality of mono sources is very poor. Data should be provided accordingly, and test cases built for it. So far, separation seems to rely mainly on phase differences; however, this is also a problem when, for example, the stereo image is disproportionately wide.
On the other hand, the stems should have different weightings. For example, a bass signal is only recognized once it has a certain dominance; quiet signals are simply ignored. In the case of vocals, the result can be very bad depending on the mixing ratio. Here it would be especially useful to build different mixes for the voice: examples where the voice is very dominant and the background is quiet, and vice versa. Speaking of voice, here are some more strange cases. In some of my tests a saxophone was included, and it was falsely recognized as voice. There is indeed a certain similarity in the frequency spectrum, but the result is still wrong. In my opinion, test cases would be needed for something like that as well.
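To illustrate the different mixing ratios, here is only a rough sketch (this is not Demucs code; the file names and gain values are placeholders, and it assumes stems that are already aligned, with the same length, channel count and sample rate):

```python
# Remix the same stems at several vocal/backing ratios so that both
# vocal-dominant and vocal-quiet versions exist as test cases.
import numpy as np
import soundfile as sf

vocals, sr = sf.read("vocals.wav", always_2d=True)   # (samples, channels)
backing, _ = sf.read("backing.wav", always_2d=True)  # pre-summed drums/bass/other

def db_to_gain(db):
    return 10.0 ** (db / 20.0)

for vocal_db in (-12, -6, 0, 6, 12):  # quiet voice ... dominant voice
    mix = db_to_gain(vocal_db) * vocals + backing
    mix /= max(1.0, float(np.max(np.abs(mix))))  # avoid clipping
    sf.write(f"mix_vocals_{vocal_db:+d}dB.wav", mix, sr)
```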
Furthermore, additional drum recordings are needed. In the high frequencies the separation is often very unclean, especially where the vocals and their S-sounds are concerned. So more data is missing here as well.
Would these requirements give you a lot of test data? It would indeed be great to see how the model behaves if many more special cases and problems were covered. In particular, some percussion instruments such as bongos are currently not recognized as drums, probably because they were simply not included in the training data.
I'm not sure; I would expect the separation quality to be much better on the train set, but just checking the separation quality would likely give a lot of false positives. A more robust mechanism is for rights owners to add a sort of watermark that will poison any model trained with it, i.e. it becomes very easy and reliable to determine whether the model used the watermarked data; see [1] for more details. I don't think rights owners use this just yet, but in the near future it might very well start to be the case.
[1] https://ai.facebook.com/blog/using-radioactive-data-to-detect-if-a-data-set-was-used-for-training/
I have now tried out several models over the last few months for separating signals from each other. They were of the most varied kinds, too many to list them all here.
My conclusion over time was that this model somehow always worked best and with the fewest artifacts. Even though it was trained on relatively little data, the results are clearly superior to other models.
One thing that is still very important here: my comparison was based on the original "demucs" model and not the light version. The light version was not as good as the original one and unfortunately did not work properly in many cases. All these tests are based purely on personal impressions, without much mathematical comparison.
Therefore I want to check in here regarding extended training. I think the quality could be improved even further by continuing to train with a larger dataset. Are there any new approaches for how to do this? Unfortunately, due to the things described above, e.g. the cost, I don't have much to contribute myself. But I think this model could definitely have a future because, in my opinion, it is clearly superior to most other models.
> Thanks for the insight, adefossez. Fortunately I know a place with tons of multitracks from modern and maybe older songs; let me know how I can help. Stems get posted daily, so there's lots to pick from. I know everyone would be very appreciative if you could train Demucs with more data. There is an unreleased piece of machine-learning vocal separation software that currently does the best job for metal/rock songs and sounds as close to the official stems as it gets right now (but so far it can only output the instrumental, not the other tracks), so I feel like there should be competition.
Hi, has this software been released yet? Thanks. And, just so I can get a clue where to look, is the place you mention for getting multitracks a private tracker?
@adefossez Hi! I read in another issue thread that you said it is not possible to add custom data (other stems and mixtures) to train the model, so I'm confused, because here and in another thread you say that training with extra data would improve the model's quality. My question is: is it possible or not? If someone else knows about this topic, could they answer me? Thank you very much for this incredible job!! Cheers. Maxi
It is technically feasible, just not easy at all with the current pipeline. I plan on adding support for that, hopefully this month or the next, as I need to add support for the Musdb HQ dataset anyway, so I might as well support arbitrary training sets.
@adefossez thanks for the quick answer! Oh, that would be amazing! When you do add support for this, let me know if I can help with the training data. I know a place on the web which has more or less 160/180 stems of pop/rock songs. The only thing I would have to do is put them in a DAW and generate the mixture for each of them. I hope you understand me; English is not my native language. Cheers!
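In fact, instead of a DAW, the mixtures could probably be generated with a small script. This is only a rough sketch (the file names are placeholders, and it assumes the stems are already aligned and share the same sample rate and channel count):

```python
# Sum aligned stem WAVs into a mixture file instead of bouncing them in a DAW.
import numpy as np
import soundfile as sf

def make_mixture(stem_paths, out_path="mixture.wav"):
    stems = []
    samplerate = None
    for path in stem_paths:
        data, sr = sf.read(path, always_2d=True)  # shape: (samples, channels)
        if samplerate is None:
            samplerate = sr
        elif sr != samplerate:
            raise ValueError(f"{path} has sample rate {sr}, expected {samplerate}")
        stems.append(data)

    # Pad shorter stems with silence so every array has the same length.
    length = max(s.shape[0] for s in stems)
    stems = [np.pad(s, ((0, length - s.shape[0]), (0, 0))) for s in stems]

    mixture = np.sum(stems, axis=0)
    # Normalize only if the straight sum would clip.
    peak = float(np.max(np.abs(mixture)))
    if peak > 1.0:
        mixture = mixture / peak
    sf.write(out_path, mixture, samplerate)

make_mixture(["drums.wav", "bass.wav", "other.wav", "vocals.wav"])
```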
Hey @adefossez, thanks for your great work here! If the limiting factor is more data, did you consider trying to make a synthetic dataset by generating stems with software instruments and then combining them to create training data? I'm not a music/sound domain expert at all so forgive me if there's an obvious reason this wouldn't work that I'm missing! Could be another easy way to get more pattern types (other than bass, vocals, etc), as well as to create more varied data by adding various combinations of effects to the sounds?
@yovizzle This sort of thing has already been done. Check out the massive Slakh2100 dataset. They actually used sample-based VIs to try to make the MIDI sound more organic, which I think is pretty cool!
@adefossez Looking through the documentation, it looks like you have switched to exclusively supporting the MusDB HQ dataset, so I'm wondering if you've also added the support for arbitrary training data mentioned above?
I'm also interested in training Demucs on a more extensive training set: the Slakh set I mentioned, for one, but I also know many musicians and I'm curious about curating my own training data... though I wouldn't want to put time into that if it's not going to work :)
To be clear, are you saying that adding labels beyond the current 4 stems is unsupported, or that Demucs is optimized specifically for Musdb, and training with any other dataset is unsupported, even if it is organized the same way as Musdb (same 4 categories)?
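For reference, by "organized the same way" I mean a Musdb-HQ-style layout, i.e. roughly one folder per track containing the mixture plus the four stems as wav files (the dataset and track names below are just examples):

```
my_dataset/
  train/
    Artist A - Song 1/
      mixture.wav
      drums.wav
      bass.wav
      other.wav
      vocals.wav
    ...
  test/
    Artist B - Song 2/
      (same five wav files)
```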
Is there going to be any further model training? I feel like where it is now it's relatively good, but it could still be improved a lot. For example, I'd like the model to be trained more on modern songs that have background vocals and different types of drums, rather than those Demucs is currently trained on.
We see a lack of ability to extract background vocals from synths, and in most modern songs that have drums with reverb, the drums get picked up without the reverb.
I'm not saying I'm not happy with demucs, but it could be improved a lot. I sadly don't have the necessary gear to be able to train it myself, nor do I know much about coding.