Oh, I didn't realize this was hardcoded for the MUSDB set.
@adefossez I'm willing to help generalize it.. looks like others are interested as well.. Also happy to share how I built my data set and the results.
I used V100s for training. I was using the 32 GB versions; you could get away with 16 GB by trimming down the model. CPUs are not that important: aside from the pitch/tempo augmentation there is no heavy processing.
You will need space for storing checkpoints, but not that much (about 2 GB per experiment). You will not need extra space as long as your dataset is presented in the right format (e.g. wavs in folders).
Wavs are streamed for each training batch, so overall RAM usage is not that big.
Overall, the number of GPUs, GPU speed and the amount of memory per GPU are the key factors. You can easily use multiple GPUs if they are on the same machine (the -d flag in dora run). Multi-node training is not really supported out of the box outside of the Facebook cluster.
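Concretely, the command shape is just the following (only the flag mentioned above; any other options depend on your config):

dora run        # roughly: train locally on a single GPU / default device
dora run -d     # train on all GPUs of the current machine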
@adefossez I really appreciate the information.
From my perspective, whether my model works great at extracting drum stems or just creates harsh digital experimental sounds, it's a win-win for me (I'm an industrial musician).. =D
If I follow what you're doing with the wav processing (repitch, remix, scale, timestretch) correctly, I've taken a different approach that should accomplish the same end. I wrote a random loop builder that generated hundreds of thousands of different drum loops, one for each category type. These loops mix and match tempos and don't adhere to a time signature or grid; that's one premise I'm testing.. I think the drum sounds cover the frequency ranges (bass kicks being between 50-100 Hz) for each type, so I don't think it will be necessary to repitch them. I also have plenty of examples and can scale up pretty easily to make more, so I don't think I need to remix.
If I want to bypass the wav preprocessing, it looks like I have to set the following key/value pairs in the YAML file.. is this correct?
augment:
  shift_same: false   # <---- is this supposed to be true or false?
  scale:
    proba: 0
  remix:
    proba: 0
  repitch:
    proba: 0
For the shift_same argument, it should probably be false if you want to skip any kind of data augmentation.
@dustyny I would love a short write-up, if you could, on building a data set - I think others would appreciate it as well. I think I see most of the pieces but would appreciate anything you could provide on it :)
@mr-segfault I realized my first data set had some errors like clipping. I had to rewrite the generator script, so I made some improvements.. this one runs much slower though (almost 10x), but it also generates a lot more data. I'll wait until I've verified the approach before writing up anything too in-depth.
So what I did was take thousands of single-hit drum samples and a Python-based sampler (musicpy), then sequenced hundreds of thousands of 12-second loops. I randomly mix different classes of percussion into a track. So my "brass" section might have an open and closed hat with a crash, while the drums are a mix of bass drums, congas, toms, etc. I then randomize the pitch, stereo balance and overall volume, and finally create the mixed loop. All of the assets are saved, so I can mix them around easily to create more.
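Roughly, the per-stem randomization boils down to something like this (a simplified sketch with librosa/numpy, assuming stereo files and a recent librosa; the ranges and the function name are just for illustration, not my exact script):

import numpy as np
import librosa

def randomize_stem(path, sr=44100):
    # load as stereo, shape (2, n_samples); mono files would need extra handling
    y, _ = librosa.load(path, sr=sr, mono=False)
    # random repitch of up to +/- 2 semitones
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=np.random.uniform(-2.0, 2.0))
    # random stereo balance: attenuate one side a little
    balance = np.random.uniform(-0.5, 0.5)   # < 0 favors left, > 0 favors right
    gains = np.array([1.0 - max(balance, 0.0), 1.0 + min(balance, 0.0)])
    y = y * gains[:, None]
    # random overall volume, then peak-normalize so nothing clips
    y = y * np.random.uniform(0.3, 1.0)
    peak = np.abs(y).max()
    if peak > 1.0:
        y = y / peak
    return y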
I'm hoping to start training this weekend and expect it to run for a couple of weeks, but I'll try to use multi-GPU if possible to speed it up.
@dustyny that's awesome, thank you for that. I appreciate the insight; quite clever about stitching loops together. A couple of questions if you don't mind (if you don't have time/capacity, no problem): are there any desirable file attributes for the source training data, such as file type (flac/wav), sample rate or bit depth (i.e. is it optimal to be a 44100 Hz, 16-bit wav), or does it not matter?
Also, is a file length of 12 seconds sufficient for training purposes? (i.e. are there any gains to be made with longer files?)
And last, do you know if there is a point of diminishing returns in training? (i.e. I was going to assemble a few hundred to a few thousand sets of stems + masters, but was wondering when that might stop yielding better results.)
What I was hoping to do:
I have Native Instruments Kontakt (and numerous libraries) and a very large amount of MIDI that I was going to hand-jam through my DAW; I can crank out a tune quite quickly, perhaps a few minutes at most (clearly it's not professionally mixed or mastered). If I can get away with much shorter portions of songs, I think that makes the process a bit easier overall and lets me crank out bigger numbers.
I think if I make a training folder with the parts + master from that exercise, 500-1000 or so 'songs' could be a possible improvement to the model, but who knows, it might be under-the-hood stuff that has to improve rather than just feeding the model more data!
TBH I don't know what will work best for this model; I'm more of a data engineer than a scientist.
From what I've found, demucs (and similar models) split the song into short segments (10-12 secs). Since I'm focused on drums I didn't care about how a melody might evolve over time, so I haven't tried to figure that out.. Some of the other models I looked through didn't seem to care about continuity either; they grabbed a bunch of sections of a song. So my assumption is the model doesn't care about how one segment flows into the next.. it's more about what it needs to do to split this one 10-second segment back into its stems.. But I could be absolutely wrong about that.. I think it's worth testing regardless..
I think your approach makes sense.. If you know Python/Lua (they have their own language, EEL2, as well) I'd take a look at Reaper.. they have an API that could help you automate the process. Load up a random bass sound, select some 8-bar MIDI loops and bounce each channel to wav.. If you put a MIDI key remapper in the chain (say you remap notes to F major), you won't need to know what key the MIDI loop is in.. Also, if you have loops broken up by type (bass, arps, etc.) you can save some rendering time by just using them (you can always time-stretch or chop them to length). You can even add FX into the chain as a follow-up iteration if you find the model doesn't do well with reverb, flangers, etc..
https://www.reaper.fm/sdk/reascript/reascript.php
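If you'd rather do the key remapping offline instead of with a remapper plugin in the DAW, a rough sketch with the mido library would be something like this (file names are made up, and snapping every note down to the nearest F-major pitch class is just one simple choice):

import mido

F_MAJOR = {0, 2, 4, 5, 7, 9, 10}  # pitch classes of F major (C, D, E, F, G, A, Bb)

def snap_to_f_major(note):
    # lower the note until its pitch class lands in the scale (at most one semitone)
    while note % 12 not in F_MAJOR:
        note -= 1
    return note

mid = mido.MidiFile("loop.mid")
for track in mid.tracks:
    track[:] = [
        msg.copy(note=snap_to_f_major(msg.note))
        if msg.type in ("note_on", "note_off") else msg
        for msg in track
    ]
mid.save("loop_in_f_major.mid")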
I assume we don't need to worry about it being mixed properly as long as you're not clipping the channels. My premise is that real music has a huge variety of frequencies, dynamics, balance, etc, so variety is probably more impactful.
Awesome information. Thank you kindly!
I will give it a go with shorter audio segments. Good thinking with respect to dynamics/frequencies being 'real-world' as a benefit; I like that premise, it seems reasonable!
Good tip about scripting Reaper. I was going to try Ableton Live / Max4Live (as that's my current DAW), but if that goes south I like that you shared the Reaper option; appreciated!
I happen to have a handful of already-made projects that I might start with manually (or at least fiddle with) to prepare an initial set and see if I can get the model training to run end to end. After that's verified to do 'something', I'll circle back to Max4Live, and if that gets too complicated take a look at Reaper.
Appreciate the reply, dustyny!
@dustyny : do you plan to publish either your drum loop stem creation toolchain and/or your model? if so, where could I subscribe to be notified?
I had to take a bit of a diversion to get my data sorted out.. The drum sample libraries I've collected over the years have a lot of mislabeled files. I'm using a few commercially produced products to curate the set.
For my loop generator.. I can release it on GitHub once I validate the approach. I could be wrong, but my assumption is that drum timing can be all over the place in real music; it doesn't have to follow the bar and note divisions. Until I can prove that, this code might just produce unusable junk.
That will have to wait a little while; I'm really caught up in building & cleaning the data set (I have a few hundred thousand one-shot drum samples). I don't get much time to work on this project, so I need to stay focused until the model is built.
In the meantime it's a pretty easy thing to roll your own. I use librosa to open the file as a numpy array, then randomly place the sample data into a 10-second-long numpy array filled with 0.0s. With this approach and Ray for multiprocessing I can generate a few hundred thousand loops in a few hours.
sample = wav_in_np_array
empty_np_10s = np_zeros
list_of_array_coordinates = random_select(1, 9)      # how many hits to put in the array
list_of_note_lengths = sample_chop_to_note_length()  # returns the sample in smaller chunks, e.g. sample[:1/2note]
loop = empty_np_10s[coordinate:] + list_of_note_lengths[random]  # add a random sample chunk to the np array
save_loop_to_file(loop)
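Fleshed out into something runnable, the idea is roughly this (a minimal sketch; the names, the 1-9 hit count and the use of soundfile for writing are just my choices here, not the exact script):

import numpy as np
import librosa
import soundfile as sf

SR = 44100
LOOP_SECONDS = 10

def make_loop(sample_path, out_path, max_hits=9):
    # one-shot drum sample as a mono float array (assumed much shorter than the loop)
    sample, _ = librosa.load(sample_path, sr=SR, mono=True)
    loop = np.zeros(LOOP_SECONDS * SR, dtype=np.float32)
    # drop a random number of hits at random start positions
    for _ in range(np.random.randint(1, max_hits + 1)):
        start = np.random.randint(0, len(loop) - len(sample))
        loop[start:start + len(sample)] += sample
    # peak-normalize so overlapping hits never clip
    peak = np.abs(loop).max()
    if peak > 1.0:
        loop /= peak
    sf.write(out_path, loop, SR)

make_loop("kick_001.wav", "loop_000001.wav")   # file names are just examples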
@dustyny how did your training model end up?
@thelatebphelium It didn't go very well.. I ran into so many issues that by the time I finally got it trained (weeks of effort) it didn't perform at all.. I gave up; I didn't have the energy to keep going.
I'd like to create a model that I can use to demix drum loops. I have 500 GB of data containing over 100k examples (600k WAVs); each is 13 seconds long, 16-bit, 44.1 kHz stereo.
drumloop_mixed = { bass_drum, snares_claps, tuned_drums, brass_percussion & other_sounds }
What should I change (config file, CLI parameters) to customize it for extracting the stems/tracks listed above?
Does this model support splitting into 5 stems?
What file structure do I need to have in place or do I need to create a JSON or a CSV with the file locations?
Do I need to have a scratch or caching disk? If so, how much space should I plan for (such as 2x the dataset size)?
What should I expect in terms of RAM? Will I need a very large amount of RAM or will it stream the WAV data off disk?
Does demucs use multiple CPUs? Given I have so many wav files, does it make sense to give it a large number of CPUs (64, 96) to speed up preprocessing?
Will demucs use the GPU right away, or does it start by processing the files on the CPU and then switch to the GPU once preprocessing is done?
Can I use multiple GPUs to speed up training? Are there any limits on what type of GPUs (I'm thinking Nvidia V100s)?
Is there anything that I didn't ask but I should know?
Much appreciated, can't wait to get this experiment running.. 🤓