dustyny opened this issue 2 years ago
Hi @dustyny, I haven't trained a full spleeter model with a lot of data on the most recent versions of tensorflow, but I'm pretty sure you're not supposed to need that much RAM for training, so there should be something wrong in the data pipeline that causes a memory leak. The most obvious thing that could explain the issue is caching in RAM: if you don't provide a path to the cache method of a tensorflow dataset, it will cache in RAM, which in your setting is not possible (and I guess not necessary). So make sure you have filled in the "training_cache" (and "validation_cache") parameters in your config file. Otherwise, it may be some tensorflow behavior that changed since version 1 and that the pipeline is not handling well, and we may not be able to help with that in the short term.
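To make the distinction concrete, here's a minimal tf.data sketch (the path below is just a placeholder):

```python
import tensorflow as tf

ds = tf.data.Dataset.range(100)

# No argument: elements are cached in RAM after the first full pass.
in_memory = ds.cache()

# With a file path: elements are serialized to disk instead of RAM,
# which is what the "training_cache"/"validation_cache" settings provide.
on_disk = ds.cache("/data/spleeter_cache/training")
```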
I don't think it's RAM caching; I set those parameters, and I can see the training and validation folders have nearly 200GB of data.
The code is a bit hard to read, but I think I see an iterator in main, line 80, pulling data into a variable with nothing that clears it.. I've tried tracing it in dataset.py but I keep getting lost; there are iterators calling iterators calling iterators, and reusing the variable name "dataset" for everything makes it a real mess to follow.
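To show what I mean, a rough sketch of the pattern (not the actual spleeter code, just an illustration):

```python
import tensorflow as tf

# Rough sketch, not the actual spleeter code: every tf.data
# transformation returns a new Dataset object, and reassigning the
# same name "dataset" at every stage makes the chain hard to trace.
dataset = tf.data.Dataset.list_files("train/*.wav")       # source files
dataset = dataset.map(tf.io.read_file)                    # load raw bytes
dataset = dataset.cache("/data/spleeter_cache/training")  # to disk when a path is given
dataset = dataset.shuffle(4096)                           # shuffle buffer lives in RAM
dataset = dataset.batch(32)
```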
Hi, I'm not sure, but I think I may be in a similar situation when I train, and I'm not sure if it's a bug. I have re-trained 3 times so far using the MUSDB18HQ dataset on 2 different Win10 PCs (one has 12GB of RAM, the other 24GB), and 23-24hrs later it just stops training and hangs on the same wave file (patrick talbot - set me free). I'm not sure if I'm running out of RAM before it can complete the last 1/4, but this happens on 2 different PCs where my CPU usage is ~100%. I will max out my RAM shortly and re-train again just to see if it hangs at the same file. Have you investigated whether it's tensorflow related? thx
Have you tried redownloading the MUSDB18HQ dataset? Getting stuck on one file makes me think it has some sort of corruption. I haven't investigated whether it's a Tensorflow-related issue; I don't know the framework well enough to do that.. I think my issue was due to a VM with limited RAM mixed with some iterator that collects something in memory.. TBH I moved on to a newer OSS project.. I spent too much time and $$ trying to work this out with Spleeter..
@dustyny I don't think it's corrupt; it plays fine in Audacity, and both my folders have the correct number of files (train=100, test=50). Something is preventing it from completing the task.. and it's not lack of space, and I think I can rule out the hardware because it happens on 2 computers. Just curious how many people have actually been able to train 100% on musdb.
@deskstar90 it sounds like we have very different issues.. I haven’t used the musdb set but I do know it’s very common.
I'm not sure RAM can distinguish what type of dataset you're using; whatever it is, training will still fill the available RAM. I've tried to train at least 6 times now with the same result: it just hangs. It may not be flushing the contents of RAM so it can continue and complete the training; I don't know, that's just my impression. I will try using only half of the dataset to see what happens.
I'm working on a large data set (500GB, 100k examples) tuned for splitting drums.
Over the past day and a half I've watched spleeter slowly and steadily take more and more RAM. I think I've loaded about 128GB of WAVs (assuming the on-disk cache is about 1.5x the size of the source)... I started getting out-of-memory errors once it used 72 of the 84GB of available RAM.
When I checked out the dataset.py file, I do think I found that an iterator keeps adding to a variable. Now I'm wondering what the next best step is to handle this. I have some tight constraints on RAM when using a GPU on Google Cloud, so I can't just load up with 256GB of RAM and call it a day.
I'm confused as to why spleeter caches to disk but doesn't clear out RAM. Can I recover from this cache somehow? What should I expect for RAM sizing?
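From what I can tell from the tf.data docs, a cache written with a file path should be reusable across runs, as long as the pipeline up to the cache call is unchanged and the first pass over the data completed; something like:

```python
import tensorflow as tf

# Sketch based on the tf.data docs (the path is a placeholder): if cache
# files at this path already exist and were fully written (one complete
# pass over the data), later runs read from them instead of recomputing
# the upstream map().
dataset = tf.data.Dataset.range(1000)
dataset = dataset.map(lambda x: x * 2)
dataset = dataset.cache("/data/spleeter_cache/training")
for _ in dataset.batch(128):
    pass  # iterating writes the cache the first time, replays it after
```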
Obviously my next step is to cut down the training set, but eventually I'd like to train against the full set to see if that produces more accurate results.
Would love some insights on how I should set up the env. I've burned through about $2k of VM/GPU time.. not something I can keep doing.