bfs18 / nsynth_wavenet

parallel wavenet based on nsynth

Training consumes over 60 GB of memory #16

Open michaelkrzyzaniak opened 6 years ago

michaelkrzyzaniak commented 6 years ago

Cool project, thanks for making it available. I pulled the code and the LJSpeech dataset, prepared the dataset, and began training with the default parameters, using the commands at the top of the README. After the program printed the line

INFO:tensorflow:Calculate initial statistics.

Python 3's memory usage grew to almost 30 GB. After the initial statistics were calculated, the memory usage dropped back to about 1 or 2 GB, and then after

INFO:tensorflow:global_step/sec: 0

It rose steadily to 60 (sixty) GB, at which point my OS killed it. Is this normal? The saved model checkpoint is only 1.2 GB.

I'm using macOS 10.13.4 (High Sierra), Python 3.6.5, TensorFlow 1.9.0 (CPU only), and librosa 0.6.1. I saw similar results on an Ubuntu 14 machine with the GPU build of TensorFlow, where I killed the program after it reached 32 GB.
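For anyone reproducing this, here is a small hypothetical helper (psutil is not a dependency of this repo) to poll the training process's resident memory and see which phase the growth happens in:

import time
import psutil

def log_rss(pid, interval=10):
    # Poll the resident set size of the given process every `interval` seconds.
    proc = psutil.Process(pid)
    while True:
        rss_gb = proc.memory_info().rss / 1024 ** 3
        print(f"RSS: {rss_gb:.1f} GB")
        time.sleep(interval)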

michaelkrzyzaniak commented 6 years ago

On further inspection, I think this is just because the model is very large by default. config_jsons/wavenet_mol.json has

"num_stages": 10, "num_layers": 30,

as compared to https://github.com/ibab/tensorflow-wavenet/blob/master/wavenet_params.json, which by default has

"dilations": [1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
              1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
              1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
              1, 2, 4, 8, 16, 32, 64, 128, 256, 512,
              1, 2, 4, 8, 16, 32, 64, 128, 256, 512
              ],

Would this correspond to the following in your project:

"num_stages": 10, "num_layers": 5,

Is that correct? Running your code this way, memory usage seems to peak at about 20 GB. (By comparison, tensorflow-wavenet stays around 5 GB.) Large, but workable.

bfs18 commented 6 years ago

Hi, I have a desktop (32 GB RAM, GTX 1080 Ti, 8700K, Ubuntu 16.04, Python 3.6, TensorFlow 1.8). I run train_wavenet.py with the default config and it never runs out of memory. Memory usage also depends on the batch size, so you can start with a small one. The default config in tensorflow-wavenet should correspond to

"num_stages": 10,
"num_layers": 50,

If the initialization consumes too much memory, just set data_dep_init_fn = None in train_wavenet.py.
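A hedged sketch of that change; the exact call site in train_wavenet.py may differ:

# In train_wavenet.py: pass None instead of the data-dependent
# initialization function, which skips the expensive
# "Calculate initial statistics" pass at startup.
data_dep_init_fn = None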