kncrane opened this issue 3 years ago
Hello,
Did you run into any OOM errors from graph growth when you were running these scripts or do you have any insights?
I've never experienced any OOM (I was using Google Colab to train the VAE) and I haven't used that code for a while now (I've switched to PyTorch).
The only thing I can provide you is the code I'm now using to train the AE (just made it public, I should open source the rest too): https://github.com/araffin/aae-train-donkeycar
I've been intending to switch over to PyTorch for months. Ok, thank you, I will check it out!
Describe the bug
I'm running
python -m vae.train --n-epochs 50 --verbose 0 --z-size 64 -f logs/images_generated_road_single_colour/
and getting the error `Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.17GiB (rounded to 1251090432)` after a certain number of training iterations.
I've added checkpoint saving using the `save_checkpoint()` function in vae/model.py. The training run was crashing after so many iterations when it tried to create the .meta file for the checkpoint, but I got around that by adding `write_meta_graph=False` to the `saver.save()` call.
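For reference, a minimal sketch of that workaround, assuming the TF1-style `tf.train.Saver` API; the variable and path names here are just placeholders, not the actual `save_checkpoint()` code:

```python
import tensorflow as tf

# Minimal sketch: save checkpoints without writing the .meta graph file each time,
# so repeated saves do not serialize the (growing) graph definition.
v = tf.Variable(0.0, name="dummy")
saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3):
        saver.save(sess, "./logs/vae-checkpoint", global_step=step,
                   write_meta_graph=False)  # skip the .meta file on every save
```

When I was still getting an OOM error, I added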
`self.sess.graph.finalize()` to the `_init_session()` function in vae/model.py to make the graph read-only and catch any changes to it. An exception was then raised from the line `vae_controller.set_target_params()` in vae/train.py, which in turn calls `assign_ops.append(param.assign(loaded_p))` from within `set_params()` in vae/model.py.
I was reading this article https://riptutorial.com/tensorflow/example/13426/use-graph-finalize---to-catch-nodes-being-added-to-the-graph and the memory leak I am getting sounds most like their third example: "subtle (e.g. a call to an overloaded operator on a tf.Tensor and a NumPy array, which implicitly calls tf.convert_to_tensor() and adds a new tf.constant() to the graph)."
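To make that concrete, here is a minimal, self-contained sketch (not the actual vae/model.py code) of why repeatedly calling `assign()` with a NumPy array grows the graph, and of the usual placeholder pattern that avoids it even once the graph is finalized:

```python
import numpy as np
import tensorflow as tf

param = tf.Variable(np.zeros(4, dtype=np.float32))

# Growing pattern: every call like sess.run(param.assign(loaded_p)) adds a new
# tf.constant (wrapping the NumPy array) and a new assign op to the graph --
# this is what set_params() does each time it runs.

# Non-growing pattern: build the assign op once with a placeholder and feed
# new values through feed_dict.
new_value_ph = tf.placeholder(tf.float32, shape=(4,))
assign_op = param.assign(new_value_ph)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.graph.finalize()  # read-only graph: the growing pattern would now raise
    for _ in range(3):
        loaded_p = np.random.rand(4).astype(np.float32)
        sess.run(assign_op, feed_dict={new_value_ph: loaded_p})
```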
Did you run into any OOM errors from graph growth when you were running these scripts, or do you have any insights? Cheers, Antonin
Code example
This is my training loop section from vae/train.py (validation has been added). The last line is the problem line; a rough sketch of its shape is below:
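The original snippet is not reproduced in this extract, so this is only a rough, hypothetical sketch of the loop as described; `vae`, `train_images`, `val_images` and `batch_size` are stand-in names:

```python
# Hypothetical reconstruction of the described loop, not the real vae/train.py code.
for epoch in range(args.n_epochs):
    np.random.shuffle(train_images)
    for i in range(0, len(train_images), batch_size):
        vae.optimize(train_images[i:i + batch_size])

    # validation pass added on top of the original script
    for i in range(0, len(val_images), batch_size):
        vae.compute_loss(val_images[i:i + batch_size])

    # problem line: raises once the graph is finalized
    vae_controller.set_target_params()
```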
This is the edited `_init_session()` from vae/model.py:
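Again only a rough sketch, since the original snippet is not included here; `self.graph` and `self.init` are assumed attribute names, not copied from the repository:

```python
def _init_session(self):
    """Create the TF session and mark the graph read-only (edited version)."""
    self.sess = tf.Session(graph=self.graph)
    self.sess.run(self.init)
    # Added line: any attempt to add nodes to the graph after this point
    # raises an exception instead of silently growing the graph.
    self.sess.graph.finalize()
```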
And this is the supposed source of the memory leak, within vae/model.py:
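A hedged reconstruction of the relevant part of `set_params()`; only the quoted `assign_ops.append(param.assign(loaded_p))` line is known from the issue, the surrounding code is assumed:

```python
def set_params(self, params):
    """Hypothetical sketch: copy a list of NumPy arrays into the graph variables."""
    assign_ops = []
    for param, loaded_p in zip(self.params, params):
        # Quoted problem line: each call builds a brand new assign op plus a
        # tf.constant wrapping `loaded_p`, so the graph grows on every call.
        assign_ops.append(param.assign(loaded_p))
    self.sess.run(assign_ops)
```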
Let me know if you want full scripts
System Info
Describe the characteristics of your environment:
Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15210 MB memory) -> physical GPU (device: 0, name: NVIDIA Tesla P100-PCIE-16GB, pci bus id: 0001:00:00.0, compute capability: 6.0)
Additional context
This is the full error message: