gchlebus opened this issue 6 years ago
RE: Why doesn't gradients_memory save any memory?
The memory strategy heuristic works by selecting articulation points of the computation graph as checkpoints. This seems to be the wrong approach for U-Net: the main part of the network doesn't have any articulation points. There are probably some articulation points near the edges of the network that the heuristic picks up. A bad choice of checkpoints can result in a strategy that uses more memory than the original graph, so I wouldn't use `gradients_memory` here, but instead use manual checkpoints. The choice of checkpoints for this network needs some thought.
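For illustration, a minimal sketch of what manual checkpointing could look like with this package, using a toy encoder/decoder stand-in for the U-Net; the layer sizes and the tensors chosen as checkpoints (`encoder_out`, `bottleneck`) are made up and only show where the `checkpoints` argument goes:

```python
import tensorflow as tf
import memory_saving_gradients

# Toy stand-in for the U-Net; keep handles to the tensors we want to checkpoint.
inputs = tf.placeholder(tf.float32, [None, 256, 256, 1])
encoder_out = tf.layers.conv2d(inputs, 64, 3, padding='same', activation=tf.nn.relu)
pooled = tf.layers.max_pooling2d(encoder_out, 2, 2)
bottleneck = tf.layers.conv2d(pooled, 128, 3, padding='same', activation=tf.nn.relu)
logits = tf.layers.conv2d_transpose(bottleneck, 1, 2, strides=2)
loss = tf.reduce_mean(tf.square(logits - inputs))

# Pass the chosen tensors explicitly instead of relying on the articulation-point heuristic.
grads = memory_saving_gradients.gradients(loss, tf.trainable_variables(),
                                          checkpoints=[encoder_out, bottleneck])
optimizer = tf.train.AdamOptimizer()
train_op = optimizer.apply_gradients(list(zip(grads, tf.trainable_variables())))
```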
As to why `apply_gradients` takes 25 MB more memory than `minimize`: maybe it has something to do with TensorFlow's internal graph optimizers, which also rewrite things for improved memory usage. You could use mem_util to plot a timeline of tensors and figure out the difference. You could also turn off the optimizers as below:
```python
import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

def create_session():
  # Disable graph optimizations so they don't interfere with memory measurements.
  optimizer_options = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)
  config = tf.ConfigProto(operation_timeout_in_ms=150000,
                          graph_options=tf.GraphOptions(optimizer_options=optimizer_options))
  config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF
  config.graph_options.place_pruned_graph = True
  return tf.Session(config=config)
```
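And a rough sketch of how mem_util could be used for that; I'm assuming `mem_util.peak_memory` and `mem_util.print_memory_timeline` take a `RunMetadata` proto, so check mem_util.py in this repo for the exact signatures:

```python
import tensorflow as tf
import mem_util  # mem_util.py from this repository

sess = create_session()
sess.run(tf.global_variables_initializer())

# Run one training step with full tracing so per-tensor allocations get recorded.
run_metadata = tf.RunMetadata()
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
sess.run(train_op, options=run_options, run_metadata=run_metadata)  # train_op: the op being profiled

print(mem_util.peak_memory(run_metadata))      # peak memory per device (assumed helper)
mem_util.print_memory_timeline(run_metadata)   # per-tensor allocation timeline (assumed helper)
```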
I dug a bit into the `minimize` function. It turns out that the difference in peak memory consumption between `minimize` and `tf.gradients` is caused by the fact that `minimize` (or more precisely `compute_gradients`, which is called internally) calls `tf.gradients` with `gate_gradients=True`. This is not the case when calling `tf.gradients` directly. Moreover, calling `gradients_memory(loss, tf.trainable_variables(), gate_gradients=True)` results in 66 MB peak memory usage, which is indeed the lowest score.
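So the setup that gave the 66 MB figure looks roughly like the sketch below; `loss` is the U-Net loss from the toy example linked in this issue (not repeated here), and `gradients_memory` is assumed to be the `checkpoints='memory'` wrapper from this repository:

```python
import tensorflow as tf
from memory_saving_gradients import gradients_memory  # assumed wrapper, as used in the snippets above

optimizer = tf.train.AdamOptimizer()
variables = tf.trainable_variables()

# gate_gradients=True is forwarded through to tf.gradients, mirroring what
# Optimizer.compute_gradients() does internally when minimize() is called.
grads = gradients_memory(loss, variables, gate_gradients=True)
train_op = optimizer.apply_gradients(list(zip(grads, variables)))
```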
I would be interested in whether a manual selection of checkpoints in the U-Net architecture would allow the peak memory usage to be reduced even further. How would you choose the checkpoints?
I made a peak memory vs. batch size plot for the U-Net model using `tf.gradients` and `gradients_memory`. I found the slope increase at batch size 3 for `gradients_memory` interesting. Could it be that the automatic checkpoint selection depends on the batch size?
Nope, automatic selection depends on the layout of the computation graph, and batch size doesn't change the computation graph (it just changes the size of individual nodes).
So why doesn't OpenAI implement a similar `swap_memory` strategy for `memory_saving_gradients`? I'd wager that swapping GPU memory to and from the host is faster than recomputation.
@netheril96 swapping is slow, it's 7-10x faster to recompute on GPU for most ops
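For context, `swap_memory` presumably refers to the flag on `tf.while_loop` that lets TensorFlow swap forward activations out to host memory and bring them back for the backward pass; a toy sketch of where that flag lives (the loop body here is made up):

```python
import tensorflow as tf

x = tf.random_normal([128, 1024])
w = tf.get_variable("w", [1024, 1024])

def body(i, acc):
    # GPU work whose intermediate activations TensorFlow may swap out to host memory.
    return i + 1, tf.nn.relu(tf.matmul(acc, w))

_, out = tf.while_loop(lambda i, acc: i < 10, body, [tf.constant(0), x],
                       swap_memory=True)  # enable GPU<->host swapping for loop tensors
loss = tf.reduce_mean(out)
grads = tf.gradients(loss, [w])
```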
@gchlebus I am working with a VAE which is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!
@yaroslavvb May I ask a tangential question?
What tool did you use to create that U-Net graph? It looks awesome, so I want to learn to use it too.
@netheril96 That one I just screenshotted from the U-Net paper. Not sure what tool they used for it, but it could be done easily in OmniGraffle, which is what I used for the diagrams in the blog post.
@yaroslavvb Oh. I was hoping for an automatic tool to generate beautiful graphs from code. TensorBoard visualizations are too ugly. Thanks anyway.
> @gchlebus I am working with a VAE which is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!
As far as I remember, I put one checkpoint at the lowest U-Net level. This made no difference in terms of speed or memory consumption compared to the default checkpoint locations.
@gchlebus How did you add the checkpoints? I am trying this:

```python
output = Block(nfi=64, fs=(5, 5, 5))(prev_output)  # Block with 3D convolutions
tf.add_to_collection('checkpoints', output)
```

But when I assign `tf.__dict__["gradients"] = memory_gradients` it does not find anything and raises an exception.
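For reference, a sketch of the two usage patterns I believe the repository documents (quoting the README from memory, so double-check it): either call `memory_saving_gradients.gradients` directly with `checkpoints='collection'`, or monkey-patch `tf.gradients` so that `optimizer.minimize()` picks up the memory-saving version. `Block`, `prev_output`, and `loss` are placeholders from the snippet above.

```python
import tensorflow as tf
import memory_saving_gradients

# Option 1: call the checkpointed gradients directly, using the 'checkpoints' collection.
output = Block(nfi=64, fs=(5, 5, 5))(prev_output)  # illustrative, as in the snippet above
tf.add_to_collection('checkpoints', output)
grads = memory_saving_gradients.gradients(loss, tf.trainable_variables(),
                                          checkpoints='collection')
train_op = tf.train.AdamOptimizer().apply_gradients(list(zip(grads, tf.trainable_variables())))

# Option 2 (alternative): monkey-patch tf.gradients (note the double underscores in __dict__)
# so that optimizer.minimize() uses the memory-saving version internally.
def gradients_collection(ys, xs, grad_ys=None, **kwargs):
    return memory_saving_gradients.gradients(ys, xs, grad_ys,
                                             checkpoints='collection', **kwargs)
tf.__dict__["gradients"] = gradients_collection
train_op = tf.train.AdamOptimizer().minimize(loss)
```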
I would like to use the memory saving gradients to train a U-Net model with bigger patches and/or an increased batch size. I implemented a toy example to assess the memory usage when switching from `tf.Optimizer.minimize` to the memory saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py
What I surprisingly found out is that the memory saving gradients require more memory than `tf.Optimizer.minimize`, but less memory than `tf.gradients`. I queried the peak memory usage using mem_util.py. Memory usage:

- `tf.train.AdamOptimizer().minimize(loss)`: 75 MB
- `tf.gradients(loss, tf.trainable_variables())` + `optimizer.apply_gradients()`: 107 MB
- `gradients_memory(loss, tf.trainable_variables())` + `optimizer.apply_gradients()`: 96 MB

I have two questions:
1. Why does `gradients_memory(loss, tf.trainable_variables())` + `optimizer.apply_gradients()` require more memory than `tf.train.AdamOptimizer.minimize`? Am I using the memory saving gradients wrongly?
2. What does the `minimize` function do that makes it need less memory than `tf.gradients` + `optimizer.apply_gradients()`?

I would greatly appreciate your feedback.