cybertronai / gradient-checkpointing

Make huge neural nets fit in memory
MIT License

gradients_memory requires more memory than tf.Optimizer.minimize #11

Open gchlebus opened 6 years ago

gchlebus commented 6 years ago

I would like to use the memory-saving gradients to train a U-net model with bigger patches and/or an increased batch size. I implemented a toy example to assess the memory usage when switching from tf.Optimizer.minimize to the memory-saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py

Surprisingly, I found that the memory-saving gradients require more memory than tf.Optimizer.minimize, but less memory than tf.gradients. I queried the peak memory usage using mem_util.py. Memory usage:

I have two questions:

  1. How come the memory-saving gradients require more memory than tf.train.AdamOptimizer.minimize? Am I using the memory-saving gradients wrongly?
  2. Why do the peak memory usages of minimize and tf.gradients differ? I thought that the minimize function does tf.gradients + optimizer.apply_gradients().
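
For reference, here is roughly how I query the peak memory. A minimal sketch: build_unet_loss is a hypothetical stand-in for my model code, and I'm assuming the repository's mem_util.py exposes peak_memory(run_metadata).

import tensorflow as tf
import memory_saving_gradients
import mem_util  # from this repository

loss = build_unet_loss()  # hypothetical helper building the U-net and its loss
opt = tf.train.AdamOptimizer()

# Variant 1: plain minimize. For variant 2, swap in the memory-saving gradients:
#   grads = memory_saving_gradients.gradients_memory(loss, tf.trainable_variables())
#   train_op = opt.apply_gradients(zip(grads, tf.trainable_variables()))
train_op = opt.minimize(loss)

run_metadata = tf.RunMetadata()
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(train_op,
           options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
           run_metadata=run_metadata)
print(mem_util.peak_memory(run_metadata))  # peak memory in bytes, per device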

I would greatly appreciate your feedback.

yaroslavvb commented 6 years ago

RE: Why doesn't gradients_memory save any memory?

The memory-saving heuristic works by selecting articulation points (nodes whose removal disconnects the computation graph) as checkpoints. This seems to be the wrong approach for U-net: because of the skip connections, the main part of the network doesn't have any articulation points.

[screenshot: U-Net architecture diagram, taken from the U-Net paper]

There are probably some articulation points near the edges of the network that the heuristic picks up. A bad choice of checkpoints can result in a strategy that uses more memory than the original graph, so I wouldn't use gradients_memory here; instead, use manual checkpoints. The choice of checkpoints for this network needs some thought.
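
A sketch of what manual selection could look like (here `bottom` is a stand-in for whichever tensor you pick, e.g. the output of the lowest U-net level):

import tensorflow as tf
import memory_saving_gradients

# Option A: pass the tensors to checkpoint explicitly.
grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints=[bottom])

# Option B: tag tensors via the 'checkpoints' collection, then use collection mode.
tf.add_to_collection('checkpoints', bottom)
grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints='collection')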

As to why apply_gradients takes 25 MB more memory than minimize: maybe it's something to do with TensorFlow's internal optimizers, which also rewrite things for improved memory usage. You could use mem_util to plot a timeline of tensors and figure out the difference. You could also turn off the graph optimizers as below:

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

def create_session():
  # L0 disables TensorFlow's graph-level optimizations so internal rewrites don't skew measurements
  optimizer_options = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)
  config = tf.ConfigProto(operation_timeout_in_ms=150000,
                          graph_options=tf.GraphOptions(optimizer_options=optimizer_options))
  # also disable Grappler constant folding
  config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF
  config.graph_options.place_pruned_graph = True
  return tf.Session(config=config)
gchlebus commented 6 years ago

I dug a bit into the minimize function. It turns out that the difference in peak memory consumption between minimize and tf.gradients is caused by the fact that minimize (or, more precisely, compute_gradients, which it calls internally) calls tf.gradients with gate_gradients=True. This is not the case when calling tf.gradients directly. Moreover, calling gradients_memory(loss, tf.trainable_variables(), gate_gradients=True) results in 66 MB peak memory usage, which is indeed the lowest of all measurements.
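
Concretely, the call that gave 66 MB looked like this (a sketch; I'm assuming extra kwargs such as gate_gradients are forwarded to tf.gradients inside memory_saving_gradients):

import tensorflow as tf
import memory_saving_gradients

grads = memory_saving_gradients.gradients_memory(
    loss, tf.trainable_variables(), gate_gradients=True)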

I would be interested to know whether a manual selection of checkpoints in the U-net architecture would reduce the peak memory usage even further. How would you choose the checkpoints?

gchlebus commented 6 years ago

I plotted peak memory vs. batch size for the U-Net model using tf.gradients and gradients_memory. I found the slope increase at batch size 3 for gradients_memory interesting. Could it be that the automatic checkpoint selection depends on the batch size?

[plot: peak memory vs. batch size]

yaroslavvb commented 6 years ago

Nope, automatic selection depends on the layout of the computation graph, and batch size doesn't change the computation graph (it only changes the size of individual nodes).

netheril96 commented 6 years ago

So why doesn't OpenAI implement a strategy similar to swap_memory for memory_saving_gradients? I'd wager that swapping GPU memory to and from the host is faster than recomputation.

yaroslavvb commented 6 years ago

@netheril96 Swapping is slow; it's 7-10x faster to recompute on the GPU for most ops.

danieltudosiu commented 5 years ago

@gchlebus I am working with a VAE that is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

netheril96 commented 5 years ago

@yaroslavvb May I ask a tangential question?

What tool did you use to create that U-Net graph? It looks awesome, and I want to learn to use it too.

yaroslavvb commented 5 years ago

@netheril96 That one I just screenshotted from the U-Net paper. I'm not sure what tool they used for it, but it could be done easily in OmniGraffle, which is what I used for the diagrams in the blog post.

netheril96 commented 5 years ago

@yaroslavvb Oh. I was hoping for an automatic tool that generates beautiful graphs from code. TensorBoard visualizations are too ugly. Thanks anyway.

gchlebus commented 5 years ago

> @gchlebus I am working with a VAE that is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

As far as I remember, I put one checkpoint at the lowest U-net level. This made no difference in speed or memory consumption compared to the default checkpoint locations.

kuonb commented 5 years ago

@gchlebus how did you add the checkpoints? I am trying this:

output = Block(nfi=64, fs=(5,5,5))(prev_output) # Block with 3D convolutions
tf.add_to_collection('checkpoints', output)

But when I assign tf.__dict__["gradients"] = memory_gradients, it does not find anything and raises an exception.
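
For reference, the monkey-patch pattern from the README looks roughly like this (a sketch; note the double underscores in __dict__, which GitHub's markdown tends to swallow):

import tensorflow as tf
import memory_saving_gradients

# Route tf.gradients through the memory-saving version that reads the
# 'checkpoints' collection (matching the tf.add_to_collection call above).
def memory_gradients(ys, xs, grad_ys=None, **kwargs):
  return memory_saving_gradients.gradients(
      ys, xs, grad_ys, checkpoints='collection', **kwargs)

tf.__dict__["gradients"] = memory_gradients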