cybertronai / gradient-checkpointing

Make huge neural nets fit in memory
MIT License

gradients_memory requires more memory than tf.Optimizer.minimize #11

Open gchlebus opened 6 years ago

gchlebus commented 6 years ago

I would like to use the memory-saving gradients to train a U-net model with bigger patches and/or an increased batch size. I implemented a toy example to assess the memory usage when switching from tf.Optimizer.minimize to the memory-saving gradients: https://github.com/gchlebus/gchlebus.github.io/blob/ca55f92d816ebe4659721b61e1a1f4f3b5c3e4f1/code/profiling-tf-models/u_net.py

Surprisingly, I found that the memory-saving gradients require more memory than tf.Optimizer.minimize, but less memory than tf.gradients. I queried the peak memory usage using mem_util.py. Memory usage:

I have two questions:

  1. How come the memory-saving gradients require more memory than tf.train.AdamOptimizer.minimize? Am I using the memory-saving gradients wrongly?
  2. Why do the peak memory usages of minimize and tf.gradients differ? I thought that the minimize function does tf.gradients + optimizer.apply_gradients().
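
For reference, here is roughly how I query the peak memory. A minimal sketch: build_unet_loss is a hypothetical stand-in for my model code, and I'm assuming the repository's mem_util.py exposes peak_memory(run_metadata).

import tensorflow as tf
import memory_saving_gradients
import mem_util  # from this repository

loss = build_unet_loss()  # hypothetical helper building the U-net and its loss
opt = tf.train.AdamOptimizer()

# Variant 1: plain minimize. For variant 2, swap in the memory-saving gradients:
#   grads = memory_saving_gradients.gradients_memory(loss, tf.trainable_variables())
#   train_op = opt.apply_gradients(zip(grads, tf.trainable_variables()))
train_op = opt.minimize(loss)

run_metadata = tf.RunMetadata()
with tf.Session() as sess:
  sess.run(tf.global_variables_initializer())
  sess.run(train_op,
           options=tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE),
           run_metadata=run_metadata)
print(mem_util.peak_memory(run_metadata))  # peak memory in bytes, per device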

I would greatly appreciate your feedback.

yaroslavvb commented 6 years ago

RE: Why doesn't gradients_memory save any memory?

The memory-saving heuristic works by selecting articulation points (nodes whose removal disconnects the computation graph) as checkpoints. This seems to be the wrong approach for U-net: because of the skip connections, the main part of the network doesn't have any articulation points.

[screenshot: U-Net architecture diagram, taken from the U-Net paper]

There are probably some articulation points near the edges of the network that the heuristic picks up. A bad choice of checkpoints can result in a strategy that uses more memory than the original graph, so I wouldn't use gradients_memory here; instead, use manual checkpoints. The choice of checkpoints for this network needs some thought.
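
A sketch of what manual selection could look like (here `bottom` is a stand-in for whichever tensor you pick, e.g. the output of the lowest U-net level):

import tensorflow as tf
import memory_saving_gradients

# Option A: pass the tensors to checkpoint explicitly.
grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints=[bottom])

# Option B: tag tensors via the 'checkpoints' collection, then use collection mode.
tf.add_to_collection('checkpoints', bottom)
grads = memory_saving_gradients.gradients(
    loss, tf.trainable_variables(), checkpoints='collection')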

As to why apply_gradients takes 25 MB more memory than minimize: maybe it's something to do with TensorFlow's internal optimizers, which also rewrite things for improved memory usage. You could use mem_util to plot a timeline of tensors and figure out the difference. You could also turn off the graph optimizers as below:

import tensorflow as tf
from tensorflow.core.protobuf import rewriter_config_pb2

def create_session():
  # L0 disables TensorFlow's graph-level optimizations so internal rewrites don't skew measurements
  optimizer_options = tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)
  config = tf.ConfigProto(operation_timeout_in_ms=150000,
                          graph_options=tf.GraphOptions(optimizer_options=optimizer_options))
  # also disable Grappler constant folding
  config.graph_options.rewrite_options.constant_folding = rewriter_config_pb2.RewriterConfig.OFF
  config.graph_options.place_pruned_graph = True
  return tf.Session(config=config)
gchlebus commented 6 years ago

I dug a bit into the minimize function. It turns out that the difference in peak memory consumption between minimize and tf.gradients is caused by the fact that minimize (or, more precisely, compute_gradients, which it calls internally) calls tf.gradients with gate_gradients=True. This is not the case when calling tf.gradients directly. Moreover, calling gradients_memory(loss, tf.trainable_variables(), gate_gradients=True) results in 66 MB peak memory usage, which is indeed the lowest of all measurements.
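
Concretely, the call that gave 66 MB looked like this (a sketch; I'm assuming extra kwargs such as gate_gradients are forwarded to tf.gradients inside memory_saving_gradients):

import tensorflow as tf
import memory_saving_gradients

grads = memory_saving_gradients.gradients_memory(
    loss, tf.trainable_variables(), gate_gradients=True)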

I would be interested to know whether a manual selection of checkpoints in the U-net architecture would reduce the peak memory usage even further. How would you choose the checkpoints?

gchlebus commented 6 years ago

I plotted peak memory vs. batch size for the U-Net model using tf.gradients and gradients_memory. I found the slope increase at batch size 3 for gradients_memory interesting. Could it be that the automatic checkpoint selection depends on the batch size?

[plot: peak memory vs. batch size]

yaroslavvb commented 6 years ago

Nope, automatic selection depends on the layout of the computation graph, and batch size doesn't change the computation graph (it only changes the size of individual nodes).

netheril96 commented 6 years ago

So why doesn't OpenAI implement a strategy similar to swap_memory for memory_saving_gradients? I'd wager that swapping GPU memory to and from the host is faster than recomputation.

yaroslavvb commented 6 years ago

@netheril96 Swapping is slow; it's 7-10x faster to recompute on the GPU for most ops.

danieltudosiu commented 5 years ago

@gchlebus I am working with a VAE that is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

netheril96 commented 5 years ago

@yaroslavvb May I ask a tangential question?

What tool did you use to create that U-Net graph? It looks awesome, and I want to learn to use it too.

yaroslavvb commented 5 years ago

@netheril96 That one I just screenshotted from the U-Net paper. I'm not sure what tool they used for it, but it could be done easily in OmniGraffle, which is what I used for the diagrams in the blog post.

netheril96 commented 5 years ago

@yaroslavvb Oh. I was hoping for an automatic tool that generates beautiful graphs from code. TensorBoard visualizations are too ugly. Thanks anyway.

gchlebus commented 5 years ago

> @gchlebus I am working with a VAE that is roughly the same as the U-Net. I was wondering where you put the checkpoints? Thanks!

As far as I remember, I put one checkpoint at the lowest U-net level. This made no difference in speed or memory consumption compared to the default checkpoint locations.

kuonb commented 5 years ago

@gchlebus how did you add the checkpoints? I am trying this:

output = Block(nfi=64, fs=(5,5,5))(prev_output) # Block with 3D convolutions
tf.add_to_collection('checkpoints', output)

But when I assign tf.__dict__["gradients"] = memory_gradients, it does not find anything and raises an exception.
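
For reference, the monkey-patch pattern from the README looks roughly like this (a sketch; note the double underscores in __dict__, which GitHub's markdown tends to swallow):

import tensorflow as tf
import memory_saving_gradients

# Route tf.gradients through the memory-saving version that reads the
# 'checkpoints' collection (matching the tf.add_to_collection call above).
def memory_gradients(ys, xs, grad_ys=None, **kwargs):
  return memory_saving_gradients.gradients(
      ys, xs, grad_ys, checkpoints='collection', **kwargs)

tf.__dict__["gradients"] = memory_gradients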