Closed: frankang closed this issue 6 years ago
I'm not able to reproduce this on IWSLT. Can you share more details about the model architecture, dataset, any other command-line args, and memory usage that you're seeing before/after #33?
Thanks for the reply, I'll temporarily close this issue and may reopen it when I've done more reproducible experiments.
Hi, I just tested and reproduced this issue using only the sample data and script. If you interrupt the program after an epoch checkpoint and then continue training from that checkpoint, the newly saved checkpoint file is almost 80% larger than a normal one. Graphics card memory consumption increases as well.
The training code I was using is:
python train.py data-bin/iwslt14.tokenized.de-en --lr 1.25 --clip-norm 0.1 --dropout 0.15 --max-tokens 6000 --arch fconv_iwslt_de_en --save-dir . --save-interval 20000 --log-interval 2000 --no-progress-bar --workers 4
And the commit is
commit 30953d8bc1155fcf734b9761de7647819797e2d7
Author: James Reed jamesr66@vt.edu
Date: Tue Oct 24 14:29:52 2017 -0700
Fix for building under clang: specify C++ build and use C++ linkage (#42)
I think this is the confluence of a few things:
1) The data order is shuffled between epochs.
2) PyTorch's caching memory allocator doesn't actually release memory on the GPU, so that it can reuse those buffers later. Depending on the order of the batches, this can produce different memory usage profiles. You can observe this by changing the random seed and watching the memory usage change. I was able to get the same memory usage when resuming a run by trying a few different seeds. This is obviously not ideal, and a feature was recently added to PyTorch to free cached memory on the GPU: https://github.com/pytorch/pytorch/pull/3518. I'll experiment with this soon to try to reduce memory usage further.
3) In e432459 we started saving more information about the optimization history in the checkpoints. This is now somewhat excessive, so I will submit a fix shortly to keep only the most recent optimizer state (instead of all previous optimizer states). This is only a small factor though -- most of the increase in memory usage seems to come from the batch ordering mentioned above.
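The fix described in point 3 can be sketched in plain Python (hypothetical function and key names for illustration only; the actual fairseq-py checkpoint code differs):

```python
# Hypothetical sketch of point 3: if every resume appends the previous
# optimizer state to the checkpoint, the file grows with each restart.
# Keeping only the most recent state keeps the checkpoint size constant.

def save_checkpoint_buggy(checkpoint, optimizer_state):
    # Accumulates all historical optimizer states (grows on every resume).
    history = checkpoint.get("optimizer_history", [])
    history.append(optimizer_state)
    checkpoint["optimizer_history"] = history
    return checkpoint

def save_checkpoint_fixed(checkpoint, optimizer_state):
    # Keeps only the most recent optimizer state.
    checkpoint["last_optimizer_state"] = optimizer_state
    return checkpoint

if __name__ == "__main__":
    # Simulate three interrupt/resume cycles.
    ckpt = {}
    for epoch in range(3):
        ckpt = save_checkpoint_buggy(ckpt, {"epoch": epoch, "momentum": [0.0] * 4})
    print(len(ckpt["optimizer_history"]))  # 3 entries: checkpoint keeps growing

    ckpt = {}
    for epoch in range(3):
        ckpt = save_checkpoint_fixed(ckpt, {"epoch": epoch, "momentum": [0.0] * 4})
    print("optimizer_history" in ckpt)  # False: only the latest state is kept
```

This matches the ~80% on-disk growth reported above: each interrupted-and-resumed run re-saves the previously accumulated states on top of the new one.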
Thanks for the reply. It seems like those factors alone couldn't explain the increase in model size, both in the saved epoch checkpoint (on disk) and in the loaded checkpoint (on the GPU). I can see the memory increase before data loading even begins, and the model size only grows when training is interrupted and then resumed. I would assume it's related to the model loading process, but so far I haven't found an error in the model loading code (commits/eea50f38).
It's surprising, but true :) The attached commit should fix the checkpoint size issue.
The increase in GPU memory is related to PyTorch's caching allocator, so will require some more work.
Thanks, yes, that indeed solved the problem. It looks like the program previously accumulated historical optimizer states when saving and reloading.
@myleott Just came up with an idea that it would be better if the program releases the "last_optimizer_state" memory when it finishes loading the model. Current code keeps it in the GPU memory all the time. Again, thanks for the new commit :)
It should already be released (or at least eligible for garbage collection) after it finishes loading the model, since no references to last_optimizer_state are kept. However, for performance reasons PyTorch does not actually free the underlying memory on the GPU, so nvidia-smi still shows the memory as "used" even though it's been freed from fairseq-py's perspective. The recently added torch.cuda.empty_cache() function might help, though.
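The garbage-collection point can be illustrated without a GPU. The sketch below (illustrative names; not fairseq-py's actual loading code) uses a weak reference to show that once the loading function drops its last strong reference, the Python object is collected; on the GPU, PyTorch's caching allocator would still hold the raw memory until torch.cuda.empty_cache() is called:

```python
import gc
import weakref

# Illustrative stand-in for a loaded optimizer state (not a real torch object).
class OptimizerState:
    def __init__(self, data):
        self.data = data

def load_checkpoint(state):
    opt_state = state.pop("last_optimizer_state")
    ref = weakref.ref(opt_state)   # track liveness without keeping it alive
    # ... optimizer.load_state_dict(opt_state) would copy the values here ...
    del opt_state                  # drop the last strong reference
    gc.collect()
    return ref

if __name__ == "__main__":
    ref = load_checkpoint({"last_optimizer_state": OptimizerState([1, 2, 3])})
    print(ref() is None)  # True: the state was collected after loading
```

Note the caveat from the thread: Python-level collection does not by itself shrink the "used" number reported by nvidia-smi, because the caching allocator retains the freed blocks for reuse.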
Thanks!
Hi @myleott
I noticed that when I comment out the line
#optimizer.load_state_dict(state['last_optimizer_state'])
in the _loadstate function in utils.py, GPU memory consumption drops. Could this help reduce memory usage, since we don't need the optimizer state when resuming from an epoch checkpoint? Thanks.
Could this indicate that load_state_dict creates references when executed on the optimizer, keeping "last_optimizer_state" persistent in memory? Another explanation is, as you said above, that PyTorch keeps that memory cached for future use. Either way, I'll try the torch.cuda.empty_cache() function and report the results.
> since we do not need the optimizer state when resuming from an epoch checkpoint?
You should restore the optimizer's state otherwise you'll get different results. For example, optimizers may maintain a momentum buffer that should be restored when resuming from a checkpoint.
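The momentum point can be made concrete with a tiny plain-Python SGD-with-momentum loop (a hedged illustration, not PyTorch's optimizer code): an uninterrupted run and a run that resumes without restoring the momentum buffer end up at different parameter values.

```python
# Minimal SGD-with-momentum step: buf carries state between updates.
def sgd_momentum_step(w, grad, buf, lr=0.1, mu=0.9):
    buf = mu * buf + grad          # momentum buffer update
    return w - lr * buf, buf

if __name__ == "__main__":
    grads = [1.0, 1.0, 1.0, 1.0]

    # Uninterrupted run over four steps.
    w, buf = 0.0, 0.0
    for g in grads:
        w, buf = sgd_momentum_step(w, g, buf)

    # Interrupted after step 2; resume WITHOUT restoring the momentum buffer.
    w2, buf2 = 0.0, 0.0
    for g in grads[:2]:
        w2, buf2 = sgd_momentum_step(w2, g, buf2)
    buf2 = 0.0                     # lost state: fresh buffer on resume
    for g in grads[2:]:
        w2, buf2 = sgd_momentum_step(w2, g, buf2)

    print(w != w2)  # True: the resumed run diverges from the original
```

With identical gradients the uninterrupted run reaches w = -0.9049 while the buffer-losing resume reaches w = -0.58, which is exactly why the optimizer state must be restored when resuming from a checkpoint.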
Thanks! Forgot about the momentum stuff....
Since the recent updates, peak memory usage is roughly 10% higher at the point where the program has just finished loading the optimization history and is starting the remaining training work. Could this be caused by additional information added to the loaded state, or by a garbage collection issue?
This happens frequently when loading from an unfinished checkpoint, and I didn't see it before #33.