facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Out of memory error when training from a state checkpoint #51

Closed frankang closed 6 years ago

frankang commented 6 years ago

Since the recent updates, peak memory usage is about 10% higher at the point where the program has just finished loading the optimization history and is starting the remaining training work. Could this be caused by additional info that has been added to the loaded state, or by a garbage-collection issue?

This frequently happens when loading from an unfinished checkpoint, and I didn't see it happen before the recent PR #33.

myleott commented 6 years ago

I'm not able to reproduce this on IWSLT. Can you share more details about the model architecture, dataset, any other command-line args, and memory usage that you're seeing before/after #33?

frankang commented 6 years ago

Thanks for the reply, I'll temporarily close this issue and may reopen it when I've done more reproducible experiments.

frankang commented 6 years ago

Hi, I just tested and reproduced this issue using only the sample data and script. If you interrupt the program after an epoch checkpoint and then continue training from the last checkpoint, the newly saved checkpoint file is almost 80% larger than a standard one, and GPU memory consumption increases as well.

The training command I was using is:

python train.py data-bin/iwslt14.tokenized.de-en --lr 1.25 --clip-norm 0.1 --dropout 0.15 --max-tokens 6000 --arch fconv_iwslt_de_en --save-dir . --save-interval 20000 --log-interval 2000 --no-progress-bar --workers 4

The commit is:

commit 30953d8bc1155fcf734b9761de7647819797e2d7
Author: James Reed jamesr66@vt.edu
Date: Tue Oct 24 14:29:52 2017 -0700

    Fix for building under clang: specify C++ build and use C++ linkage (#42)

myleott commented 6 years ago

I think this is the confluence of a few things:

1) The data order is shuffled between epochs.

2) PyTorch's caching memory allocator doesn't actually release memory on the GPU, so that it can reuse these buffers later. Depending on the order of the batches, this can cause different memory usage profiles. You can observe this by changing the random seed and seeing the memory usage change. I was able to get the same memory usage when resuming a run by trying a few different seeds when resuming. This is obviously not ideal, and a feature was recently added to PyTorch to free caches on the GPU: https://github.com/pytorch/pytorch/pull/3518. I'll experiment with this soon to try to reduce memory usage further.

3) In e432459 we started saving more information about the optimization history in the checkpoints. This is actually a bit excessive now, so I will submit a fix shortly to keep only the most recent optimizer state instead of all previous optimization states (see the sketch below). This is only a small factor though -- most of the increase in memory usage seems to be due to the batch ordering mentioned above.
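
As a rough illustration of point 3, here is a minimal sketch of keeping only the newest optimizer state when writing a checkpoint. The dict layout (the 'model', 'last_optimizer_state', and 'optimizer_history' keys and the helper name) is assumed for the example and is not fairseq's actual checkpoint format:

```python
import torch

def save_checkpoint_trimmed(path, model, optimizer, optimizer_history):
    """Write a checkpoint that keeps only the newest optimizer state.

    `optimizer_history` stands in for a list of optimizer state dicts
    accumulated across previous runs; dropping all but the last entry
    keeps the file (and the memory needed to reload it) from growing
    every time training is resumed.
    """
    state = {
        'model': model.state_dict(),
        'last_optimizer_state': optimizer.state_dict(),
        'optimizer_history': optimizer_history[-1:],  # newest entry only
    }
    torch.save(state, path)
```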

frankang commented 6 years ago

Thanks for the reply. It seems like the factors above can't explain the increase in "model size" in both the saved epoch checkpoint (on disk) and the loaded checkpoint (on the GPU). I can see the memory increase before data loading even starts, and the model size only grows when training is interrupted and then resumed. I would assume it's related to the checkpoint loading process, but so far I haven't found an error in the model loading code (commits/eea50f38).

myleott commented 6 years ago

It's surprising, but true :) The attached commit should fix the checkpoint size issue.

The increase in GPU memory is related to PyTorch's caching allocator, so it will require some more work.

frankang commented 6 years ago

Thanks, yes, that indeed solved the problem. It looks like the program previously accumulated the entire optimization history each time a checkpoint was saved and reloaded.

frankang commented 6 years ago

@myleott It just occurred to me that it would be better if the program released the "last_optimizer_state" memory once it finishes loading the model; the current code keeps it in GPU memory the whole time. Again, thanks for the new commit :)

myleott commented 6 years ago

It should already be released (or at least eligible for garbage collection) after it finishes loading the model, since there are no references kept to last_optimizer_state. However, for performance reasons PyTorch does not actually free the underlying memory on the GPU, thus nvidia-smi still shows the memory as being "used", even though it's been freed in fairseq-py. The recently-added torch.cuda.empty_cache() function might help though.
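
A small sketch (plain PyTorch, not fairseq code) of what this looks like in practice. The 'model' checkpoint key and the load_model_state helper are hypothetical, and memory_allocated()/memory_reserved()/empty_cache() are current PyTorch APIs, some of which postdate this thread:

```python
import torch

def load_model_state(model, checkpoint_path, device='cuda'):
    """Load model weights from a checkpoint, then release cached GPU memory."""
    state = torch.load(checkpoint_path, map_location=device)
    model.load_state_dict(state['model'])

    # Drop the last reference to the checkpoint dict (including any
    # optimizer tensors it holds) so it becomes garbage-collectable.
    del state

    print('allocated:', torch.cuda.memory_allocated(device))  # drops after del
    print('reserved :', torch.cuda.memory_reserved(device))   # stays high

    # The caching allocator keeps the freed blocks until this call,
    # which is why nvidia-smi still reports them as "used".
    torch.cuda.empty_cache()
    print('reserved after empty_cache:', torch.cuda.memory_reserved(device))
```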

frankang commented 6 years ago

Thanks!

frankang commented 6 years ago

Hi @myleott, I notice that when I comment out the line optimizer.load_state_dict(state['last_optimizer_state']) in the load_state function in utils.py, GPU memory consumption drops. Could this help reduce memory usage, since we do not need the optimizer state when resuming from an epoch checkpoint? Thanks.

frankang commented 6 years ago

Could this possibly indicate that some references are created when load_state_dict is executed on the optimizer, keeping "last_optimizer_state" persistent in memory? Another explanation is, as you said above, that PyTorch keeps the memory backing "last_optimizer_state" cached for future use. Either way, I'll try the torch.cuda.empty_cache() function and report the results.

myleott commented 6 years ago

> since we do not need the optimizer state when resuming from an epoch checkpoint?

You should restore the optimizer's state, otherwise you'll get different results. For example, optimizers may maintain a momentum buffer that should be restored when resuming from a checkpoint.
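
For illustration only (plain PyTorch with SGD, not fairseq's resume path): the momentum buffers live in the optimizer's state dict, so a resumed run only matches an uninterrupted one if that state is saved and restored along with the model. The 'ckpt.pt' path and checkpoint keys here are made up for the example:

```python
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# One training step populates the per-parameter momentum buffers.
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()

# Checkpoint both the model and the optimizer state.
torch.save({'model': model.state_dict(),
            'last_optimizer_state': optimizer.state_dict()}, 'ckpt.pt')

# When resuming, a freshly constructed optimizer has empty momentum
# buffers; skipping load_state_dict here would change the next updates.
ckpt = torch.load('ckpt.pt')
model.load_state_dict(ckpt['model'])
resumed_optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
resumed_optimizer.load_state_dict(ckpt['last_optimizer_state'])
```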

frankang commented 6 years ago

Thanks! Forgot about the momentum stuff....