Closed: jona-0 closed this issue 2 years ago
Just waiting to see what @SeanNaren says but to me it looks like your observations are right!
I am happy and keen to raise a PR with this fix and add some tests*, but wanted to run this approach past someone before opening the PR, as I am not familiar with this codebase.
That would be awesome! Feel free to bring this in, high-five!
As for the tests, IMO a simple test that the deepspeed.initialize call receives the config with the correct setting from us should be sufficient, as the performance and memory usage will be a direct consequence of the third-party library functioning correctly (which is tested in their library).
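A rough sketch of what such a test could look like (an assumption-heavy sketch, not the final test: it patches deepspeed.checkpointing.configure rather than deepspeed.initialize, since that is where the fixed keyword ends up being forwarded, and the BoringModel import path, Trainer arguments, and DeepSpeedPlugin keywords are guesses that may need adjusting to the actual test suite):

```python
from unittest import mock

from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DeepSpeedPlugin
from tests.helpers.boring_model import BoringModel


@mock.patch("deepspeed.checkpointing.configure")
def test_deepspeed_cpu_checkpointing_forwarded(mock_configure, tmpdir):
    """The user-facing cpu_checkpointing flag should reach DeepSpeed as checkpoint_in_cpu."""
    model = BoringModel()
    trainer = Trainer(
        default_root_dir=tmpdir,
        gpus=1,
        precision=16,
        fast_dev_run=1,
        plugins=DeepSpeedPlugin(cpu_checkpointing=True),
    )
    trainer.fit(model)
    # Only the keyword relevant to this issue is asserted here.
    assert mock_configure.call_args.kwargs["checkpoint_in_cpu"] is True
```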
Just wondering if I should wait for another comment on this issue or open the PR now (currently in draft state). I will assume I should open it tomorrow, but wanted to check in case there was something else I should be doing first.
Thanks a lot @jona-0, this looks great, apologies for the delay! Please open the PR for reviews :)
A strange case for sure (it does seem that in DeepSpeed the variable names are different: https://www.deepspeed.ai/docs/config-json/#activation-checkpointing). The PR you've opened will fix the issue :)
🐛 Bug
We expect that when the cpu_checkpointing flag is set, GPU memory usage will be constant during the forward pass (as each layer's activations are offloaded to the CPU); see https://www.deepspeed.ai/docs/config-json/#activation-checkpointing. However, it does not do this.
I suspect this is due to a typo in DeepSpeedConfig – we set cpu_checkpointing but try to read checkpoint_in_cpu.
To Reproduce
Run once with --cpu_checkpointing and once with --checkpoint_in_cpu. Observe that cpu_checkpointing does not change the GPU memory usage, but checkpoint_in_cpu does.
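The repro script itself is not reproduced here; below is a hypothetical minimal sketch of the kind of script implied, assuming the DeepSpeedPlugin partition_activations keyword, the plugin.config["activation_checkpointing"] layout, and deepspeed.checkpointing.checkpoint in the forward pass so that the offloading flag actually matters:

```python
import argparse

import deepspeed
import torch
from torch.utils.data import DataLoader, TensorDataset

import pytorch_lightning as pl
from pytorch_lightning.plugins import DeepSpeedPlugin


class DeepModel(pl.LightningModule):
    """A deep stack of large linear layers routed through DeepSpeed activation checkpointing."""

    def __init__(self, width=4096, depth=16):
        super().__init__()
        self.layers = torch.nn.ModuleList(torch.nn.Linear(width, width) for _ in range(depth))

    def forward(self, x):
        for layer in self.layers:
            # DeepSpeed activation checkpointing; activations are offloaded to CPU when enabled.
            x = deepspeed.checkpointing.checkpoint(layer, x)
        return x

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self(x).sum()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=1e-3)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--cpu_checkpointing", action="store_true")
    parser.add_argument("--checkpoint_in_cpu", action="store_true")
    args = parser.parse_args()

    # Inject whichever key was requested into the activation-checkpointing section,
    # so the two runs can be compared directly (hypothetical; assumes plugin.config layout).
    config_key = "cpu_checkpointing" if args.cpu_checkpointing else "checkpoint_in_cpu"
    plugin = DeepSpeedPlugin(partition_activations=True)
    plugin.config["activation_checkpointing"][config_key] = True

    data = DataLoader(TensorDataset(torch.randn(8, 4096)), batch_size=1)
    trainer = pl.Trainer(gpus=1, precision=16, max_steps=4, plugins=plugin)
    trainer.fit(DeepModel(), data)
```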
Expected behavior
cpu_checkpointing should enable CPU offloading of the checkpoints and therefore constant GPU memory usage between layers, but we can see that GPU memory usage instead increases through the layers.
If we run with the suggested fix, we can see that GPU memory usage is constant between layers.
Environment
Suggested fix
I think the minimal solution here is to change checkpoint_in_cpu=checkpoint_config.get("checkpoint_in_cpu") to checkpoint_in_cpu=checkpoint_config.get("cpu_checkpointing") on line 530 of pytorch_lightning/plugins/training_type/deepspeed.py.
As far as I can tell, nothing sets checkpoint_in_cpu in the config anywhere in the Lightning codebase, so this looks like a typo. DeepSpeed is confusing here because the flag you set in the DeepSpeed config is cpu_checkpointing, but the argument to deepspeed.checkpointing.configure is called checkpoint_in_cpu.
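To illustrate the naming mismatch, a small sketch (the JSON-style key comes from the DeepSpeed docs linked above, and the keyword name from deepspeed.checkpointing.configure; this is not the actual plugin code):

```python
import deepspeed

# The DeepSpeed JSON config spells the option one way ...
ds_checkpoint_cfg = {"partition_activations": True, "cpu_checkpointing": True}

# ... while the Python API that Lightning calls spells it another way, which is
# how the "checkpoint_in_cpu" vs "cpu_checkpointing" confusion can arise.
deepspeed.checkpointing.configure(
    mpu_=None,
    partition_activations=ds_checkpoint_cfg.get("partition_activations"),
    checkpoint_in_cpu=ds_checkpoint_cfg.get("cpu_checkpointing"),
)
```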
I am happy and keen to raise a PR with this fix and add some tests*, but wanted to run this approach past someone before opening the PR, as I am not familiar with this codebase.
* I think the test would look something like: train two BoringModels with multiple layers, checkpointing, and an on_before_backward hook to store GPU memory usage. The model trained with cpu_checkpointing should have significantly lower peak GPU memory usage.
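A rough sketch of the memory-tracking part of that idea (hypothetical; the BoringModel import path and the on_before_backward hook signature are assumed from the Lightning version in use):

```python
import torch

from tests.helpers.boring_model import BoringModel


class MemoryTrackingModel(BoringModel):
    """BoringModel variant that records peak GPU memory once per training step."""

    def __init__(self):
        super().__init__()
        self.peak_memory = []

    def on_before_backward(self, loss):
        # With cpu_checkpointing working, the peaks recorded here should stay
        # roughly flat and be significantly lower than without offloading.
        self.peak_memory.append(torch.cuda.max_memory_allocated())
        torch.cuda.reset_peak_memory_stats()
```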
cc @SeanNaren @awaelchli