Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DeepSpeed Strategy doesn't set num_checkpoints when using activation partitioning #20329

Open Gforky opened 2 weeks ago

Gforky commented 2 weeks ago

Bug description

When training with DeepSpeed and the ZeRO Stage 3 strategy, enabling activation partitioning together with contiguous_checkpointing can produce an "index out of range" error on `contiguous_data_buffers`. The issue arises because the `num_checkpoints` parameter is not passed when the activation partitioning configuration is created. As a result, DeepSpeed falls back to its module-level `num_layers` variable, whose default value is `False`, which leads to `contiguous_data_buffers` being created empty.
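For reference, a minimal sketch of a setup that should exercise the failing path. Assumptions not in the original report: a 2-GPU machine, and `BoringModel` as a stand-in for any `LightningModule` whose forward pass actually goes through DeepSpeed activation checkpointing; the strategy arguments match `DeepSpeedStrategy`'s documented parameters:

```python
# Hypothetical repro sketch (assumes 2 CUDA devices and that the model's
# forward uses deepspeed.checkpointing.checkpoint, so that
# partition_activations is actually exercised).
import lightning.pytorch as pl
from lightning.pytorch.demos.boring_classes import BoringModel
from lightning.pytorch.strategies import DeepSpeedStrategy

strategy = DeepSpeedStrategy(
    stage=3,
    partition_activations=True,           # enable activation partitioning
    contiguous_memory_optimization=True,  # DeepSpeed's contiguous_checkpointing
    cpu_checkpointing=True,
)

trainer = pl.Trainer(accelerator="cuda", devices=2, strategy=strategy, max_steps=10)
trainer.fit(BoringModel())  # num_checkpoints is never forwarded to DeepSpeed here
```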

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 557, in forward
[rank0]:     inputs = partition_activations(args, CPU_CHECKPOINT, CONTIGUOUS_CHECKPOINTING)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 421, in partition_activations
[rank0]:     contiguous_data_buffers[i][data_offsets[i]].data[range(
[rank0]: IndexError: list index out of range

Environment

Current environment

```
#- PyTorch Lightning Version: 2.4.0
#- PyTorch Version: 2.4.1
#- Python version: 3.10.6
#- OS: Ubuntu-22.04
#- CUDA version: 12.1
#- GPU models and configuration: A100
#- How you installed Lightning (`conda`, `pip`, source): pip install
```

More info

No response

nocoding03 commented 2 weeks ago

Here are some of my opinions.

Causes:

- `num_checkpoints` parameter not passed: `num_checkpoints` is the number of activation checkpoints the contiguous buffers are sized for, but it is not passed to the relevant function during configuration.
- Default value of `num_layers`: because `num_layers` is never set and keeps its default value of `False`, `contiguous_data_buffers` is not initialized correctly, which causes the `IndexError`.

Possible solution:

Explicitly pass the checkpoint count so that it reaches the `partition_activations` path when activation checkpointing is configured. You can set this in the DeepSpeed configuration; note that in DeepSpeed's config schema the key belongs under `activation_checkpointing` as `number_checkpoints`, not under `zero_optimization`. For example:

```python
config = {
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "allgather_partitions": True,
        "reduce_bucket_size": 5e8,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
        "number_checkpoints": 5,  # explicitly size the contiguous buffers
    },
}
```
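Alternatively, until Lightning forwards the value, a user-side workaround could be to call DeepSpeed's `deepspeed.checkpointing.configure` API directly, which accepts `num_checkpoints` explicitly. A sketch under those assumptions (the value 5 is illustrative, and where this call must happen relative to Lightning's own DeepSpeed setup is something to verify):

```python
import deepspeed

# Configure activation checkpointing ourselves so num_checkpoints is set;
# the value should match the number of checkpointed layers/blocks in the model.
deepspeed.checkpointing.configure(
    mpu_=None,                    # no model-parallel unit in this sketch
    partition_activations=True,
    contiguous_checkpointing=True,
    num_checkpoints=5,
    checkpoint_in_cpu=True,
)
```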