Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

DeepSpeed Strategy doesn't set num_checkpoints when using activation partitioning #20329

Open Gforky opened 2 weeks ago

Gforky commented 2 weeks ago

Bug description

When training with DeepSpeed and the ZeRO Stage 3 strategy, enabling activation partitioning together with contiguous_checkpointing can produce an "index out of range" error on `contiguous_data_buffers`. The issue arises because the `num_checkpoints` parameter is not passed when the activation partitioning configuration is created. As a result, DeepSpeed falls back to its module-level `num_layers` variable, whose default value is `False`, which leads to `contiguous_data_buffers` being created empty.
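For reference, a minimal sketch of a setup that should exercise the failing path. Assumptions not in the original report: a 2-GPU machine, and `BoringModel` as a stand-in for any `LightningModule` whose forward pass actually goes through DeepSpeed activation checkpointing; the strategy arguments match `DeepSpeedStrategy`'s documented parameters:

```python
# Hypothetical repro sketch (assumes 2 CUDA devices and that the model's
# forward uses deepspeed.checkpointing.checkpoint, so that
# partition_activations is actually exercised).
import lightning.pytorch as pl
from lightning.pytorch.demos.boring_classes import BoringModel
from lightning.pytorch.strategies import DeepSpeedStrategy

strategy = DeepSpeedStrategy(
    stage=3,
    partition_activations=True,           # enable activation partitioning
    contiguous_memory_optimization=True,  # DeepSpeed's contiguous_checkpointing
    cpu_checkpointing=True,
)

trainer = pl.Trainer(accelerator="cuda", devices=2, strategy=strategy, max_steps=10)
trainer.fit(BoringModel())  # num_checkpoints is never forwarded to DeepSpeed here
```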

What version are you seeing the problem on?

v2.4

How to reproduce the bug

No response

Error messages and logs

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/autograd/function.py", line 574, in apply
[rank0]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 557, in forward
[rank0]:     inputs = partition_activations(args, CPU_CHECKPOINT, CONTIGUOUS_CHECKPOINTING)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/deepspeed/runtime/activation_checkpointing/checkpointing.py", line 421, in partition_activations
[rank0]:     contiguous_data_buffers[i][data_offsets[i]].data[range(
[rank0]: IndexError: list index out of range

Environment

Current environment

```
#- PyTorch Lightning Version: 2.4.0
#- PyTorch Version: 2.4.1
#- Python version: 3.10.6
#- OS: Ubuntu-22.04
#- CUDA version: 12.1
#- GPU models and configuration: A100
#- How you installed Lightning (`conda`, `pip`, source): pip install
```

More info

No response

nocoding03 commented 2 weeks ago

Here are some of my opinions.

Causes:

- `num_checkpoints` parameter not passed: `num_checkpoints` is the number of activation checkpoints the contiguous buffers are sized for, but it is not passed to the relevant function during configuration.
- Default value of `num_layers`: because `num_layers` is never set and keeps its default value of `False`, `contiguous_data_buffers` is not initialized correctly, which causes the `IndexError`.

Possible solution:

Explicitly pass the checkpoint count so that it reaches the `partition_activations` path when activation checkpointing is configured. You can set this in the DeepSpeed configuration; note that in DeepSpeed's config schema the key belongs under `activation_checkpointing` as `number_checkpoints`, not under `zero_optimization`. For example:

```python
config = {
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "allgather_partitions": True,
        "reduce_bucket_size": 5e8,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
        "number_checkpoints": 5,  # explicitly size the contiguous buffers
    },
}
```
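Alternatively, until Lightning forwards the value, a user-side workaround could be to call DeepSpeed's `deepspeed.checkpointing.configure` API directly, which accepts `num_checkpoints` explicitly. A sketch under those assumptions (the value 5 is illustrative, and where this call must happen relative to Lightning's own DeepSpeed setup is something to verify):

```python
import deepspeed

# Configure activation checkpointing ourselves so num_checkpoints is set;
# the value should match the number of checkpointed layers/blocks in the model.
deepspeed.checkpointing.configure(
    mpu_=None,                    # no model-parallel unit in this sketch
    partition_activations=True,
    contiguous_checkpointing=True,
    num_checkpoints=5,
    checkpoint_in_cpu=True,
)
```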