Open Gforky opened 2 weeks ago
Here are some of my opinions:

Causes:

- `num_checkpoints` parameter not passed: `num_checkpoints` is the number of activation checkpoints, but it is not passed to the relevant function during configuration.
- Default value for `num_layers`: the IndexError occurs because `num_layers` is not set correctly, or is left at its default value of False, so `contiguous_data_buffers` is not initialized correctly.

Possible solutions:

- Explicitly pass the `num_checkpoints` parameter: ensure that `num_checkpoints` is correctly passed to the `partition_activations` function when activation checkpointing is configured. You can explicitly set this parameter in your DeepSpeed configuration file. For example:

```python
config = {
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "allgather_partitions": True,
        "reduce_bucket_size": 5e8,
        "contiguous_checkpointing": True,
        "num_checkpoints": 5  # Add this line to explicitly pass the num_checkpoints parameter
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True
    }
}
```
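One caveat on the config above: in DeepSpeed's documented configuration schema, the checkpoint count is spelled `number_checkpoints` and sits under `activation_checkpointing`, next to `contiguous_memory_optimization`, rather than under `zero_optimization`. A minimal sketch of that placement follows. The value 5 and the file name `ds_config.json` are illustrative assumptions, and whether the training framework forwards this key to DeepSpeed's checkpointing setup is exactly what this issue questions, so treat it as untested.

```python
import json

# Sketch only: key names follow DeepSpeed's documented config schema;
# the value 5 and the output file name are illustrative assumptions.
ds_config = {
    "zero_optimization": {
        "stage": 3,
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_scatter": True,
        "allgather_partitions": True,
        "reduce_bucket_size": 5e8,
    },
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
        "cpu_checkpointing": True,
        # Documented key for the number of activation checkpoints.
        "number_checkpoints": 5,
    },
}

# Write the config as JSON so it can be passed as a DeepSpeed config file.
with open("ds_config.json", "w") as f:
    json.dump(ds_config, f, indent=2)
```

Python's `True` is serialized as JSON `true`, so the same dict works either in memory or as a config file.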
Bug description
When training with DeepSpeed using the ZeRO Stage 3 strategy, enabling activation partitioning together with contiguous_checkpointing can raise an "index out of range" error related to contiguous_data_buffers. The cause is that the num_checkpoints parameter is not passed when the activation partitioning configuration is created. DeepSpeed therefore falls back to the global variable num_layers, whose default value is False, and contiguous_data_buffers is created empty.
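The mechanism described above can be illustrated in plain Python. This is a simplified model under stated assumptions, not DeepSpeed's actual code: `allocate_buffers` and `first_buffer` are hypothetical helpers. It relies only on the fact that `False` behaves as `0` in `range()`, so a layer count of `False` produces an empty buffer list and the first index into it fails.

```python
# Simplified model of the failure; these helpers are hypothetical
# and do not reproduce DeepSpeed's implementation.

def allocate_buffers(num_layers, partition_size=4):
    # range(False) is range(0), so a False layer count yields no buffers.
    return [[0.0] * partition_size for _ in range(num_layers)]

def first_buffer(num_layers):
    buffers = allocate_buffers(num_layers)
    return buffers[0]  # raises IndexError when the list is empty

try:
    first_buffer(False)  # mimics num_layers left at its default of False
    outcome = "ok"
except IndexError as exc:
    outcome = f"IndexError: {exc}"

print(outcome)  # IndexError: list index out of range
```

With an explicit count such as `first_buffer(5)`, the list is populated and the lookup succeeds, which is why passing the checkpoint count through is expected to avoid the error.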
What version are you seeing the problem on?
v2.4
How to reproduce the bug
No response
Error messages and logs
Environment
Current environment
```
#- PyTorch Lightning Version: 2.4.0
#- PyTorch Version: 2.4.1
#- Python version: 3.10.6
#- OS: Ubuntu-22.04
#- CUDA version: 12.1
#- GPU models and configuration: A100
#- How you installed Lightning(`conda`, `pip`, source): pip install
```
More info
No response