-
### System Info
```Shell
- `Accelerate` version: 0.33.0
- `accelerate` bash location: /miniconda3/envs/SDXL/bin/accelerate
- Python version: 3.10.14
- Numpy version: 1.24.4
- PyTorch version (…
```
-
### System Info
Output from `transformers-cli env`:
```
- `transformers` version: 4.45.2
- Platform: Linux-6.1.0-21-cloud-amd64-x86_64-with-glibc2.36
- Python version: 3.12.5
- Huggingfa…
```
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
### Exp…
-
### 🐛 Describe the bug
Running the `test_fsdp_tp_integration` test with a number of GPUs that is (likely) not a power of 2 fails with, e.g.:
```
torch.testing._internal.common_distributed: [ERROR] File…
```
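For context, a minimal sketch of why such world sizes can fail, assuming the test composes FSDP with tensor parallelism over a 2-D device mesh (the mesh dimensions and `tp_size` below are illustrative assumptions, not taken from the test):

```python
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical setup, not the actual test code: composing FSDP with tensor
# parallelism uses a 2-D device mesh, so the world size must factor cleanly
# into dp_size * tp_size. With e.g. 6 GPUs and tp_size = 4, 6 % 4 != 0 and
# the mesh cannot be built.
tp_size = 4  # assumed tensor-parallel degree
world_size = dist.get_world_size()
assert world_size % tp_size == 0, (
    f"world size {world_size} is not divisible by tp degree {tp_size}"
)
mesh = init_device_mesh(
    "cuda",
    (world_size // tp_size, tp_size),
    mesh_dim_names=("dp", "tp"),
)
```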
-
Support whole-model activation offloading with FSDP, working in conjunction with activation checkpointing, via
https://github.com/pytorch/pytorch/blob/e9ebda29d87ce0916ab08c06ab26fd3766a870e5/to…
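A rough sketch of what this could look like with today's building blocks, assuming per-layer activation checkpointing via `apply_activation_checkpointing` plus whole-model offloading via `torch.autograd.graph.save_on_cpu` (the model, wrapping policy, and input shapes are illustrative, not part of the proposal):

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Illustrative model; assumes the process group is already initialized.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8), num_layers=6
)

# Checkpoint each encoder layer so only boundary activations are saved for backward.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda m: isinstance(m, nn.TransformerEncoderLayer),
)
fsdp_model = FSDP(model, device_id=torch.cuda.current_device())

# Offload whatever is still saved for backward to pinned CPU memory.
inputs = torch.randn(8, 32, 512, device="cuda")
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    loss = fsdp_model(inputs).sum()
loss.backward()
```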
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
### Exp…
-
# assert not (train_args.fsdp and train_args.gradient_checkpointing), "currently, we don't support both options. open an issue for details."
Why is this combination not supported?
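For reference, the fields in question look like Hugging Face `TrainingArguments`, and recent `transformers` releases do generally allow combining FSDP with gradient checkpointing when the non-reentrant variant is used; a minimal sketch with illustrative values (whether it is safe in this repo's particular training loop is an assumption, which may be what the assert was guarding against):

```python
from transformers import TrainingArguments

# Illustrative values only; this assumes the HF Trainer code path, where FSDP
# and gradient checkpointing can usually be combined (non-reentrant variant).
train_args = TrainingArguments(
    output_dir="out",
    fsdp="full_shard auto_wrap",
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
)
```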
-
We recently had an incident in which an accidental, temporary regression broke LoRA training due to differences in DeepSpeed and FSDP support; we want to add this type of training to the E2E cov…
-
This only occurs when using 8-bit Adam.
With FSDP1 I run into the error below, using this FSDP config:
- param_dtype: bf16
- reduce_dtype: fp32
```
Traceback (most recent call last):
File "", line 198, in _run_mo…
-
### Bug description
I'm using FSDP and model checkpointing (default settings for both). My model has 254 million parameters. I'm not sure why, but when I run `Trainer.fit()` it will successfully run t…
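A minimal sketch of the setup as described, using Lightning's `FSDPStrategy` and `ModelCheckpoint` with their defaults (the tiny module, data, and device count below are placeholders for the actual 254M-parameter model):

```python
import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset
from lightning.pytorch.callbacks import ModelCheckpoint
from lightning.pytorch.strategies import FSDPStrategy


class TinyModule(L.LightningModule):
    """Stand-in for the 254M-parameter model in the report."""

    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


train_loader = DataLoader(TensorDataset(torch.randn(64, 32), torch.randn(64, 1)), batch_size=8)

trainer = L.Trainer(
    strategy=FSDPStrategy(),        # default FSDP settings, as in the report
    callbacks=[ModelCheckpoint()],  # default checkpointing settings
    accelerator="gpu",
    devices=2,
    max_epochs=1,
)
trainer.fit(TinyModule(), train_loader)
```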