-
1. **Prerequisite:** Make sure the LLM Inference framework can be launched following the SPMD style. For example, the LLM inference script can be launched by `torchrun --standalone --nproc=8 offline_i…
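
As a rough illustration of what "SPMD style" means here: every worker process runs the same script and differs only in the rank information that `torchrun` exports through the `RANK`, `WORLD_SIZE`, and `LOCAL_RANK` environment variables. The sketch below is not taken from any particular inference framework; the file name `spmd_sketch.py`, the helper names, and the prompt list are hypothetical, and the strided sharding is just one possible way to divide work.

```python
import os


def spmd_context():
    """Read the rank info that torchrun exports to every worker process.

    Falls back to a single-process setup when the variables are absent.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return rank, world_size, local_rank


def shard_for_rank(items, rank, world_size):
    """Hypothetical helper: give each rank a strided slice of the work."""
    return items[rank::world_size]


if __name__ == "__main__":
    rank, world_size, local_rank = spmd_context()
    prompts = [f"prompt-{i}" for i in range(16)]  # hypothetical workload
    mine = shard_for_rank(prompts, rank, world_size)
    print(f"rank {rank}/{world_size} (local {local_rank}) handles {len(mine)} prompts")
```

Under these assumptions the script could be started with something like `torchrun --standalone --nproc_per_node=8 spmd_sketch.py`; run directly with `python`, it behaves as a single rank.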
-
### Problem Description
On the Llama3 70B proxy model, training stalls and the GPUs dump core. The GPU core dumps are 41 GB per GPU, so I am unable to send them. Probably easier for y'all to reproduce this er…
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
###…
-
### System Info
PyTorch 2.4.0, CUDA 12.1, CentOS HPC cluster with 7x H100 GPUs
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### 🐛 Describe the bug
```bash
FSD…
-
```
[rank0]: File "/opt/venv/lib/python3.10/site-packages/torch/distributed/_composable/fsdp/_fsdp_param.py", line 653, in all_gather_inputs
[rank0]: ) = sharded_local_tensor.fsdp_pre_all_gath…
-
### Please check that this issue hasn't been reported before.
- [X] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.
###…
-
### System Info
transformers==4.45.2
### Who can help?
@ArthurZucker
### Information
- [ ] The official example scripts
- [X] My own modified scripts
### Tasks
- [ ] An officially supported tas…
-
When running the [FSDP sample app](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/README-EKS.md) on HyperPod EKS cluster, I got this error.
```
[W CUDAFu…
-
Hello,
Following on from the issue at https://github.com/evo-design/evo/issues/11, which discusses fine-tuning code for Evo, I am specifically looking for information on which frameworks could be used to opti…
-
### 🐛 Describe the bug
When trying to train both LoRA layers on the base model and also setting `modules_to_save` on the LoRA config, which makes the embedding layers trainable (my assumption is it also ap…