-
Hello,
Currently I am trying to run the qlora.py script with the 65B model on 2 A100 40GB GPUs, using
```accelerate launch qlora.py --args```
with ```--args``` being the ones given in the rep…
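For reference, the core of qlora.py's load step is a 4-bit NF4 quantized load sharded across the available GPUs. Below is a minimal sketch of that pattern; the model id and config values are illustrative assumptions, not the repo's exact arguments.
```python
# Illustrative sketch of a 4-bit NF4 load sharded across 2 GPUs.
# The model id and config values are assumptions, not qlora.py's exact arguments.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",         # hypothetical model id
    quantization_config=bnb_config,
    device_map="auto",              # shards layers across both A100s
)
```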
-
I hit an error in Hugging Face for which there are, strangely, zero Google search results:
"ValueError: Calculated loss must be on the original device". I can see the source code for this error in huggingf…
-
### Bug description
A basic MNIST example breaks under the FSDP strategy when combining `automatic_optimization=False` with explicit calls to `manual_backward(loss)` (a minimal repro sketch follows).
The error seems to stem from…
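A minimal sketch of that setup, assuming Lightning 2.x (MNIST data loading elided; the module body is illustrative):
```python
# Minimal repro sketch: manual optimization + manual_backward under FSDP.
# Everything except the Lightning API calls is illustrative.
import torch
import torch.nn.functional as F
import lightning.pytorch as pl

class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # the setting from the report
        self.net = torch.nn.Linear(28 * 28, 10)

    def training_step(self, batch, batch_idx):
        x, y = batch
        opt = self.optimizers()
        opt.zero_grad()
        loss = F.cross_entropy(self.net(x.flatten(1)), y)
        self.manual_backward(loss)  # the call that reportedly breaks under FSDP
        opt.step()

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# trainer = pl.Trainer(strategy="fsdp", accelerator="gpu", devices=2)
# trainer.fit(ManualOptModel(), train_dataloaders=...)  # MNIST loader elided
```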
-
### System Info
torch 2.4.1
transformers 4.46.0.dev0
trl 0.11.2
peft 0.13.1
GPU V100
CUDA …
-
Multi-GPU training for Flux: is the Flux training script not supported at the moment?
-
We need to set up the test script for our training pipeline (a sketch for the integrity check follows the list).
- Data generation: @hungphongtrn
- [ ] Check the generated audio (the audio matches the prompt)
- [ ] Check the integrity of audio files wi…
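For the integrity item, a hedged sketch; the directory layout and the use of soundfile are assumptions, not part of the pipeline described above.
```python
# Hedged sketch: flag audio files that fail to open or contain zero frames.
# The directory layout and the soundfile dependency are assumptions.
from pathlib import Path
import soundfile as sf

def check_audio_dir(audio_dir: str) -> list[Path]:
    """Return paths that fail to open or are empty."""
    bad = []
    for path in Path(audio_dir).glob("*.wav"):
        try:
            with sf.SoundFile(path) as f:
                if f.frames == 0:
                    bad.append(path)
        except RuntimeError:  # libsndfile errors surface as RuntimeError subclasses
            bad.append(path)
    return bad
```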
-
Hi, I am trying to run the SFT step using 4 A100 80GB GPUs, and it reports this error: `starting 4 processes for FSDP training setting RLIMIT_NOFILE soft limit to 1048576 from 1048576 /opt/conda/lib/python3.8/multipr…
-
**Is your feature request related to a problem? Please describe.**
Checkpointing is significantly faster with Torch Distributed's async checkpoint feature: https://pytorch.org/docs/stable/distributed…
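A minimal sketch of the requested pattern, assuming a PyTorch recent enough to ship `torch.distributed.checkpoint.async_save`; the checkpoint path and state_dict contents are illustrative:
```python
# Illustrative async-save sketch; assumes torch.distributed is initialized and
# `model` is defined. The checkpoint path is hypothetical.
import torch.distributed.checkpoint as dcp

state_dict = {"model": model.state_dict()}
future = dcp.async_save(state_dict, checkpoint_id="checkpoints/step_100")
# ... training continues while the write happens in the background ...
future.result()  # block only when the checkpoint must be durable
```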
-
### 📚 The doc issue
The function signature is `optim_state_dict(model, optim, optim_state_dict=None, group=None)`, but the example calls `optim_state_dict = FSDP.optim_state_dict_to_load(optim_st…
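For contrast, a hedged save/load sketch following the documented `(model, optim, ...)` argument order; it assumes `model` is already FSDP-wrapped, `optim` was built on it, and the file name is illustrative:
```python
# Hedged sketch of save/load with the documented argument order.
# Assumes an initialized process group, an FSDP-wrapped `model`, and its `optim`.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Save: gather a loadable optimizer state dict (file name is illustrative).
osd = FSDP.optim_state_dict(model, optim)
if dist.get_rank() == 0:
    torch.save(osd, "optim_state.pt")

# Load: convert back to the sharded layout before handing it to the optimizer.
osd = torch.load("optim_state.pt")
osd_to_load = FSDP.optim_state_dict_to_load(model, optim, osd)
optim.load_state_dict(osd_to_load)
```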
-
### System Info
A100 Nvidia 80G GPU
### Who can help?
_No response_
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] An officially supported task…