-
FSDP is a toolkit for distributed model training and an alternative to DeepSpeed. The InstructLab team has added support for FSDP in addition to DeepSpeed in their training repo and we would like to …
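The training repo itself isn't quoted here, but for orientation, a minimal FSDP wrap in plain PyTorch looks roughly like this (the toy model and process-group setup are illustrative, not InstructLab's code):
```python
# Minimal FSDP sketch; run under torchrun so the process-group env vars exist.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
).cuda()

# FSDP shards parameters, gradients, and optimizer state across ranks,
# gathering full parameters only around each wrapped module's forward/backward.
model = FSDP(model)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```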
-
```
7: [rank80]: urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='huggingface.co', port=443): Read timed out. (read timeout=10)
```
Running the FSDP example on 16 p5 nodes. The example w…
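The 10-second read timeout matches huggingface_hub's default metadata timeout, so with 16 nodes hitting the Hub at startup this is plausibly a network/rate-limit issue rather than a training bug. A hedged workaround sketch (the model id is a placeholder, not from the report):
```python
# Pre-download once (e.g. from rank 0 or a setup job), then train offline.
import os
from huggingface_hub import snapshot_download

snapshot_download("meta-llama/Meta-Llama-3-8B")  # populates the local cache

os.environ["HF_HUB_OFFLINE"] = "1"   # subsequent loads read the cache only
# Alternatively, raise the metadata timeout instead of going fully offline:
# os.environ["HF_HUB_ETAG_TIMEOUT"] = "60"
```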
-
On `PP + FSDP` and `PP + TP + FSDP`:
- Is there any documentation on how these different parallelisms compose? (A rough sketch follows this list.)
- What are the largest training runs these strategies have been tested on?
- Are there…
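Not documentation, but for concreteness: recent PyTorch composes these parallelisms over an N-D device mesh, with each strategy consuming one mesh dimension. A sketch with mesh sizes invented for illustration (2 x 4 x 2 = 16 GPUs):
```python
# Hypothetical 3-D mesh layout for PP + FSDP + TP; the dimension names are
# a common convention, not a fixed API requirement.
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh(
    "cuda",
    (2, 4, 2),  # pipeline stages x FSDP shards x tensor-parallel ranks
    mesh_dim_names=("pp", "dp_shard", "tp"),
)

pp_mesh = mesh["pp"]          # pipeline stages placed along this dim
fsdp_mesh = mesh["dp_shard"]  # FSDP shards parameters along this dim
tp_mesh = mesh["tp"]          # tensor parallelism splits layers along this dim
```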
-
### System Info
```Shell
Latest main version, torch nightly, cuda 12.6
```
### Information
- [ ] The official example scripts
- [ ] My own modified scripts
### Tasks
- [ ] One of t…
-
Hey, thanks for the great project. Very excited about using it.
When doing the post-install I noticed that some internal torch distributed code seems to be patched and I was wondering what was the …
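One way to investigate this locally while waiting for an answer: check where a suspect function's code actually lives. A small sketch (which function is patched is an assumption here; `all_gather` is just an example):
```python
# If a function has been monkey-patched, its source file usually points
# outside the installed torch tree.
import inspect
import torch.distributed as dist

fn = inspect.unwrap(dist.all_gather)   # strip decorator wrappers, if any
print(fn.__module__)                   # module the function reports
print(inspect.getsourcefile(fn))       # file the code actually lives in
```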
-
Hi, I'm wondering how I should be thinking of the mixed precision policies of these three packages together. My plugin is below. It works, but I don't think we're doing things right with the mixed_pre…
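The plugin itself is cut off above, but one common arrangement is to give FSDP an explicit `MixedPrecision` policy through Accelerate and keep the `Accelerator`'s own `mixed_precision` setting consistent with it. A sketch, not the poster's actual config:
```python
# Run under `accelerate launch` with FSDP enabled; the dtypes are illustrative.
import torch
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import MixedPrecision

fsdp_plugin = FullyShardedDataParallelPlugin(
    mixed_precision_policy=MixedPrecision(
        param_dtype=torch.bfloat16,   # compute dtype for gathered params
        reduce_dtype=torch.bfloat16,  # dtype for gradient reduction
        buffer_dtype=torch.bfloat16,  # dtype for module buffers
    ),
)
accelerator = Accelerator(mixed_precision="bf16", fsdp_plugin=fsdp_plugin)
```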
-
Hi,
Does the Shampoo implementation support HuggingFace's Accelerate library?
Can it be used in:
`model, optimizer, scheduler = accelerator.prepare(model, optimizer, scheduler)` ?
Thanks!
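Nothing above confirms support either way, but since `DistributedShampoo` subclasses `torch.optim.Optimizer`, `accelerator.prepare` will at least accept it. A smoke-test sketch (constructor arguments follow Meta's optimizers README from memory and may differ by version):
```python
import torch
from accelerate import Accelerator
from distributed_shampoo import AdamGraftingConfig, DistributedShampoo

model = torch.nn.Linear(512, 512)
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    epsilon=1e-12,
    max_preconditioner_dim=8192,
    precondition_frequency=100,
    grafting_config=AdamGraftingConfig(beta2=0.999, epsilon=1e-8),
)

accelerator = Accelerator()
# prepare() wraps any torch.optim.Optimizer; whether the preconditioners
# behave correctly under DDP/FSDP is the open question being asked here.
model, optimizer = accelerator.prepare(model, optimizer)
```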
-
Hi all, first of all, thanks for your great work!
I have an issue when trying to use the optimizer with FSDP training.
The error is:
```
optimizer = DistributedShampoo(
  File "/root/slurm/src/opti…
```
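The traceback is cut off, so this is only a guess: Shampoo's FSDP path needs per-parameter metadata compiled from the already-wrapped model. A sketch following Meta's optimizers README (import paths from memory and may differ by version):
```python
# Assumes init_process_group has already run (e.g. under torchrun).
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from distributed_shampoo import (
    DistributedShampoo,
    FSDPShampooConfig,
    compile_fsdp_parameter_metadata,
)

model = FSDP(torch.nn.Linear(512, 512).cuda())  # placeholder module

# FSDP flattens parameters, so Shampoo needs metadata from the wrapped
# model to recover per-parameter shapes for its preconditioners.
optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,
    distributed_config=FSDPShampooConfig(
        param_to_metadata=compile_fsdp_parameter_metadata(model),
    ),
)
```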
-
### Reminder
- [x] I have read the README and searched the existing issues.
### Reproduction
Is LLaMA-Factory capable of the FSDP QDoRA method described here:
https://www.answer.ai/posts/2024-04-26-fsdp-qdor…
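Whether LLaMA-Factory wires this up is not answered above; as I read the answer.ai recipe, the underlying ingredients are 4-bit quantization with FSDP-shardable quant storage plus DoRA-enabled LoRA adapters. A sketch using transformers/peft directly (not LLaMA-Factory's config; the model id is a placeholder):
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,  # lets FSDP shard the quantized weights
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=bnb,
    torch_dtype=torch.bfloat16,
)

# use_dora=True upgrades the LoRA adapters to DoRA (weight-decomposed LoRA).
model = get_peft_model(model, LoraConfig(r=16, lora_alpha=32, use_dora=True))
```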
-
### 🚀 The feature, motivation and pitch
Fine-tuning with FSDP alone works well, and sharded checkpoints are saved as `__0_*.distcp`, `.metadata`, and `train_params.yaml`. I can see the loss drop reas…
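The `.distcp` files are torch distributed-checkpoint shards; assuming that format, they can be consolidated into a single `torch.save`-style file with a converter that ships in PyTorch >= 2.2 (the paths below are placeholders):
```python
from torch.distributed.checkpoint.format_utils import dcp_to_torch_save

# Reads the sharded __0_*.distcp / .metadata directory, writes one .pt file.
dcp_to_torch_save("path/to/checkpoint_dir", "consolidated_model.pt")
```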