-
Hi,
Thanks for the amazing repo!
I want to use MoE but cannot find an example. Is it possible to provide a tutorial/example showing how to use MoE? For example, how to define main, training …
-
### Bug description
When running multi-node/multi-GPU training with a different number of GPUs on each node, the `Fabric` `ddp` and `fsdp` strategies will have an incorrect `num_replicas` in `distributed_sampler_kwar…
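For context, a minimal sketch of the kind of setup being described; the launch configuration, the device counts, and the `fabric.strategy.distributed_sampler_kwargs` access are assumptions for illustration, not a verified repro:

```
# Hypothetical two-node setup, e.g. 4 GPUs on node 0 and 2 GPUs on node 1,
# launched separately on each node (exact launcher is an assumption).
import torch
import lightning as L
from torch.utils.data import DataLoader, TensorDataset

fabric = L.Fabric(accelerator="cuda", strategy="ddp", devices="auto", num_nodes=2)
fabric.launch()

# The world size here would be 6, but the sampler kwargs are derived from
# num_nodes * local device count, which differs between the two nodes.
print(fabric.strategy.distributed_sampler_kwargs)

dataset = TensorDataset(torch.arange(100).float())
# setup_dataloaders() injects a DistributedSampler built from those kwargs,
# so ranks can end up with inconsistent shards of the data.
loader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=4))
```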
-
Hi, I'm trying to train a fully sharded transformer. At the beginning, I started training the model with use_shard_state=False, but it failed when trying to save the checkpoint, since there are several f…
-
## ❓ Questions and Help
[Distributed support of DeepSpeed on XLA] Hello, does DeepSpeed support distributed training on XLA? If not, could you add support for this?
-
Here is a list of flaky tests that we should fix in our next fix-a-thon.
- test_shared_weight_mevo[optim_state-flat]
- test_regnet[pytorch-flatten-mixed]
- test_shared_weight_mevo[train-none]
- …
-
### 🚀 The feature, motivation and pitch
The DDP gradient bucket always resides in GPU HBM, and its size equals the sum of the sizes of all of the module's weight gradients.
During the forward stage and the optimizer stage, this memory is wast…
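To illustrate the claim, a rough measurement sketch (not from the issue; the model size and the use of `memory_allocated()` as a proxy are assumptions):

```
# Sketch: DDP's Reducer allocates flattened gradient buckets when the model is
# wrapped, and they stay resident in GPU memory across forward and optimizer steps.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(4096, 4096).cuda()
before = torch.cuda.memory_allocated()
ddp_model = DDP(model)  # bucket buffers (~ total gradient size) are allocated here
after = torch.cuda.memory_allocated()
print(f"extra memory after DDP wrap: {(after - before) / 2**20:.1f} MiB")
```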
-
Repost from Slack as requested:
First, my environment, in case it's relevant: Jamf Cloud with a secondary local SMB FSDP (file share distribution point). The MacBook running AutoPkg is an M1 running Ventura.
I'm getting [Errno 2] …
-
With FSDP, the code currently cannot be run.
If you try to add model compilation to the [training](https://github.com/Lightning-AI/lit-llama/blob/main/train.py) script like:
```
...
fabric = L.Fabric…
```
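For reference, a minimal sketch of the kind of change being described; the strategy arguments, the model construction, and the point where `torch.compile` is applied are assumptions rather than the exact code from `train.py`:

```
import lightning as L
import torch
from functools import partial
from lightning.fabric.strategies import FSDPStrategy
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from lit_llama.model import Block, LLaMA, LLaMAConfig

# Wrap each transformer Block separately under FSDP (assumed policy).
wrap_policy = partial(transformer_auto_wrap_policy, transformer_layer_cls={Block})
fabric = L.Fabric(accelerator="cuda", devices=8,
                  strategy=FSDPStrategy(auto_wrap_policy=wrap_policy))
fabric.launch()

with fabric.device:
    model = LLaMA(LLaMAConfig.from_name("7B"))

# Compiling the model is the added step that triggers the failure with FSDP.
model = torch.compile(model)
model = fabric.setup(model)
```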
-
### 🐛 Describe the bug
I was trying to use torch.compile + FSDP + a Hugging Face transformer. I was able to make it work on one GPU; however, on 8 A100 GPUs, I ran into the following errors. I made a re…
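A minimal sketch of the combination being described, assuming a standard `torchrun` launch, an arbitrary Hugging Face model, and default FSDP wrapping (none of these details are confirmed by the report):

```
# Launch (assumed): torchrun --nproc_per_node=8 repro.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from transformers import AutoModelForCausalLM

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = AutoModelForCausalLM.from_pretrained("gpt2").cuda()
model = FSDP(model, device_id=torch.cuda.current_device())
model = torch.compile(model)  # fine on a single GPU, errors on 8 GPUs per the report

batch = torch.randint(0, 50257, (2, 128), device="cuda")
out = model(input_ids=batch, labels=batch)
out.loss.backward()
```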
-
### 🚀 The feature, motivation and pitch
FSDP optimizer checkpoint loading expects params to be keyed by FQN, but DDP saves checkpoints with param IDs.
FSDP does provide `rekey_optim_state_dict` to…
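For reference, a sketch of the conversion path that `rekey_optim_state_dict` enables, i.e. turning a DDP-saved, param-ID-keyed optimizer state dict into an FQN-keyed one; the checkpoint path and the `build_model()` helper are placeholders:

```
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, OptimStateKeyType

model = build_model()  # hypothetical factory for the unwrapped model
ddp_osd = torch.load("ddp_ckpt.pt")["optimizer"]  # optimizer state keyed by param IDs

# Rekey to fully qualified parameter names (FQNs), the keying FSDP expects.
fqn_osd = FSDP.rekey_optim_state_dict(ddp_osd, OptimStateKeyType.PARAM_NAME, model)

# The FQN-keyed dict can then be sharded for the FSDP-wrapped model,
# e.g. via FSDP.shard_full_optim_state_dict, and loaded into the optimizer.
```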