-
## 🐛 Bug
Gemma-7b with FSDP zero3 trained on 2 nodes with 8 H100 GPUs each gives an OOM error at BS = 2 for both `thunder_cudnn` and `thunder_inductor_cat_cudnn`. The same configuration works for `inducto…
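For context, a minimal sketch of the zero3-style (FULL_SHARD) FSDP wrapping this benchmark refers to; the placeholder model and the `torchrun` launch are assumptions, not the actual benchmark harness:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes launch via torchrun across 2 nodes x 8 GPUs; the tiny Linear is a
# stand-in for Gemma-7b.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = torch.nn.Linear(1024, 1024).cuda()
model = FSDP(model, sharding_strategy=ShardingStrategy.FULL_SHARD)  # zero3 equivalent
```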
-
## 🐛 Bug
When benchmarking the model 'Mixtral-8x7B-v0.1', we get OOM errors even with `--checkpoint_activations True`.
The same configuration works for torch.compile.
Might be related to [https://gi…
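For reference, a hedged sketch of what a `--checkpoint_activations`-style flag typically enables, using PyTorch's activation-checkpointing wrapper; the `Block` module is a hypothetical stand-in for a Mixtral decoder layer:

```python
import torch
import torch.nn as nn
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    apply_activation_checkpointing,
    checkpoint_wrapper,
)

class Block(nn.Module):  # hypothetical stand-in for a decoder layer
    def __init__(self):
        super().__init__()
        self.ff = nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.ff(x))

model = nn.Sequential(Block(), Block())
# Recompute each Block's activations during backward instead of storing
# them, trading extra compute for lower peak memory.
apply_activation_checkpointing(
    model,
    checkpoint_wrapper_fn=checkpoint_wrapper,
    check_fn=lambda m: isinstance(m, Block),
)
```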
-
The FairScale FullyShardedDataParallel (FSDP) API supports large-model training and is being quickly adopted by internal and external users. The long-term goal of upstreaming the API to PyTorch is to rele…
-
### Willingness to contribute
No. I cannot contribute this feature at this time.
### Proposal Summary
This feature request proposes to add support for logging FullyShardedDataParallel models …
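As a rough illustration of what such support might wrap, a hedged sketch that gathers the full (unsharded) state dict on rank 0 before persisting it; `model` is assumed to be an FSDP-wrapped module, and the `torch.save` call stands in for whatever logging API the proposal targets:

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

# Gather the full state dict on rank 0 only, offloaded to CPU, then persist
# it from that rank. Process group assumed already initialized.
cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    state = model.state_dict()
if dist.get_rank() == 0:
    torch.save(state, "model_full.pt")  # placeholder for the logging call
```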
-
My understanding is that FSDP does not shard the model buffers; as a result, unlike parameters, which would be freed and go back to their sharded state after state_dict()/summon_full_params(), this …
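A small sketch of the behavior described, assuming `model` is an FSDP-wrapped module in an initialized process group:

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Parameters are materialized only inside the context and are freed back to
# their sharded state on exit; buffers are never sharded, so their size is
# the same before, during, and after.
with FSDP.summon_full_params(model):
    full_numel = sum(p.numel() for p in model.parameters())
sharded_numel = sum(p.numel() for p in model.parameters())  # < full_numel
buffer_numel = sum(b.numel() for b in model.buffers())      # unchanged
```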
-
It's not obvious how one should instantiate an optimizer with parameter groups after instantiating `FSDP`.
The change in the linked PR #538 breaks the unit tests.
The examples/docs should either denote that …
-
While working on a model with FSDP wrapping, I ran into an illegal memory access crash that went away with `flatten=False`. I will be debugging it.
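For reference, a minimal sketch of the workaround, assuming FairScale's FSDP (where the constructor flag is `flatten_parameters`) and an initialized process group:

```python
import torch.nn as nn
from fairscale.nn.data_parallel import FullyShardedDataParallel as FSDP

# flatten_parameters=False keeps each parameter as its own shard instead of
# one flat buffer; this is the setting that made the crash go away.
model = FSDP(nn.Linear(16, 16), flatten_parameters=False)
```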
-
## 🚀 Feature
FSDP should offer the possibility to compute the norms of the weights and the norms of the gradients on the fly, while the weights/gradients are available, with an option like `compute_weight_…
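Until such an option exists, a hedged sketch of doing it by hand for gradients: under FULL_SHARD each rank holds a disjoint shard, so the squared local norms can simply be summed across ranks. The helper name is hypothetical:

```python
import torch
import torch.distributed as dist

def global_grad_norm(model):  # hypothetical helper
    # Reduce over this rank's local shards, sum the squared norms across the
    # process group (valid because shards are disjoint under FULL_SHARD),
    # then take the square root.
    local_sq = torch.zeros(1, device="cuda")
    for p in model.parameters():
        if p.grad is not None:
            local_sq += p.grad.float().pow(2).sum()
    dist.all_reduce(local_sq, op=dist.ReduceOp.SUM)
    return local_sq.sqrt().item()
```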
-
We are training text_to_image on Google Cloud Platform; the JupyterLab instance has 2 GPUs (NVIDIA Tesla P100) with 32 GB of memory in total (16 GB each). I tried using accelerate for training the text_t…
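For what it's worth, a minimal multi-GPU sketch with 🤗 Accelerate, launched via `accelerate launch script.py`; the tiny model and random data are placeholders for the actual text_to_image pipeline:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up both GPUs from the launch config
model = torch.nn.Linear(32, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loader = torch.utils.data.DataLoader(torch.randn(64, 32), batch_size=8)
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for batch in loader:
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()  # placeholder loss
    accelerator.backward(loss)
    optimizer.step()
```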
-
### 🐛 Describe the bug
Hi,
When using `torch.compile` and `torch._dynamo.compiled_autograd` to trace the FSDP model with the backward gradient hooks, the following error occurred. According to t…
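A hedged single-GPU reduction of the pattern being exercised; the FSDP wrapping and the hooks themselves are omitted since the report is truncated:

```python
import torch
import torch._dynamo.compiled_autograd as compiled_autograd

def compiler_fn(gm):
    # Compile the autograd graph captured by compiled_autograd.
    return torch.compile(gm, backend="inductor")

model = torch.compile(torch.nn.Linear(8, 8))
loss = model(torch.randn(4, 8)).sum()
with compiled_autograd.enable(compiler_fn):
    loss.backward()  # backward pass (incl. gradient hooks) is traced here
```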