-
### 🐛 Describe the bug
I'm trying to use the DCP API, but models that directly own `nn.Parameter`s fail when more than one device is involved.
The following code will fail with
```raise…
-
### 🐛 Describe the bug
FSDP
- [ ] FSDP, autocast (MosaicML Diffusers) https://github.com/pytorch/pytorch/issues/110797
- **Error raised:** aot_autograd, r.grad = self.meta_tensor
- [ ] FSDP,…
-
### Context
Today, `FullyShardedDataParallel` (FSDP) supports meta device initialization via two paths, where the precondition is that the `module` passed to FSDP has some parameter on meta device:
…
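The meta-device precondition described above can be illustrated with a short sketch. This is plain PyTorch with no process group required; the actual FSDP wrapping call is only indicated in a comment, since it needs an initialized distributed environment:

```python
import torch
import torch.nn as nn

# Construct a module directly on the meta device: its parameters are
# shape/dtype placeholders with no storage allocated.
with torch.device("meta"):
    module = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 2))

# FSDP's precondition: at least one parameter lives on the meta device.
has_meta_param = any(p.is_meta for p in module.parameters())
print(has_meta_param)  # True

# FSDP would then materialize the sharded parameters on the real device,
# roughly (assumed usage; requires an initialized process group):
# fsdp_module = FSDP(module, device_id=torch.cuda.current_device())
```

Because meta tensors carry no data, this lets very large models be described before any real memory is committed, which is the point of the two initialization paths.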
-
Oftentimes, one wants to combine a more general `n`-way data parallelism with `m`-way model parallelism, as helpfully explained in the official JAX [docs](https://jax.readthedocs.io/en/latest/notebooks/Distribu…
-
### 🐛 Describe the bug
When running the test suite of PyTorch 1.12.1, I get failures such as:
```
distributed/fsdp/test_fsdp_input failed!
distributed/fsdp/test_fsdp_mixed_precision failed!
```
Tracing …
-
https://github.com/pytorch/pytorch/blob/935f6977542affc0d16c66333a13d60dae6aa5fa/torch/distributed/fsdp/wrap.py#L561
When calling the FSDP class recursively, `ignored_param` (which is supposed to be pa…
-
### 🐛 Describe the bug
When we ignore modules with any trainable parameters in FSDP, an error occurs when we try to continue training after loading a distributed checkpoint for the optimizer.
…
-
We are training text_to_image on Google Cloud Platform. The JupyterLab instance has 2 GPUs (NVIDIA Tesla P100) with 32 GB of total memory (16 GB each). I tried using accelerate for training the text_t…
-
This issue is to track a few follow-ups regarding `ignored_modules`.
1. Users may want to ignore specific parameters or buffers within a module. How should we modify the API to accommodate this?
2…
-
Similar to DDP, we can add an FSDP logging data API to expose FSDP internal states, performance metrics, and meta information.
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen …
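For reference, the DDP counterpart being alluded to is the private `_get_ddp_logging_data()` method on `DistributedDataParallel`. A minimal sketch of its use, assuming a single-process gloo group purely for illustration (an FSDP API would presumably look similar, but none of the FSDP-side names here are settled):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process gloo group, just enough to construct a DDP module on CPU.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29501")
dist.init_process_group("gloo", rank=0, world_size=1)

model = DDP(torch.nn.Linear(4, 2))

# Private DDP API exposing internal state and performance counters;
# the proposed FSDP API would expose analogous data for FSDP internals.
logging_data = model._get_ddp_logging_data()
print(type(logging_data))

dist.destroy_process_group()
```

Mirroring this shape for FSDP would keep the two data-parallel wrappers' observability stories consistent.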