-
## 🐛 Bug
Trying to test simple `xm.send` and `xm.recv` gives an error.
## To Reproduce
Steps to reproduce the behavior:
1. Run test code below
```python
import torch
import torch_xla.core.xla_model as xm
…
```
-
### 🚀 The feature, motivation and pitch
https://github.com/pytorch/pytorch/issues/75255 implemented the ability to ignore FSDP parameters at the module level, i.e. by passing in an `ignored_modules` list…
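For context, the existing module-level API looks roughly like the sketch below (module names and sizes are illustrative; a default process group is assumed to be initialized):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = nn.Linear(16, 16)  # flattened and sharded by FSDP
        self.head = nn.Linear(16, 4)    # excluded from FSDP below

    def forward(self, x):
        return self.head(self.trunk(x))

model = Model()
# Module-level ignoring: every parameter under `head` stays a plain
# nn.Parameter, unsharded and unmanaged by FSDP.
fsdp_model = FSDP(model, ignored_modules=[model.head])
```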
-
After the addition of Qwen2, we have multiple models using a transformer decoder class with head weights tied to embedding weights (Gemma and Qwen2). The class `TiedEmbeddingTransformerDecoder` is int…
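For readers unfamiliar with the pattern, head/embedding weight tying looks roughly like this (a minimal sketch, not the torchtune class itself; names are illustrative):

```python
import torch.nn as nn

class TinyTiedDecoder(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.tok_embeddings = nn.Embedding(vocab_size, dim)
        self.output = nn.Linear(dim, vocab_size, bias=False)
        # Tie the output projection to the embedding table: both modules
        # now share a single parameter, as in Gemma and Qwen2.
        self.output.weight = self.tok_embeddings.weight
```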
-
### 🚀 The feature, motivation and pitch
FSDP optimizer overlap, added in https://github.com/pytorch/pytorch/pull/98667, needs some follow-up work:
- We reallocate the `_cpu_grad` for CPU offload every iteration…
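For reference, CPU offload (the mode in which `_cpu_grad` is allocated) is enabled through FSDP's public config; a minimal sketch, assuming a default process group is already initialized:

```python
import torch.nn as nn
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

# With offload_params=True, parameters (and their gradients) are kept on
# CPU between uses; the `_cpu_grad` buffer mentioned above belongs to
# this offload path.
fsdp_model = FSDP(nn.Linear(8, 8), cpu_offload=CPUOffload(offload_params=True))
```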
-
### 🚀 The feature, motivation and pitch
For implementing things like Alibi, we need a tensor in our model that is the same on each rank, is small, and never changes. This is very hard to do in FSDP.
…
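One common workaround (a sketch of a general pattern, not necessarily what this issue proposes; the slope computation is illustrative) is to register the constant as a buffer, since FSDP shards parameters but leaves buffers replicated on every rank:

```python
import torch
import torch.nn as nn

class AlibiBias(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        # Illustrative ALiBi-style slopes: small, identical on every rank,
        # and never updated by the optimizer.
        slopes = torch.tensor(
            [2.0 ** (-8.0 * (i + 1) / num_heads) for i in range(num_heads)]
        )
        self.register_buffer("slopes", slopes, persistent=False)
```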
-
### 🐛 Describe the bug
When passing in a module as `ignored_modules`, should we also ensure FSDP does not initialize it via `to_empty` + `reset_parameters`? If the `ignored_modules` contract is that FSDP…
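For context, the initialization path in question is meta-device construction followed by materialization; a minimal sketch, assuming CUDA and an initialized process group (the `param_init_fn` shown is illustrative):

```python
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

with torch.device("meta"):
    model = nn.Linear(16, 16)

# By default, FSDP materializes meta-device modules via to_empty()
# followed by reset_parameters(); a custom param_init_fn replaces that
# default. The question above is whether modules listed in
# `ignored_modules` should be skipped by this materialization.
fsdp_model = FSDP(
    model,
    param_init_fn=lambda module: module.to_empty(device="cuda", recurse=False),
)
```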
-
The tutorial and documentation don't mention this part. I tried using torchrun to launch 4 processes and load Llama 2 with model parallelism, but it failed when combined with FSDP.
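For reference, a minimal multi-process FSDP setup looks like the sketch below (the script name and sizes are illustrative), launched with `torchrun --nproc_per_node=4 train.py`:

```python
# train.py: minimal FSDP setup (illustrative)
import os

import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

model = nn.Linear(128, 128).cuda()
fsdp_model = FSDP(model)
```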
-
https://github.com/pytorch/torchtitan/pull/161/files#diff-80b04fce2b861d9470c6160853441793678ca13904dae2a9b8b7145f29cd017aR269
IIRC @awgu mentioned there was an issue requiring this setting for…
-
Currently, the DTensor tensor subclass manages a `_local_tensor` attribute that represents the local tensor on the given rank. For efficient all-gather/reduce-scatter, we prefer to have a padded local…
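For background, the local shard is what `to_local()` returns; with uneven sharding, ranks hold differently sized shards unless padding is applied. A minimal sketch, assuming 4 ranks launched via torchrun (the tensor sizes are illustrative):

```python
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

mesh = init_device_mesh("cuda", (4,))
# 10 rows over 4 ranks shard unevenly, so `_local_tensor` differs in
# shape across ranks; padding every shard to the maximum size would make
# all-gather/reduce-scatter fixed-size collectives.
dtensor = distribute_tensor(torch.randn(10, 8), mesh, [Shard(0)])
local = dtensor.to_local()
```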
-
**Context**
To compose per-parameter-sharding FSDP with `DTensor`-based tensor parallelism, we need to reshard an existing `DTensor` to its parent mesh and include the FSDP dim-0 sharding.
The cur…
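For concreteness, the target layout combines the FSDP dim-0 shard with the existing TP shard on a parent 2-D mesh; a sketch of the mesh and placements involved, assuming 8 GPUs arranged as 2 FSDP ranks by 4 TP ranks (sizes illustrative):

```python
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# Parent 2-D mesh: dim 0 for FSDP (data parallel), dim 1 for TP.
mesh_2d = init_device_mesh("cuda", (2, 4), mesh_dim_names=("dp", "tp"))

# The combined layout shards dim 0 over "dp" (FSDP) and, e.g., dim 1
# over "tp"; resharding an existing TP-only DTensor onto the parent mesh
# has to produce placements like these.
weight = distribute_tensor(torch.randn(64, 64), mesh_2d, [Shard(0), Shard(1)])
```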