-
This issue proposes that the Chapel language / module code be adjusted to allow for certain SPMD idioms within forall loops. This proposal is intended to meet both the needs of programmers wishing to …
-
## ❓ Questions and Help
Hi, I received a loss of `None` when training the model. Can anyone help?
Here is a simple reproduction as a Kaggle notebook: [link](https://www.kaggle.com/code/liondude/notebook548442067d)
```
im…
-
I see there is a DeviceMesh abstraction in `spmd`: https://github.com/pytorch/PiPPy/blob/main/spmd/tensor/device_mesh.py
Can we use this abstraction as shared infrastructure? For example, `Pipeline…
-
## 🚀 Feature
Check that the graphs generated by PyTorch FSDP are SPMD.
## Motivation
We have encountered scenarios for distributed training where the graphs generated by PyTorch are not SPMD. I…
-
## ❓ Questions and Help
I'm running this official [script here](https://github.com/pytorch/xla/blob/master/test/test_train_mp_imagenet_fsdp.py), but I only see two xla devices being used, xla:0 and…
-
Co-authored with @SolitaryThinker @Yard1 @rkooo567
We are landing multi-step scheduling (#7000) to amortize scheduling overhead for better ITL and throughput. Since the first version of multi-step…
-
When compiling with C++23, the following errors are reported in cppspmd_sse.h:
```
In file included from /media/dezlow/Drive/Dev/C++/Oneiro/ThirdParty/KTX/lib/basisu/encoder/basisu_kernels_sse.cpp:…
-
I changed this line https://github.com/alpa-projects/alpa/blob/ea50a4328064a2a4eeae9101b65058a21ba112b8/tests/pipeline_parallel/test_mlp.py#L34
from `jax.grad` to `alpa.grad` and got this error:
```
WAR…
-
SPMD-zation is required for good performance and the copy constructor used for non-trivial types can cause us to miss out on it.
While SPMD-zation has conceptual limitations right now, the cases I'…
-
SPMD trains at normal speed with eight GPUs on a single machine, but the communication overhead increases rapidly when training across multiple machines.
Devices:
GPU: A100 * 8 * 2
spmd strategy …