-
Hello,
We observed an issue where a broadcast tensor created from a single-dimension parameter is marked as sharded by the XLA sharding propagator. This sharded tensor, while doing computation with another tensor which ha…
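For illustration only, here is a minimal JAX analogue of the described pattern (the names `scale` and `f` are made up, and this assumes the `jax.sharding` API rather than whatever framework the original report used): it broadcasts a sharded 1-D parameter against a 2-D tensor and prints the sharding the propagator assigns to the result.
```python
import numpy as np
import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 1-D mesh over all devices; the 1-D parameter is sharded along it.
# Assumes the parameter length (8) is divisible by the device count.
mesh = Mesh(np.array(jax.devices()), ('x',))
scale = jax.device_put(jnp.ones((8,)), NamedSharding(mesh, P('x')))

@jax.jit
def f(s, a):
    # Broadcast the sharded 1-D parameter against a replicated 2-D
    # tensor; the compiler's sharding propagator picks the output layout.
    return s[None, :] * a

out = f(scale, jnp.ones((4, 8)))
print(out.sharding)  # inspect what the propagator chose
```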
-
### Proposal to improve performance
On the current main:
```shell
$ python benchmarks/benchmark_throughput.py --input-len 100 --output-len 100 --num-prompts 100 --model facebook/opt-125m -tp 2 …
```
-
Imagine the following scenario:
* I'm running an MPI-based SPMD programming model that owns main(), was launched by `mpirun`, has called MPI_Init(), etc.
* I want to create a Chapel library that ca…
-
I would like to ask for an example of initializing and updating a model using nnx.jit with SPMD.
Is there any relevant example?
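Not an official snippet, but a minimal data-parallel sketch of the pattern, assuming the Flax NNX API (`nnx.jit`, `nnx.value_and_grad`, `nnx.state`/`nnx.update`) together with `jax.sharding`; the MLP, the mesh axis name `data`, and the plain-SGD update are illustrative choices, not taken from any referenced code:
```python
import numpy as np
import jax
import jax.numpy as jnp
from flax import nnx
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

class MLP(nnx.Module):
    def __init__(self, din, dmid, dout, *, rngs: nnx.Rngs):
        self.fc1 = nnx.Linear(din, dmid, rngs=rngs)
        self.fc2 = nnx.Linear(dmid, dout, rngs=rngs)

    def __call__(self, x):
        return self.fc2(nnx.relu(self.fc1(x)))

# 1-D device mesh for data-parallel SPMD; params stay replicated.
mesh = Mesh(np.array(jax.devices()), ('data',))
batch_sharding = NamedSharding(mesh, P('data'))

model = MLP(8, 32, 1, rngs=nnx.Rngs(0))

@nnx.jit
def train_step(model, x, y):
    def loss_fn(m):
        return jnp.mean((m(x) - y) ** 2)

    loss, grads = nnx.value_and_grad(loss_fn)(model)
    # Plain SGD, applied in place to the module's Param state.
    params = nnx.state(model, nnx.Param)
    params = jax.tree.map(lambda p, g: p - 1e-2 * g, params, grads)
    nnx.update(model, params)
    return loss

# Shard the batch across devices; batch size must divide evenly.
x = jax.device_put(jnp.ones((16, 8)), batch_sharding)
y = jax.device_put(jnp.zeros((16, 1)), batch_sharding)
print(train_step(model, x, y))
```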
-
### Motivation.
**TL;DR**: Introduce an SPMD-style control plane to simplify the control-plane architecture and improve performance.
For distributed inference, vLLM currently leverages a “driver-worker”,…
-
I found that XLA's auto-sharding is based on the Alpa paper (https://arxiv.org/abs/2201.12023), which proposed an algorithm for both inter-operator and intra-operator parallelism. However, it appears that it implements onl…
-
### Description
### **Concept introduction**
Because SPMD has no scheduling overhead, it gives the best performance, but it is often not easy to develop complex training tasks with it. For exa…
-
## ❓ Questions and Help
In SPMD mode, if we run the training command for a model on all the VMs together (single program, multiple machines), each VM has its own dataloader using its CPU cores.
Then, wh…
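As an illustration of the per-host pattern the question describes, here is a minimal sketch assuming JAX's multi-process API (`jax.process_index`/`jax.process_count`); the toy array stands in for a real dataset:
```python
import numpy as np
import jax

# In SPMD every VM runs the same program; give each host a disjoint
# slice of the dataset, indexed by its process id, so no two hosts
# load the same examples.
full = np.arange(1024)                 # stand-in for the real dataset
num_hosts = jax.process_count()
host_id = jax.process_index()
shard = np.array_split(full, num_hosts)[host_id]
print(f"host {host_id}/{num_hosts} loads {len(shard)} examples")
```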
-
### Description
By combining sufficiently exciting capture + sharding + mapping behavior, it is possible to induce JAX's batching to witness inconsistent sizes for the batch axis. The following code s…
-
**Describe the bug**
Describe the bug and how to reproduce it, preferably with screenshots.
**Your hardware and system info**
Write your system info like CUDA version/system/GPU/torc…