-
**Describe the bug**
There is a misalignment of volumes being provisioned in multi-AZ clusters. This causes volsync job pods to be unschedulable.
On my non-multi-AZ cluster, volsync pods are …
-
Hi @AkihiroSuda ! :wave:
I want to introduce you to @lisejolicoeur, who has joined our team this summer (with @milroy) to work specifically on Usernetes networking! We are opening this issue to sh…
-
## Use Case
Make managing multi-node buses easier, for example:
1. image latent and VAE -- a 2-channel reroute bus;
2. positive and negative prompt -- a 2-channel (or 4-channel with L and G variants…
-
It is not clear from the documentation or the sample code whether forecast generation can be performed on a single GPU, multiple GPUs, or multiple GPUs across multiple nodes. If this is the case, please add som…
-
### Check before submitting issues
- [X] Make sure to pull the latest code, as some issues and bugs have been fixed.
- [X] I have read the [Wiki](https://github.com/ymcui/Chinese-LLaMA-Alpaca-3/wiki)…
-
Hello there,
First, I would like to extend my many thanks to you for setting up this amazing repo!
I'm currently working on a project with the aim of releasing the largest clean Arabic text datase…
-
@awaelchli I found that in `pretrain.py`, the accumulation steps are calculated from the global batch size, the number of devices, and the micro batch size.
This works fine in a single-node setting, e.g. glo…
-
Hi.
I have been running NCCL_TESTS in a multi-node, multi-GPU environment with NCCL 2.19.3-1 and OpenMPI 4.1.6. Each node has 4 NVIDIA V100 GPUs interconnected with NVLink and PCIe.
1. How is th…
-
### 📚 Describe the documentation issue
Currently, [training_benchmark_xpu.py](https://github.com/pyg-team/pytorch_geometric/blob/master/benchmark/multi_gpu/training/training_benchmark_xpu.py) only su…
-
### System Info
NCCL version 2.19.3+cuda12.0
TensorRT-LLM version: 0.11.0.dev2024052100
Ubuntu 22.04
### Who can help?
@byshiue
### Information
- [X] The official example scripts
- [ ] My o…