-
Ray NCCL collectives fail allreduce on multi-GPU AWS g5 nodes because of an issue with how the node exposes topology information. The workaround is to apply `NCCL_P2P_DISABLE=1`, but this negatively i…
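A minimal sketch of the workaround, assuming the variable is exported in the environment of every worker process before NCCL initializes its communicators (NCCL reads it at init time, so setting it afterwards has no effect):

```shell
# Workaround sketch: disable NCCL's peer-to-peer (P2P) transport.
# Must be set in each worker's environment before the first collective runs.
export NCCL_P2P_DISABLE=1

# Confirm the setting is visible to child processes.
echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE}"
```

Note that disabling P2P forces traffic through shared-memory or network paths, which is why it can hurt bandwidth.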
-
We have GPU cluster nodes, each with 8 × H100 GPUs and 4 × 400Gb RoCE NICs. I ran the NCCL tests on this cluster with the same nodes, but I found that the tree bus bandwidth (150 GB/s) is slower than the ring bus bandwidth (190 GB/s). From my…
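For context on where such numbers come from: nccl-tests derives the reported "bus bandwidth" from the measured algorithm bandwidth, scaling allreduce by 2(n−1)/n to normalize for the data each rank sends and receives. A small sketch of that conversion (the function name is mine):

```python
def allreduce_bus_bw(algbw_gb_s: float, n_ranks: int) -> float:
    """Convert allreduce algorithm bandwidth (GB/s) to bus bandwidth.

    nccl-tests reports busbw = algbw * 2*(n-1)/n for allreduce, so bus
    bandwidth is comparable across collectives and rank counts.
    """
    return algbw_gb_s * 2 * (n_ranks - 1) / n_ranks

# On an 8-GPU node the scaling factor is 2*7/8 = 1.75.
print(allreduce_bus_bw(100.0, 8))  # 175.0
```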
-
### 🔎 Search before asking
- [X] I have searched the PaddleOCR [Docs](https://paddlepaddle.github.io/PaddleOCR/) and found no similar bug report.
- [X] I have searched the PaddleOCR [Issues](https://…
-
### What happened + What you expected to happen
[Microbenchmark](https://github.com/ray-project/ray/blob/master/python/ray/_private/ray_experimental_perf.py#L150) results for a single-actor acceler…
-
## 🚀 Feature
Make streams used for NCCL operations configurable
## Motivation
I've noticed that PyTorch distributed module has introduced P2P send and receive functionality via NCCL (which is…
-
Hi, we recently observed that when running with `NCCL_ALGO=Tree,NCCL_PROTO=Simple`, NCCL falls back to Ring,LL for broadcast. It seems NCCL_PROTO is ignored when there is no ALGO/PROTO pair found fo…
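To observe which algorithm/protocol pair NCCL actually selects, the usual approach is to force the pair via environment variables and enable debug logging; a sketch (set before launching the job):

```shell
# Force a specific algorithm/protocol pair and log NCCL's choices.
# If no valid ALGO/PROTO combination exists for a collective, NCCL
# falls back silently; NCCL_DEBUG=INFO makes the actual selection visible.
export NCCL_ALGO=Tree
export NCCL_PROTO=Simple
export NCCL_DEBUG=INFO

echo "NCCL_ALGO=${NCCL_ALGO} NCCL_PROTO=${NCCL_PROTO}"
```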
-
06/26 11:07:50 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05)…
-
### System Info
Following [Philipp Schmid's blog post on running FSDP + QLoRA in SageMaker](https://www.philschmid.de/sagemaker-train-deploy-llama3)
* Training script is the [default one](https://github…
-
`$ python train_ddgan.py --dataset cifar10 --exp ddgan_cifar10_exp1 --num_channels 3 --num_channels_dae 128 --num_timesteps 4 --num_res_blocks 2 --batch_size 64 --num_epoch 1800 --ngf 64 --nz 100 --z_…
-
**Describe the bug**
Running the Pythia-7B fine-tuning script on 4 × A10 (24GB each).
It seems like an issue with the sequence length:
```
Token indices sequence length is longer than the specified maximum seque…