-
Ray NCCL collectives fail allreduce on multi-GPU AWS g5 nodes because of an issue with how the node exposes topology information. The workaround is to apply `NCCL_P2P_DISABLE=1`, but this negatively i…
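A minimal sketch of the workaround, assuming the variable is exported in the environment of every worker process before NCCL initializes its communicators (NCCL reads it at init time, so setting it afterwards has no effect):

```shell
# Workaround sketch: disable NCCL's peer-to-peer (P2P) transport.
# Must be set in each worker's environment before the first collective runs.
export NCCL_P2P_DISABLE=1

# Confirm the setting is visible to child processes.
echo "NCCL_P2P_DISABLE=${NCCL_P2P_DISABLE}"
```

Note that disabling P2P forces traffic through shared-memory or network paths, which is why it can hurt bandwidth.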
-
We have GPU cluster nodes, each with 8 × H100 GPUs and 4 × 400Gb RoCE NICs. I ran the NCCL tests on this cluster with the same nodes, but I found that the tree bus bandwidth (150 GB/s) is slower than the ring bus bandwidth (190 GB/s). From my…
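For context on where such numbers come from: nccl-tests derives the reported "bus bandwidth" from the measured algorithm bandwidth, scaling allreduce by 2(n−1)/n to normalize for the data each rank sends and receives. A small sketch of that conversion (the function name is mine):

```python
def allreduce_bus_bw(algbw_gb_s: float, n_ranks: int) -> float:
    """Convert allreduce algorithm bandwidth (GB/s) to bus bandwidth.

    nccl-tests reports busbw = algbw * 2*(n-1)/n for allreduce, so bus
    bandwidth is comparable across collectives and rank counts.
    """
    return algbw_gb_s * 2 * (n_ranks - 1) / n_ranks

# On an 8-GPU node the scaling factor is 2*7/8 = 1.75.
print(allreduce_bus_bw(100.0, 8))  # 175.0
```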
-
### 🔎 Search before asking
- [X] I have searched the PaddleOCR [Docs](https://paddlepaddle.github.io/PaddleOCR/) and found no similar bug report.
- [X] I have searched the PaddleOCR [Issues](https://…
-
### What happened + What you expected to happen
[Microbenchmark](https://github.com/ray-project/ray/blob/master/python/ray/_private/ray_experimental_perf.py#L150) results for a single-actor acceler…
-
## 🚀 Feature
Make streams used for NCCL operations configurable
## Motivation
I've noticed that PyTorch distributed module has introduced P2P send and receive functionality via NCCL (which is…
-
Hi, we recently observed that when running with `NCCL_ALGO=Tree,NCCL_PROTO=Simple`, NCCL falls back to Ring,LL for broadcast. It seems NCCL_PROTO is ignored when there is no ALGO/PROTO pair found fo…
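To observe which algorithm/protocol pair NCCL actually selects, the usual approach is to force the pair via environment variables and enable debug logging; a sketch (set before launching the job):

```shell
# Force a specific algorithm/protocol pair and log NCCL's choices.
# If no valid ALGO/PROTO combination exists for a collective, NCCL
# falls back silently; NCCL_DEBUG=INFO makes the actual selection visible.
export NCCL_ALGO=Tree
export NCCL_PROTO=Simple
export NCCL_DEBUG=INFO

echo "NCCL_ALGO=${NCCL_ALGO} NCCL_PROTO=${NCCL_PROTO}"
```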
-
06/26 11:07:50 - mmengine - INFO -
------------------------------------------------------------
System environment:
sys.platform: linux
Python: 3.10.12 (main, Nov 20 2023, 15:14:05)…
-
### System Info
Following [Philipp Schmid's blog post on running FSDP + QLoRA in SageMaker](https://www.philschmid.de/sagemaker-train-deploy-llama3)
* Training script is the [default one](https://github…
-
`$ python train_ddgan.py --dataset cifar10 --exp ddgan_cifar10_exp1 --num_channels 3 --num_channels_dae 128 --num_timesteps 4 --num_res_blocks 2 --batch_size 64 --num_epoch 1800 --ngf 64 --nz 100 --z_…
-
**Describe the bug**
Running the Pythia-7B fine-tuning script on 4 × A10 (24GB each).
It seems like an issue with the sequence length:
```
Token indices sequence length is longer than the specified maximum seque…