-
Not sure if this is intended or avoidable, but if `dst` is inconsistent across ranks, `reduce` finishes, yet future kernels seem to hang. E.g.,
```py
import torch
torch.distributed.init_process_group('ncc…
```
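For reference, a minimal repro sketch of the mismatch (my own sketch, not the truncated original; two ranks launched with `torchrun`, tensor size arbitrary):
```py
import torch
import torch.distributed as dist

# Sketch: each rank passes a *different* `dst` to dist.reduce, violating
# the requirement that collective arguments match across ranks.
dist.init_process_group('nccl')
rank = dist.get_rank()
torch.cuda.set_device(rank)

t = torch.ones(4, device='cuda')
dist.reduce(t, dst=rank)   # dst=0 on rank 0, dst=1 on rank 1: mismatch

t += 1                     # subsequent CUDA work
dist.all_reduce(t)         # reported symptom: a later collective hangs
torch.cuda.synchronize()
```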
-
Hello,
I am curious about the use of GPUDirect RDMA across physical machines. My setup involves two physical machines, each equipped with one GPU and an RDMA-capable network card, which is not on…
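For context, a common way to verify whether GPUDirect RDMA is actually used is to run a small collective with NCCL debug logging and look for `GDRDMA` in the transport lines. A minimal sketch, assuming two single-GPU nodes and environment-variable initialization (addresses and port are placeholders):
```py
# Run on each machine with, e.g.:
#   MASTER_ADDR=<node0-ip> MASTER_PORT=29500 RANK=<0|1> WORLD_SIZE=2 \
#   NCCL_DEBUG=INFO python check_gdr.py
# With GPUDirect RDMA active, the INFO log typically shows a transport
# such as "via NET/IB/0/GDRDMA"; without it, plain "via NET/IB/0".
import torch
import torch.distributed as dist

dist.init_process_group('nccl')   # reads RANK/WORLD_SIZE/MASTER_* from env
torch.cuda.set_device(0)          # one GPU per machine in this setup

x = torch.ones(1 << 20, device='cuda')
dist.all_reduce(x)                # forces NCCL to choose a network transport
torch.cuda.synchronize()
print('all_reduce ok on rank', dist.get_rank())
dist.destroy_process_group()
```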
-
I am attempting to reproduce your results and was wondering if you have hit the following issue before. I am using a Lambda Labs 8xH100 SXM5 instance and run the following commands from a fresh instanc…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTor…
```
-
Hi, sjeaugey.
The performance of my test using NCCL + mpirun is not as expected. My environment is as follows: Node 1: H800*8, InfiniBand 200Gbps*4; Node 2: H800*8, InfiniBand 400Gbps*4; NCCL: 2.18.3, CUDA: 12.2, O…
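Not part of the original report, but as a quick cross-check outside nccl-tests, an all_reduce bandwidth probe can be run directly from PyTorch. A rough sketch, assuming a launch such as `torchrun --nproc_per_node=8` on each node (message size and iteration counts are arbitrary):
```py
import time
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
rank, world = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank % torch.cuda.device_count())

nbytes = 256 * 1024 * 1024
x = torch.empty(nbytes // 4, dtype=torch.float32, device='cuda')

for _ in range(5):          # warm-up, excludes connection setup cost
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
t0 = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
dt = (time.perf_counter() - t0) / iters

# Ring all_reduce bus bandwidth: 2*(n-1)/n * bytes / time (as in nccl-tests).
busbw = 2 * (world - 1) / world * nbytes / dt / 1e9
if rank == 0:
    print(f'avg {dt * 1e3:.2f} ms, busBW {busbw:.1f} GB/s')
```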
-
The code is as follows:
```
size_t bytePerChannel[/*collNetSupport*/2];
if (comm->channelSize > 0) {
  // Set by user
  bytePerChannel[/*collNetSupport=*/0] = comm->channelSize;
  byte…
```
-
Hi, the command I use is `NCCL_DEBUG=INFO mpirun -np 2 -hosts master,worker ./build/all_reduce_perf -b 4M -e 4M -f 2 -g 8`
It works when I run the MPI example, so I think the problem is related to my op…
-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version:…
```
-
Hi,
I have a CV training flow with sync batch normalization. However, the NCCL allgather in the **first** sync batch normalization of a forward pass is extremely long (>10s), while the NCCL all gath…
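One common cause of a slow *first* collective is NCCL's lazy communicator and connection setup, which is paid by whichever call runs first. A minimal warm-up sketch (a general workaround, not a confirmed diagnosis of this report):
```py
import torch
import torch.distributed as dist

dist.init_process_group('nccl')
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Issue one tiny collective up front so communicator setup happens here
# rather than inside the first SyncBatchNorm all_gather of the forward pass.
warm = torch.zeros(1, device='cuda')
dist.all_reduce(warm)
torch.cuda.synchronize()
# ... build the model, convert BN layers with
# torch.nn.SyncBatchNorm.convert_sync_batchnorm, wrap in DDP, train ...
```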
-
I am currently using Horovod for model training; the underlying gradient synchronization uses NCCL. Slow (straggler) nodes appear during the training process. Is there any…
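Not in the original post, but Horovod's built-in timeline is one way to identify stragglers: long `NEGOTIATE_ALLREDUCE` phases on a rank mean it reaches the collective late. A minimal sketch, assuming Horovod with PyTorch support:
```py
# Launch with, e.g.:
#   HOROVOD_TIMELINE=/tmp/timeline.json horovodrun -np 8 python probe.py
# then open /tmp/timeline.json in chrome://tracing.
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

x = torch.ones(1024, device='cuda')
y = hvd.allreduce(x, name='probe')   # named ops are easy to find in the timeline
print(f'rank {hvd.rank()} allreduce done')
```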