nccl Search Results - Githubissues

1000+ results for nccl

vllm-project/vllm #9156

[Misc]: Segmentation Fault in vLLM API Server during Model I…

Anything you want to discuss about vllm: I'm experiencing a segmentation fault while running the vLLM API server with Ray for distributed inference. The issue seems to be related to NCCL initiali…

shreyasp-07 updated 1 month ago
1
rbalestr-lab/stable-SSL #49

[Feature Request] Remove submitit as a hard dependency for m…

Currently, the submitit launcher is used to spawn multiple processes. We want to use multiprocesses.spawn if the user doesn't want to use submitit.

vipulSharma18 updated 1 week ago
1
NVIDIA/nccl-tests #245

What does dma_buf do when gpuDirectRdma is disabled ?

Running an NCCL test on 2 nodes, with one A10G on each node and GDR disabled. Why do I see the following line in the logs: "DMA-BUF is available on GPU device 0"? Will DMA_BUF be used when GDR is disa…

Pavani-Panakanti updated 3 months ago
1
Azure/msccl #37

Cannot use msccl-tools' xml

Why wasn't the method I generated from the XML using msccl-tools invoked when I executed this command: mpirun --allow-run-as-root -np 8 -x LD_LIBRARY_PATH=/home/msccl-tool/msccl/executor/msccl-exe…

Eevan-zq updated 2 months ago
2
NVIDIA/nccl #1055

[Bug] NCCL all_reduce failed with A800 when NCCL_ALGO uses R…

**TL;DR:** **Set the environment variable NCCL_ALGO=Tree if you encounter accuracy problems with NCCL on A800 hardware.** …

zigzagcai updated 1 month ago
18
NVIDIA/Megatron-LM #1142

[QUESTION]NCCL timeout error when running the second iterati…

I use one machine with 4 GPUs to run GPT-3; the first iteration runs without any errors, but the second iteration fails: it gets stuck at the second step of the second iteration. The errors are as…

zmtttt updated 1 month ago
3
THUDM/GLM-130B #132

NCCL RuntimeError

After running successfully for several minutes, this error occurred: **RuntimeError: NCCL communicator was aborted on rank 2. Original reason for failure was: [Rank 2] Watchdog caught collect…

edwardelric1202 updated 1 year ago
1
NVIDIA/nccl #1473

Why tree algorithms are specifically targeted at All-Reduce?

I'm running nccl-test `all-reduce` between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, through reading the NCCL source code, I noticed tha…

jxh314 updated 1 month ago
1
NVIDIA/nccl #1506

allgather performance using NVLS is poor

I tested allgather using the NVLS algorithm and found that the performance is poor compared to allreduce using NVLS on H20 with 8 GPUs. The bandwidth of allgather using NVLS is only 300GB while …

telala updated 3 days ago
13
ray-project/ray #47258

[aDAG] More intuitive API for (NCCL) type hints

Description: The end user may have the impression that a type hint is applied to a DAG node, as opposed to the edge between DAG nodes/tasks. This might be partially due to the way we name t…

ruisearch42 updated 2 months ago
1
