-
While looking at the contents of the [2.23.4 PyPI wheel](https://pypi.org/project/nvidia-nccl-cu12/2.23.4/), I noticed that the ext-net `nccl_net.h` header is now included in the package.
```
$ find venv…
```
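To double-check this without unpacking the wheel by hand, the installed distribution's file list can also be queried from Python. This is just a minimal sketch, assuming nvidia-nccl-cu12 2.23.4 is installed in the active environment:

```python
# Minimal sketch: list the header files shipped by the installed nvidia-nccl-cu12 wheel
# and check whether nccl_net.h is among them (assumes the package is installed).
from importlib.metadata import files

headers = [f for f in (files("nvidia-nccl-cu12") or []) if f.suffix == ".h"]
for h in headers:
    print(h)

print("nccl_net.h present:", any(h.name == "nccl_net.h" for h in headers))
```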
-
### Bug description
I am trying to run a very simple training script on 2 nodes, and I always get this error:
Output:
```
(ve) root@442a8ba5c0c6:~/ptl# . wr.sh
Start fitting...
Initializing…
```
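For context, a minimal 2-node Lightning run of the kind described here looks roughly like the sketch below. The actual script and `wr.sh` are not shown in the excerpt, so the Trainer arguments and the assumption that the launcher exports MASTER_ADDR, MASTER_PORT and NODE_RANK are mine, not the reporter's:

```python
# Hedged sketch of a minimal 2-node PyTorch Lightning job (not the reporter's actual script).
# Assumes the launcher (e.g. wr.sh) exports MASTER_ADDR, MASTER_PORT and NODE_RANK per node.
import torch
from torch.utils.data import DataLoader, TensorDataset

import lightning as L  # older versions: import pytorch_lightning as L


class ToyModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)


if __name__ == "__main__":
    data = DataLoader(TensorDataset(torch.randn(256, 32), torch.randn(256, 1)), batch_size=32)
    trainer = L.Trainer(
        accelerator="gpu",
        devices=1,        # GPUs per node (assumption)
        num_nodes=2,      # two participating nodes
        strategy="ddp",   # DDP uses the NCCL backend on GPUs
        max_epochs=1,
    )
    print("Start fitting...")
    trainer.fit(ToyModel(), data)
```

With this layout the same script is started once on each node, and the NCCL process group is formed during `trainer.fit`.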
-
[rank1]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[r…
-
### What happened + What you expected to happen
[Microbenchmark](https://github.com/ray-project/ray/blob/master/python/ray/_private/ray_experimental_perf.py#L150) results for a single-actor acceler…
-
I have a setup of 2 nodes with KubeVirt VMs, running with 2 GPUs:
` mpirun --allow-run-as-root --show-progress -H 10.194.9.3,10.194.10.5 -map-by node -np 2 -x PATH -x NCCL_IB_GID_INDEX=3 -x NCCL_D…
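Before debugging the full training job, it can help to reduce this to a two-rank all_reduce sanity check launched with the same `mpirun` command. The sketch below is an assumption on my part (it reads Open MPI's rank variables and uses the first host from the `-H` list as the rendezvous address), not part of the original report:

```python
# Hedged sketch: NCCL all_reduce sanity check, launched via the mpirun command above.
# Rank/size are taken from Open MPI's environment variables; adjust for other launchers.
import os

import torch
import torch.distributed as dist

rank = int(os.environ["OMPI_COMM_WORLD_RANK"])
world_size = int(os.environ["OMPI_COMM_WORLD_SIZE"])
local_rank = int(os.environ["OMPI_COMM_WORLD_LOCAL_RANK"])

os.environ.setdefault("MASTER_ADDR", "10.194.9.3")  # first host in the -H list
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(local_rank)                   # pin each rank to its own GPU
dist.init_process_group("nccl", rank=rank, world_size=world_size)

t = torch.ones(1, device="cuda")
dist.all_reduce(t)                                  # expect the result to equal world_size
print(f"rank {rank}: all_reduce -> {t.item()}")
dist.destroy_process_group()
```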
-
**Please describe the bug**
**Please describe the expected behavior**
**System information and environment**
- OS Platform and Distribution (e.g., Linux Ubuntu 16.04, docker): Linux Ubuntu 18.04…
-
I ran into this situation when training AllSpark on 2 RTX 3090s. I have tried many things, such as increasing the 'timeout' of init_process_group, increasing NCCL_BUFFSIZE, and setting NCCL_P2P_LEVEL=NVL. But all of th…
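For reference, the combination of mitigations described above (a larger init_process_group timeout plus NCCL environment variables) typically looks like the sketch below; the values are illustrative placeholders, not settings known to resolve this particular hang:

```python
# Hedged sketch of the mitigations mentioned above; values are illustrative only.
# Assumes the usual env:// rendezvous variables (MASTER_ADDR, MASTER_PORT, RANK,
# WORLD_SIZE) are provided by the launcher.
import datetime
import os

import torch.distributed as dist

os.environ["NCCL_BUFFSIZE"] = str(8 * 1024 * 1024)  # bytes; the NCCL default is 4 MiB
os.environ["NCCL_P2P_LEVEL"] = "NVL"                 # only use P2P over NVLink
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log the transport NCCL selects

dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(minutes=60),          # larger than the 30-minute default
)
```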
-
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/index.html
-
See the full log below. It handles the first few requests and then gets stuck:
```
2024-02-03 09:54:56,181 INFO worker.py:1724 -- Started a local Ray instance.
INFO 02-03 09:54:57 llm_engine.py:70…
```
-
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554786529/work/torch/lib/c10d/ProcessGroupNCCL.cpp:33, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA func…
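When NCCL reports an unhandled CUDA error like this, a reasonable first step is to confirm that each worker can actually see its GPUs and that the CUDA/NCCL builds PyTorch reports are the expected ones. A minimal hedged check, using only standard PyTorch introspection calls:

```python
# Hedged sanity-check sketch for "ncclUnhandledCudaError": verify basic CUDA visibility
# and the CUDA/NCCL versions PyTorch was built with before digging into NCCL itself.
import torch

print("torch:", torch.__version__)
print("cuda available:", torch.cuda.is_available())
print("visible devices:", torch.cuda.device_count())
print("built against CUDA:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
```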