nccl Search Results - Githubissues

1000+ results
for nccl


DecaYale/RNNPose #29

RuntimeError: NCCL error

When running the eval.py script with "--use_dist True", I am facing this error: RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1607370128159/work/torch/lib/c10d/ProcessGroupNCCL.cpp:784, u…

AramNasser updated 2 months ago
1
aws-samples/awsome-distributed-training #433

Add support for NCCOM micro-benchmark for Kubernetes

Similar to NCCL tests for Kubernetes https://github.com/aws-samples/awsome-distributed-training/tree/main/micro-benchmarks/nccl-tests/kubernetes - it would be great if there was a similar test for NCC…

bryantbiggs updated 8 hours ago
1
NVIDIA/nccl #1453

Poor NCCL allreduce performance

We are seeing an issue with NCCL allreduce performance that we would appreciate Nvidia's help on. We have three nodes split across two racks: Two nodes on one rack and one node on another rack. Two-…

twichell updated 1 month ago
4
vllm-project/vllm #9329

[Bug]: Exception in worker VllmWorkerProcess while processin…

### Your current environment PyTorch version: 2.3.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (U…

wangyao123456a updated 3 weeks ago
2
NCAR/spack-derecho #22

Install NCCL for Derecho

Would like to install NCCL as a dedicated module which can be linked into PyTorch / Tensorflow / other programs that want to use an optimized internode collective / peer communication. Repos to refe…

dphow updated 3 months ago
4
modelscope/ms-swift #2359

NCCL times out when model training reaches a fixed step

**Describe the bug** What the bug is, and how to reproduce, better with screenshots. NCCL times out when model training reaches a fixed step. ![6512a21092e02368ce384707d830cf8b](https://github.com/user-attachments/…

samaritan1998 updated 1 week ago
1
JarintotionDin/ZiRaGroundingDINO #4

Error Running Program on 4 GPU

Running the program on 4 GPUs, an error occurs at line 343 of train_multidatasets.py, getting stuck at the line results = evaluator.evaluate() in the inference_on_dataset function. The error message i…

witnessai updated 2 weeks ago
4
chrisdonahue/sheetsage #34

jukebox mode throws NCCL error

I am sharing this error in the hope that you find it useful. Below is the traceback. Let me know if there's anything I can do to make it more verbose or any particular info you want about my envir…

pgolbus updated 3 weeks ago
2
PaddlePaddle/Paddle #69172

In v3.0-beta2, test_collective_reduce_scatter_api fails with 'paddle.base…

### Describe the Bug NGC Paddle will be updated to v3.0-beta2, and `test_collective_reduce_scatter_api.py` fails with `AttributeError: 'paddle.base.libpaddle.pir.Value' object has no attribute 'desc'`. I used the docker officially provided by Paddle…

Wong4j updated 3 days ago
3
ROCm/ROCm #3956

[Issue]: When using multiple GPUs, an error will be reported…

### Problem Description I am going to use VLLM to start a QWEN model on an AMD GPU for testing. If I use a GPU to start it, it can start and use it normally. The log after startup is as follows: ` …

dotbalo updated 1 week ago
1
