nccl Search Results - Githubissues

1000+ results for nccl

NVIDIA/nccl-tests #95

Tests do not build/run with nvhpc -- missing link to CUDA Ru…

System: Perlmutter. Modules / Software we are compiling with: PrgEnv-nvidia/8.2.0 & nvidia/21.7. We at NERSC have these NCCL tests as our reframe test. Currently all tests fail if we want to r…

ronnieChatt updated 3 years ago
1
alpa-projects/alpa #950

cupy package mismatches with CUDA version in the docs

**Please describe the bug** Hi, according to the [alpa installation doc](https://alpa.ai/install.html), we need to `pip3 install cupy-cuda11x` to install cupy. However, when CUDA version is 11.1, acc…

serach24 updated 1 year ago
2
horovod/horovod #1913

Error when use XLA and HOROVOD_HIERARCHICAL_ALLREDUCE

**Environment:** 1. Framework: TensorFlow 2. Framework version: 1.15.0 3. Horovod version: 0.19.1 4. MPI version: 4.0.2 5. CUDA version: 10.0 6. NCCL version: 2.4.7 7. Python version: 3.6.8 8…

Agoniii updated 4 years ago
4
vmware-archive/bitfusion-with-kubernetes-integration #43

Can we use bitfusion to run Distributed Data Parallel Pytorc…

Recently, I got a VM with 2 A100 GPUs. I want to use this VM to run data parallel through PyTorch. However, I ran into several problems with the environment. I have succeeded on my lab's server with…

ljz756245026 updated 2 years ago
2
vllm-project/vllm #6717

[Bug]: custom docker Error

### Your current environment ```text The output of `python collect_env.py` ``` Differences between docker and local in docker: ``` CUDA runtime version:Could not collect cuDNN version: 9.0…

ciaoyizhen updated 5 days ago
8
pytorch/pytorch #37004

RuntimeError: NCCL error in ProcessGroupNCCL.cpp:290, unhand…

I came across this error `RuntimeError: NCCL error in ProcessGroupNCCL.cpp:290, unhandled system error` when trying to distribute neural network training to 4 GPUs in a single node with PyTorch 1.2. A…

qysnn updated 4 years ago
2
FunAudioLLM/CosyVoice #263

Continue training on Flow matching: the loss downgrades but …

Hi, when I tried to continue training the conditional flow matching on a new dataset (zh, collected from YouTube), I found that the loss drops a lot but the generated audio is totally unintell…

huskyachao updated 2 months ago
10
mathmanu/caffe-jacinto-models #17

Get error when use 3 or 4 gpus to train model

I installed NCCL to use more gpus to train model. install step: 1. git clone https://github.com/NVIDIA/nccl.git 2. cd nccl 3. sudo make install -j8 4. remove Makefile.config USE_NCCL comment W…

yumeihong updated 4 years ago
2
NVIDIA/nccl #431

Feature request - using 2 GPU workers on one large GPU (A100…

# Setup - A multi-GPU rig, having top of the line GPUs: - Several 3090 GPUs; - Or several A100 GPUs; - A `pytorch:1.7.0-cuda11.0-cudnn8-devel` container derivative; - Latest `docker`, `nvid…

snakers4 updated 2 years ago
5
OpenRLHF/OpenRLHF #460

Error in save_model after training when using adam_offload

After full-parameter fine-tuning of a 72B model on 16 GPUs, saving the model fails. Could someone please take a look? dev-worker-0: [2024-10-12 06:35:03,642] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python', '-u', '-m', 'openrlhf.cli.train_sft', '--local_rank=7', '--max_len'…

pythonla updated 3 weeks ago
3
