-
System: Perlmutter.
Modules / Software we are compiling with: PrgEnv-nvidia/8.2.0 & nvidia/21.7.
We at NERSC have these NCCL tests as our ReFrame tests. Currently all tests fail if we want to r…
-
**Please describe the bug**
Hi, according to the [alpa installation doc](https://alpa.ai/install.html), we need to `pip3 install cupy-cuda11x` to install cupy. However, when CUDA version is 11.1, acc…
-
**Environment:**
1. Framework: TensorFlow
2. Framework version: 1.15.0
3. Horovod version: 0.19.1
4. MPI version: 4.0.2
5. CUDA version: 10.0
6. NCCL version: 2.4.7
7. Python version: 3.6.8
8…
-
Recently, I got a VM with 2 A100 GPUs. I want to use this VM to run data-parallel training with PyTorch. However, I have run into several problems with the environment. I have succeeded on my lab's server with…
-
### Your current environment
```text
The output of `python collect_env.py`
```
Differences between docker and local
in docker:
```
CUDA runtime version:Could not collect
cuDNN version: 9.0…
-
I came across this error `RuntimeError: NCCL error in ProcessGroupNCCL.cpp:290, unhandled system error` when trying to distribute neural network training across 4 GPUs on a single node with PyTorch 1.2. A…
-
Hi, when I tried to continue training the conditional flow-matching model on a new dataset (zh, collected from YouTube), I found that the loss decreased a lot but the generated audio is totally unintell…
-
I installed NCCL so that I could use more GPUs to train the model.
Installation steps:
1. git clone https://github.com/NVIDIA/nccl.git
2. cd nccl
3. sudo make install -j8
4. uncomment USE_NCCL in Makefile.config
W…
-
# Setup
- A multi-GPU rig with top-of-the-line GPUs:
  - Several 3090 GPUs;
  - Or several A100 GPUs;
- A `pytorch:1.7.0-cuda11.0-cudnn8-devel` container derivative;
- Latest `docker`, `nvid…
-
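After full-parameter fine-tuning of a 72B model on 16 GPUs, an error is raised when saving the model. Could someone please take a look?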
dev-worker-0: [2024-10-12 06:35:03,642] [ERROR] [launch.py:325:sigkill_handler] ['/usr/bin/python', '-u', '-m', 'openrlhf.cli.train_sft', '--local_rank=7', '--max_len'…