-
First of all,
thank you very much for building all those backward-compatible PyTorch binaries for the NVIDIA Tesla K40.
I am currently working on distributed computing with the NCCL backend (GPUs).
T…
-
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. [E Process…
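For context, one commonly suggested mitigation for this class of error (a sketch, not the reporter's code; the helper name, backend, and timeout value are illustrative) is to pass an explicit collective timeout when creating the process group, so that a slow or hung rank fails loudly instead of letting later kernels run on corrupted data:

```python
from datetime import timedelta

import torch.distributed as dist


def init_pg_with_timeout(backend="nccl", minutes=60, **kwargs):
    """Initialize a process group with an explicit collective timeout.

    A longer timeout gives slow ranks more headroom; when it is exceeded,
    the failing collective is reported instead of leaving subsequent GPU
    kernels running on corrupted/incomplete data.
    """
    dist.init_process_group(
        backend,
        timeout=timedelta(minutes=minutes),
        **kwargs,
    )
```

The same helper works with any backend; the `timeout` argument of `init_process_group` is part of the public API.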
-
# env
* env1: 3 HGX-H100 nodes, 24 GPUs in total. Identical bare-metal hardware and environment, NVIDIA driver 535.129.03
* env2: 3 HGX-A100 nodes, 24 GPUs in total. Identical bare-metal hardware and environment, NVIDIA driver…
-
### 🐛 Describe the bug
When using `torch.distributed._state_dict_utils._broadcast_tensors`, it is possible for tensors that need to be broadcast to live on the CPU (such as with a CPU offloaded …
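A minimal sketch of the staging pattern this implies (hypothetical helper; `src` and `device` are illustrative): move the CPU tensor to the GPU for the NCCL collective, then copy it back, since NCCL cannot operate on CPU memory directly:

```python
import torch
import torch.distributed as dist


def broadcast_cpu_tensor(tensor, src=0, device="cuda"):
    """Broadcast a possibly CPU-resident tensor over a GPU-only backend.

    The tensor is staged on `device` for the collective and the result is
    copied back to the tensor's original device afterwards.
    """
    staged = tensor.to(device)       # stage on the GPU for the collective
    dist.broadcast(staged, src=src)  # collective runs on device memory
    return staged.to(tensor.device)  # copy the result back
```

With a CPU-capable backend such as Gloo, `device="cpu"` makes the staging a no-op and the helper behaves like a plain broadcast.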
-
While running the following example with the sanitizer:
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/examples.html#example-1-single-process-single-thread-multiple-devices
I am facing the ne…
-
### Introduction
Horovod is a library that supports multi-machine distributed training for PyTorch, TensorFlow, and MXNet. Its inter-machine communication relies on NCCL or MPI underneath, so before installing it you usually need NCCL and OpenMPI installed, plus at least one deep-learning framework, e.g. MXNet:
```shell
python3 -m pip install gluonnlp==0.10.0 mxnet-cu102mkl==1.6.0.post0…
-
### Your current environment
The output of `python collect_env.py`:
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
Environment:
* CUDA: 11.3
* NCCL: 2.12
* PyTorch: 1.10.0
I ran into the following errors when compiling PyTorch 1.10.0 with [NCCL v2.12](https://github.com/NVIDIA/nccl/commit/d427af5d94dc8…
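For reference, a common way to build PyTorch against an external NCCL is to point the build at the system install (a hedged sketch; the paths below are placeholders, and `USE_SYSTEM_NCCL` / `NCCL_ROOT` are PyTorch build-time environment variables):

```shell
# Illustrative only: build PyTorch against a system-installed NCCL.
# Replace /usr/local/nccl-2.12 with your actual NCCL install prefix.
export USE_SYSTEM_NCCL=1
export NCCL_ROOT=/usr/local/nccl-2.12
export NCCL_INCLUDE_DIR=$NCCL_ROOT/include
export NCCL_LIB_DIR=$NCCL_ROOT/lib
python setup.py install
```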
-
![2021-11-24 10-29-57 screenshot](https://user-images.githubusercontent.com/20316898/143160662-aae74066-7ece-4c89-8573-207e1b77bec5.png)
There are some problems when I use --user-dir=${LIGHTSEQ_DIR}/…
-
### Describe the bug
This time I set the number of steps to 2 to make sure it correctly saves the model after an hour of training, but it does not.
### Reproduction
Run `accelerate config`
```
comp…