-
Hi, has anyone debugged a multi-node, cross-network nvlink_5GPU.xml topology file? I'd like to see how NCCL_GP actually performs in that case.
-
### 🐛 Describe the bug
Hi,
I encountered an NCCL error when using PyTorch version 2.1.0 with multiple GPUs.
When I downgraded PyTorch to 2.0.1, the error disappeared.
## Code
export NCCL_D…
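The export line above is cut off, so the exact setting is unknown; as a rough sketch, a multi-GPU repro with NCCL debug logging enabled (an assumed setting, not the original command) typically looks like this, launched with `torchrun --nproc_per_node=<num_gpus> repro.py`:
```python
# Hypothetical minimal repro sketch for the multi-GPU NCCL error.
import os

# NCCL_DEBUG=INFO surfaces NCCL's transport/topology decisions in the log;
# this is an assumed setting, since the original export line is truncated.
os.environ.setdefault("NCCL_DEBUG", "INFO")

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # the collective that reportedly fails on 2.1.0
    print(f"rank {dist.get_rank()}: sum = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```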
-
### 🐛 Describe the bug
Hi! I am encountering the following error when using `torch.distributed.all_reduce` on bfloat16 tensors of a certain size using NCCL: `RuntimeError: CUDA error: misaligned ad…
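The error message and the tensor size are truncated in the report, but a minimal sketch of the kind of call that triggers it could look like the following; the element count here is only a placeholder, launched with `torchrun --nproc_per_node=2 repro_bf16.py`:
```python
# Hypothetical sketch of an all_reduce on a bfloat16 tensor.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# The problematic size is elided in the report ("of a certain size");
# 12345 is just an illustrative value.
t = torch.randn(12345, dtype=torch.bfloat16, device=f"cuda:{local_rank}")
dist.all_reduce(t)  # reported to raise "CUDA error: misaligned ad..."
dist.destroy_process_group()
```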
-
### Your current environment
```
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC vers…
-
I use the latest code branch to build the Docker container.
I have successfully converted the original weights to TRT and built the model,
but when I use the command below to test my model:
NCCL_DEB…
-
![fig](https://github.com/user-attachments/assets/80398e7f-975b-4de1-9c9b-ff85633a5d77)
In code/overall/LLM_deepspeed.yaml, train_batch_size and eval_batch_size are both set to 1.
NCCL error for single GPU, do…
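Since the report mentions an NCCL error even with a single GPU, a hedged sanity check (independent of the DeepSpeed config above; addresses and sizes here are assumptions) is to initialize a one-rank NCCL process group and run a trivial all_reduce:
```python
# Hypothetical single-GPU sanity check: if NCCL already errors with
# world_size=1, the problem is independent of the DeepSpeed config.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
torch.cuda.set_device(0)

x = torch.ones(8, device="cuda:0")
dist.all_reduce(x)  # a no-op reduction on a single rank
print("single-GPU NCCL all_reduce ok:", x.tolist())
dist.destroy_process_group()
```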
-
Hi developer,
I have built the SHARP environment, and the SHARP plugin has been loaded successfully.
When the function **sharp_coll_comm_init** is run, it returns an error, so in the end NCCL falls back to the P2P NET path.
…
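As a rough sketch of how the CollNet/SHARP path is usually exercised and checked, one can force CollNet on and raise NCCL's log verbosity, then watch whether the init log reports CollNet being used or a fallback to the plain network transport; the environment variables below are standard NCCL knobs, but their values and the script layout are assumptions for illustration, launched with `torchrun` across the nodes:
```python
# Hypothetical check for the SHARP/CollNet path.
import os

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")    # request the CollNet (SHARP) transport
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log transport selection
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.ones(1 << 20, device=f"cuda:{local_rank}")
dist.all_reduce(x)
# If sharp_coll_comm_init fails, the INFO log shows NCCL falling back to the
# regular network (P2P NET) transport instead of CollNet.
dist.destroy_process_group()
```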
-
**Describe the bug**
Megatron-LM is not compatible with transformer-engine 1.13.
in transformer-engine:
https://github.com/NVIDIA/TransformerEngine/blob/2643ba1df43397cc84c9da5fe719a66d87ad9a0a/tr…
-
### Describe the bug
I run the training but get this error
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'n…
-
**Describe the bug**
When trying to set up the conda environment, it fails to install the nccl package.
```
(base) PS D:\OpenChatKit> conda env create -f environment.yml
Collecting package …