-
### Problem Description
Hi Everyone,
I have found very strange behavior in rccl-rocm-6.1.2 that I cannot explain with my limited knowledge of the LL implementation. The behavior is for AllG…
-
### 📚 The doc issue
I found these environment variables in the PyTorch code. Is there any document that describes the application scenarios?
TORCH_NCCL_BLOCKING_WAIT
TORCH_NCCL_ASYNC_ERROR_HANDLING…
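
For context, here is a minimal sketch of how these two variables are typically supplied, assuming they are read when the NCCL process group is created and therefore have to be in the environment before that point. The helper name `nccl_env` is hypothetical, and the values shown are assumptions to check against the PyTorch version in use, not documented recommendations.

```python
import os


def nccl_env(blocking_wait: bool, async_error_handling: int) -> dict:
    """Build the two settings as strings (hypothetical helper, not a PyTorch API)."""
    return {
        # Assumed meaning: "1" makes wait() block on the host until the
        # collective finishes or the process-group timeout expires.
        "TORCH_NCCL_BLOCKING_WAIT": "1" if blocking_wait else "0",
        # Assumed meaning: a non-zero value lets the watchdog react to a
        # failed or timed-out collective instead of hanging silently.
        "TORCH_NCCL_ASYNC_ERROR_HANDLING": str(async_error_handling),
    }


# Export the settings before torch.distributed.init_process_group(...) runs.
os.environ.update(nccl_env(blocking_wait=False, async_error_handling=1))
```

The same effect can be had by exporting the variables in the launch script. Setting them in Python only works if it happens before the process group is initialized.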
-
**Describe the bug**
When fine-tuning the MiniCPM-v-2.6 model with the `swift sft` command, training suddenly failed partway through with this error:
Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on…
-
**Describe the bug**
When I use the most recent Megatron-LM fork, I get the following error:
```
make: Entering directory '/workspace/megatron-lm/megatron/core/datasets'
g++ -O3 -Wall -sha…
-
Hi @tdrussell,
First of all, thank you so much for your helpful discussion in another issue earlier!
Now I am able to use qlora-pipe with DeepSpeed in a two-node environment with 12 * 80 GB GPUs fo…
-
### Checklist
- [ ] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [ ] 3. Please note that if the bug-related issue y…
-
### Your current environment
```
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC vers…
-
[0] NCCL INFO cudaDriverVersion 11040
[0] NCCL INFO Bootstrap : Using eth0:10.84.253.70
[0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
File "/usr/loc…
-
Hello,
I am currently running a deep learning workload on 2 nodes, both connected with RoCE. When I run a similar application in a different environment, I get the following speeds.
…
-
We observed good overlap with FSDP + PGLE:
![Bq7PCuqyJbygSuL](https://github.com/user-attachments/assets/0cff27c4-6499-43d0-b436-ef01a2833ae0). Turning PGLE on or off makes a big difference here.
…