-
I am observing large performance variance with GDR, from 6500 to 8500 imgs/sec, in the following environment:
HW: 2 nodes with 8 GPUs each, connected via 25G Mellanox CX5.
NCCL: v2…
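To isolate whether the variance comes from GDR itself, a common first step is to benchmark raw NCCL bandwidth with nccl-tests and toggle GDR via environment variables. A minimal sketch for this 2x8 setup (the host names and message sizes are placeholders, not taken from the report):

```shell
# Measure all-reduce bus bandwidth across both nodes (8 GPUs each).
# NCCL_NET_GDR_LEVEL controls when GPUDirect RDMA is used; setting it
# to 0 disables GDR so the two runs can be compared directly.
mpirun -np 16 -H node0:8,node1:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_NET_GDR_LEVEL=2 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

# Repeat with GDR disabled for comparison:
mpirun -np 16 -H node0:8,node1:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_NET_GDR_LEVEL=0 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```

If the measured bus bandwidth is stable across repeated runs, the imgs/sec variance more likely comes from the training pipeline than from GDR.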
-
**Describe the bug**
![image](https://github.com/NVIDIA/Megatron-LM/assets/39549453/c1e3ea24-e371-4818-9d9f-b916bb34e0fe)
As shown in the figure above, `shared_embedding` and other parameters are di…
-
### 🐛 Describe the bug
DTensor shard uses more GPU memory than a raw tensor.
In my test, Shard GPU memory (21890 MiB) > Replicate GPU memory (17448 MiB) > raw-tensor GPU memory (16804 MiB).
This has confused me for a long time…
-
For source-available packages like CUDA samples, NCCL, NCCL-Tests, and Saxpy, we should mark them as broken if `cudaSupport` is false. My reasoning is this: when source-based packages generate code fo…
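In nixpkgs terms, this could be expressed as a conditional `meta.broken` on each derivation; a minimal sketch (everything here except the global `config.cudaSupport` toggle is illustrative):

```nix
# Sketch: mark a source-built CUDA package as broken when CUDA support
# is disabled, since the generated device code could never run.
{ lib, config, stdenv, ... }:

stdenv.mkDerivation {
  pname = "nccl-tests";
  version = "0.0.0"; # placeholder
  # ... src, buildInputs, etc. elided ...
  meta = {
    # config.cudaSupport is the global nixpkgs CUDA toggle.
    broken = !config.cudaSupport;
  };
}
```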
-
### 🐛 Describe the bug
When running a distributed program in a multi-node, multi-device environment using the following scripts (in my case, 2 nodes with 4 GPUs each):
run_ddp.sh
```bash
#node 0
…
-
I get that this comes at a cost; I just wanted to list these out in case they can help us get the build time below 6 hours.
I found these variables in `cmake/Dependencies.cmake`:
* `USE_SYS…
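For reference, disabling optional components at configure time is the usual lever; a hedged sketch of a CMake invocation using flags that do exist in PyTorch's build (whether each is safe to turn off, and how much time it saves, is an assumption depending on which features downstream packages need):

```shell
# Configure a slimmer PyTorch build by turning off optional subsystems.
#   BUILD_TEST=OFF       skips building C++ test binaries
#   USE_DISTRIBUTED=OFF  drops the gloo/NCCL/MPI backends
#   USE_CUDA=OFF         makes a CPU-only build
cmake .. \
  -DBUILD_TEST=OFF \
  -DUSE_DISTRIBUTED=OFF \
  -DUSE_CUDA=OFF
```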
-
Asynchronous/non-blocking communications are among the most critical optimizations in large model training, but they are prone to error. For example, `batch_isend_irecv` results in wrong data with NCC…
-
**train with bfloat16**
Is there a plan to support bfloat16 training? @maxhgerlach
-
As shown above, if the dataset contains no 'crop_and_zoomin' operation, training proceeds normally; but after adding that operation, training hangs at the `torch.distributed.broadcast` call under `mpu.broadcast_data` in the `broadcast_auto_com` function of fintune.py, and then returns the following result:
`
> [rank6]:[E ProcessGroupNCCL.cpp:523] […
-
Hello! I used some tracing tools to trace the all-reduce operation in NCCL and found that the execution of `runRing` in `all_reduce.h` on the GPU is always related to `sendProxyProgress()` in `net.cc`, which seems to…