-
I am observing large performance variance with GDR, from 6500 to 8500 imgs/sec, in the following environment:
HW: 2 nodes with 8 GPUs each, connected via 25G Mellanox CX5.
NCCL: v2…
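To isolate whether the variance comes from GDR itself, a common first step is to benchmark raw NCCL bandwidth with nccl-tests and toggle GDR via environment variables. A minimal sketch for this 2x8 setup (the host names and message sizes are placeholders, not taken from the report):

```shell
# Measure all-reduce bus bandwidth across both nodes (8 GPUs each).
# NCCL_NET_GDR_LEVEL controls when GPUDirect RDMA is used; setting it
# to 0 disables GDR so the two runs can be compared directly.
mpirun -np 16 -H node0:8,node1:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_NET_GDR_LEVEL=2 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1

# Repeat with GDR disabled for comparison:
mpirun -np 16 -H node0:8,node1:8 \
    -x NCCL_DEBUG=INFO \
    -x NCCL_NET_GDR_LEVEL=0 \
    ./build/all_reduce_perf -b 8 -e 256M -f 2 -g 1
```

If the measured bus bandwidth is stable across repeated runs, the imgs/sec variance more likely comes from the training pipeline than from GDR.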
-
**Describe the bug**
![image](https://github.com/NVIDIA/Megatron-LM/assets/39549453/c1e3ea24-e371-4818-9d9f-b916bb34e0fe)
As shown in the figure above, `shared_embedding` and other parameters are di…
-
### 🐛 Describe the bug
DTensor shard uses more GPU memory than a raw tensor.
In my test, Shard GPU memory (21890 MiB) > Replicate GPU memory (17448 MiB) > raw-tensor GPU memory (16804 MiB).
This has confused me for a long time…
-
For source-available packages like CUDA samples, NCCL, NCCL-Tests, and Saxpy, we should mark them as broken if `cudaSupport` is false. My reasoning is this: when source-based packages generate code fo…
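In nixpkgs terms, this could be expressed as a conditional `meta.broken` on each derivation; a minimal sketch (everything here except the global `config.cudaSupport` toggle is illustrative):

```nix
# Sketch: mark a source-built CUDA package as broken when CUDA support
# is disabled, since the generated device code could never run.
{ lib, config, stdenv, ... }:

stdenv.mkDerivation {
  pname = "nccl-tests";
  version = "0.0.0"; # placeholder
  # ... src, buildInputs, etc. elided ...
  meta = {
    # config.cudaSupport is the global nixpkgs CUDA toggle.
    broken = !config.cudaSupport;
  };
}
```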
-
### 🐛 Describe the bug
When running a distributed program in a multi-node, multi-device environment using the following scripts (in my case, 2 nodes with 4 GPUs each):
run_ddp.sh
```bash
#node 0
…
-
I get that this comes at a cost; I just wanted to list these out in case they can help us get the build time below 6 hours.
I found these variables in `cmake/Dependencies.cmake`:
* `USE_SYS…
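For reference, disabling optional components at configure time is the usual lever; a hedged sketch of a CMake invocation using flags that do exist in PyTorch's build (whether each is safe to turn off, and how much time it saves, is an assumption depending on which features downstream packages need):

```shell
# Configure a slimmer PyTorch build by turning off optional subsystems.
#   BUILD_TEST=OFF       skips building C++ test binaries
#   USE_DISTRIBUTED=OFF  drops the gloo/NCCL/MPI backends
#   USE_CUDA=OFF         makes a CPU-only build
cmake .. \
  -DBUILD_TEST=OFF \
  -DUSE_DISTRIBUTED=OFF \
  -DUSE_CUDA=OFF
```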
-
Asynchronous/non-blocking communications are among the most critical optimizations in large model training, but they are prone to error. For example, `batch_isend_irecv` results in wrong data with NCC…
-
**train with bfloat16**
Is there a plan to support bfloat16 training? @maxhgerlach
-
As shown above, if the dataset contains no 'crop_and_zoomin' operation, training proceeds normally; but after adding that operation, training hangs at the `torch.distributed.broadcast` call under `mpu.broadcast_data` in the `broadcast_auto_com` function of fintune.py, and then returns the following result:
`
> [rank6]:[E ProcessGroupNCCL.cpp:523] […
-
Hello! I used some tracing tools to trace the all-reduce operation in NCCL and found that the execution of `runRing` in `all_reduce.h` on the GPU is always related to `sendProxyProgress()` in `net.cc`, which seems to…