-
I'm facing a problem with an NCCL kernel overlapping with a CUTLASS GEMM kernel.
I used a CUTLASS GEMM kernel with a grid size of … and my GPU has 142 SMs, so apparently there is a surplus of SMs. Then I…
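
A minimal sketch of one mitigation for this kind of SM contention (the CTA cap of 8 is an assumption, and `NCCL_MAX_CTAS` only exists in NCCL >= 2.17; older releases use `NCCL_MAX_NCHANNELS` instead):

```python
import os

# Assumption for illustration: cap NCCL at 8 CTAs (thread blocks) so its
# kernels occupy at most 8 SMs, leaving the remaining SMs to the GEMM.
# Must be set before NCCL is initialized.
os.environ["NCCL_MAX_CTAS"] = "8"

import torch

props = torch.cuda.get_device_properties(0)
print("SMs on this GPU:", props.multi_processor_count)  # 142 in this report
```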
-
### Describe the bug
I run the training but get this error
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'n…
```
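
For comparison, a minimal sketch of a script this FSDP config would drive under `accelerate launch` (the model and optimizer are placeholders, not taken from the report):

```python
import torch
from accelerate import Accelerator

# Sketch: Accelerator picks up the YAML written by `accelerate config`
# (distributed_type: FSDP) and wraps model/optimizer accordingly when the
# script is started via `accelerate launch`.
accelerator = Accelerator()
model = torch.nn.Linear(8, 8)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)
```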
-
Thank you for the effort of creating and maintaining the Derecho codebase.
RDMC could be very useful for GPU-based data replication as well. However, RDMC in its current form does not support GPUDire…
-
**Your question**
I'm puzzled by how Flux handles the problem of computation and communication competing for hardware resources when they overlap.
…
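
Not Flux's actual mechanism, but for reference, the generic PyTorch overlap pattern the question is about (assumes an initialized NCCL process group, e.g. under `torchrun`): communication is launched asynchronously so a GEMM can run concurrently, and the two kernel types then compete for SMs.

```python
import torch
import torch.distributed as dist

# Generic overlap pattern (not Flux internals): the async all-reduce runs on
# NCCL's internal stream while the matmul runs on the default stream, so the
# GPU scheduler co-schedules both kinds of kernels on the SMs.
dist.init_process_group("nccl")  # assumes torchrun-style env vars are set
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

work = dist.all_reduce(x, async_op=True)  # communication kernel in flight
y = w @ w                                 # compute kernel overlaps with it
work.wait()                               # x now holds the reduced result
```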
-
### System Info
- `transformers` version: 4.44.2
- Platform: Linux-5.15.0-1068-aws-x86_64-with-glibc2.31
- Python version: 3.9.19
- Huggingface_hub version: 0.24.7
- Safetensors version: 0.4.5
-…
-
https://github.com/NVIDIA/nccl/blob/178b6b759074597777ce13438efb0e0ba625e429/src/include/coll_net.h#L10
Should an include be added?
```
#include "comm.h" // should this include be added?
#include "nccl.h"
#i…
```
-
### 📚 The doc issue
I found these environment variables in the PyTorch code. Is there any documentation that describes their intended use cases?
TORCH_NCCL_BLOCKING_WAIT
TORCH_NCCL_ASYNC_ERROR_HANDLING…
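
Until official docs turn up, a brief sketch of how these are commonly set (the values and timeout below are illustrative, not recommendations; both variables must be set before the NCCL process group is created):

```python
import os
from datetime import timedelta

# TORCH_NCCL_ASYNC_ERROR_HANDLING=1: the NCCL watchdog aborts the process on
# a collective error or timeout instead of letting ranks hang.
# TORCH_NCCL_BLOCKING_WAIT=1: the host thread blocks in wait() and raises on
# timeout; mainly useful for debugging, as it costs performance.
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"

import torch.distributed as dist

# Assumes torchrun-style env vars (RANK, WORLD_SIZE, MASTER_ADDR/PORT).
dist.init_process_group("nccl", timeout=timedelta(minutes=10))
```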
-
I am sharing this error in the hope that you find it useful. Below is the traceback. Let me know if there's anything I can do to make it more verbose or any particular info you want about my envir…
-
Ray NCCL collectives fail allreduce on multi-GPU AWS g5 nodes because of an issue with how the node exposes topology information. The workaround is to apply `NCCL_P2P_DISABLE=1`, but this negatively i…
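
For reference, a sketch of one way to propagate the workaround to every Ray worker (scoping it per-job via `runtime_env` is an assumption about the deployment, not part of the report):

```python
import ray

# Sketch: push the variable into each Ray worker's environment so NCCL skips
# peer-to-peer transport; expect reduced intra-node bandwidth as noted above.
ray.init(runtime_env={"env_vars": {"NCCL_P2P_DISABLE": "1"}})
```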
-
We have GPU cluster nodes with 8 × H100 and 4 × 400 Gbps RoCE NICs. I ran nccl-tests on this cluster with the same nodes, but I find the tree bus bandwidth (150 GB/s) is slower than the ring bus bandwidth (190 GB/s). From my…
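
For context on how the compared numbers are produced (the inputs in the snippet are made up for illustration): nccl-tests reports all_reduce bus bandwidth as the measured algorithm bandwidth scaled by 2(n-1)/n, and the same factor is applied whether NCCL chose Tree or Ring, so the two figures are directly comparable.

```python
# How nccl-tests derives all_reduce "busbw" from a measured run:
#   busbw = algbw * 2*(n-1)/n, with n = number of ranks.
def allreduce_busbw_gbps(size_bytes: float, time_sec: float, n_ranks: int) -> float:
    algbw = size_bytes / time_sec / 1e9          # GB/s moved per rank
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Hypothetical numbers for illustration only (not from the report above):
print(allreduce_busbw_gbps(8e9, 0.1, 16))        # -> 150.0 GB/s
```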