-
I'm facing a problem with an NCCL kernel overlapping with a CUTLASS GEMM kernel.
I used a CUTLASS GEMM kernel with a grid size of … and my GPU has 142 SMs, so apparently there is a surplus of SMs. Then I…
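
A minimal sketch of one mitigation for this kind of SM contention (the CTA cap of 8 is an assumption, and `NCCL_MAX_CTAS` only exists in NCCL >= 2.17; older releases use `NCCL_MAX_NCHANNELS` instead):

```python
import os

# Assumption for illustration: cap NCCL at 8 CTAs (thread blocks) so its
# kernels occupy at most 8 SMs, leaving the remaining SMs to the GEMM.
# Must be set before NCCL is initialized.
os.environ["NCCL_MAX_CTAS"] = "8"

import torch

props = torch.cuda.get_device_properties(0)
print("SMs on this GPU:", props.multi_processor_count)  # 142 in this report
```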
-
### Describe the bug
I run the training but get this error
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'n…
```
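
For comparison, a minimal sketch of a script this FSDP config would drive under `accelerate launch` (the model and optimizer are placeholders, not taken from the report):

```python
import torch
from accelerate import Accelerator

# Sketch: Accelerator picks up the YAML written by `accelerate config`
# (distributed_type: FSDP) and wraps model/optimizer accordingly when the
# script is started via `accelerate launch`.
accelerator = Accelerator()
model = torch.nn.Linear(8, 8)                      # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
model, optimizer = accelerator.prepare(model, optimizer)
```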
-
Thank you for the effort of creating and maintaining the Derecho codebase.
RDMC could be very useful for GPU-based data replication as well. However, RDMC in its current form does not support GPUDire…
-
**Your question**
I'm puzzled by how Flux handles the problem of computation and communication competing for hardware resources when they overlap.
…
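
Not Flux's actual mechanism, but for reference, the generic PyTorch overlap pattern the question is about (assumes an initialized NCCL process group, e.g. under `torchrun`): communication is launched asynchronously so a GEMM can run concurrently, and the two kernel types then compete for SMs.

```python
import torch
import torch.distributed as dist

# Generic overlap pattern (not Flux internals): the async all-reduce runs on
# NCCL's internal stream while the matmul runs on the default stream, so the
# GPU scheduler co-schedules both kinds of kernels on the SMs.
dist.init_process_group("nccl")  # assumes torchrun-style env vars are set
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.randn(4096, 4096, device="cuda")
w = torch.randn(4096, 4096, device="cuda")

work = dist.all_reduce(x, async_op=True)  # communication kernel in flight
y = w @ w                                 # compute kernel overlaps with it
work.wait()                               # x now holds the reduced result
```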
-
### System Info
- `transformers` version: 4.44.2
- Platform: Linux-5.15.0-1068-aws-x86_64-with-glibc2.31
- Python version: 3.9.19
- Huggingface_hub version: 0.24.7
- Safetensors version: 0.4.5
-…
-
https://github.com/NVIDIA/nccl/blob/178b6b759074597777ce13438efb0e0ba625e429/src/include/coll_net.h#L10
Should an include be added?
```
#include "comm.h" // should this include be added?
#include "nccl.h"
#i…
```
-
### 📚 The doc issue
I found these environment variables in the PyTorch code. Is there any documentation that describes their intended use cases?
TORCH_NCCL_BLOCKING_WAIT
TORCH_NCCL_ASYNC_ERROR_HANDLING…
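
Until official docs turn up, a brief sketch of how these are commonly set (the values and timeout below are illustrative, not recommendations; both variables must be set before the NCCL process group is created):

```python
import os
from datetime import timedelta

# TORCH_NCCL_ASYNC_ERROR_HANDLING=1: the NCCL watchdog aborts the process on
# a collective error or timeout instead of letting ranks hang.
# TORCH_NCCL_BLOCKING_WAIT=1: the host thread blocks in wait() and raises on
# timeout; mainly useful for debugging, as it costs performance.
os.environ["TORCH_NCCL_ASYNC_ERROR_HANDLING"] = "1"

import torch.distributed as dist

# Assumes torchrun-style env vars (RANK, WORLD_SIZE, MASTER_ADDR/PORT).
dist.init_process_group("nccl", timeout=timedelta(minutes=10))
```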
-
I am sharing this error in the hope that you find it useful. Below is the traceback. Let me know if there's anything I can do to make it more verbose or any particular info you want about my envir…
-
Ray NCCL collectives fail allreduce on multi-GPU AWS g5 nodes because of an issue with how the node exposes topology information. The workaround is to apply `NCCL_P2P_DISABLE=1`, but this negatively i…
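
For reference, a sketch of one way to propagate the workaround to every Ray worker (scoping it per-job via `runtime_env` is an assumption about the deployment, not part of the report):

```python
import ray

# Sketch: push the variable into each Ray worker's environment so NCCL skips
# peer-to-peer transport; expect reduced intra-node bandwidth as noted above.
ray.init(runtime_env={"env_vars": {"NCCL_P2P_DISABLE": "1"}})
```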
-
We have GPU cluster nodes with 8 × H100 and 4 × 400 Gbps RoCE NICs. I ran nccl-tests on this cluster with the same nodes, but I find the tree bus bandwidth (150 GB/s) is slower than the ring bus bandwidth (190 GB/s). From my…
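
For context on how the compared numbers are produced (the inputs in the snippet are made up for illustration): nccl-tests reports all_reduce bus bandwidth as the measured algorithm bandwidth scaled by 2(n-1)/n, and the same factor is applied whether NCCL chose Tree or Ring, so the two figures are directly comparable.

```python
# How nccl-tests derives all_reduce "busbw" from a measured run:
#   busbw = algbw * 2*(n-1)/n, with n = number of ranks.
def allreduce_busbw_gbps(size_bytes: float, time_sec: float, n_ranks: int) -> float:
    algbw = size_bytes / time_sec / 1e9          # GB/s moved per rank
    return algbw * 2 * (n_ranks - 1) / n_ranks

# Hypothetical numbers for illustration only (not from the report above):
print(allreduce_busbw_gbps(8e9, 0.1, 16))        # -> 150.0 GB/s
```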