-
Hi all. I encountered some problems when building SwiftTransformer as a dependency for DistServer.
My GPU is a 4090.
```
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Bu…
```
-
Happy New Year, NCCL developers and community members!
Recently I have been trying to find the upper bound of NCCL allreduce performance in our network environment. I tried various methods and referred t…
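When reasoning about an allreduce upper bound, the usual starting point is the ring-allreduce traffic model (the same `2*(n-1)/n` factor nccl-tests uses to compute "bus bandwidth"). Below is a minimal sketch of that model; the function names are illustrative, not part of any NCCL API:

```python
# Sketch of the ring-allreduce performance model commonly used to bound
# allreduce time. Function names here are illustrative, not NCCL APIs.

def allreduce_bus_bw(size_bytes: float, time_s: float, n_ranks: int) -> float:
    """Bus bandwidth in GB/s: algorithm bandwidth scaled by 2*(n-1)/n,
    the fraction of the buffer each rank must send in a ring allreduce."""
    algo_bw = size_bytes / time_s / 1e9            # GB/s of user data moved
    return algo_bw * 2 * (n_ranks - 1) / n_ranks   # per-link traffic factor

def min_allreduce_time(size_bytes: float, link_bw_GBps: float, n_ranks: int) -> float:
    """Lower bound on allreduce time given per-rank link bandwidth (GB/s),
    ignoring latency terms: each rank sends 2*(n-1)/n of the buffer."""
    traffic = size_bytes * 2 * (n_ranks - 1) / n_ranks
    return traffic / (link_bw_GBps * 1e9)

# Example: 1 GiB allreduce over 8 ranks with 25 GB/s per-rank links.
t = min_allreduce_time(2**30, 25.0, 8)
print(f"best-case time: {t * 1e3:.2f} ms")
```

If a measured run reaches a bus bandwidth close to the link bandwidth, the collective is already near this bound and further tuning mostly affects the latency (small-message) regime.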
-
Hi NCCL team. I have read your blog [doubling all2all performance with nccl 2.12](https://developer.nvidia.com/blog/doubling-all2all-performance-with-nvidia-collective-communication-library-2-12).
…
-
Megatron-LM training hangs with the following error message: ReduceScatter failed to finish within the timeout (30 mins). It is tricky that NCCL reports no error log. I have no idea ho…
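When NCCL itself stays silent, the first step is usually to turn on NCCL's own logger before the job initializes NCCL, so a hung collective at least leaves a trace. `NCCL_DEBUG`, `NCCL_DEBUG_SUBSYS`, and `NCCL_DEBUG_FILE` are standard NCCL environment variables; the sketch below just sets them from Python before process-group initialization would happen (a launcher-side `export` works equally well):

```python
import os

# Hedged sketch: enable NCCL's logging so a stuck ReduceScatter leaves a
# per-host/per-pid trace. These must be set before NCCL is initialized
# (i.e. before torch.distributed.init_process_group in a Megatron launcher).
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,COLL"            # init + collectives
os.environ["NCCL_DEBUG_FILE"] = "/tmp/nccl.%h.%p.log"    # %h=host, %p=pid

print(os.environ["NCCL_DEBUG"], os.environ["NCCL_DEBUG_FILE"])
```

The last collective each rank logged before the hang usually identifies which rank (or which collective mismatch) the others are waiting on.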
-
Hi Team,
I am running the nccl-bw:ib test for an H100 cluster using superbench, but the bandwidth we are getting is very low: around 64 Gb/s, when it should be around 400 Gb/s. I am currently n…
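One pitfall worth ruling out before debugging the fabric: IB link speeds are quoted in gigabits per second, while NCCL benchmarks usually report gigabytes per second, an 8x difference. A quick unit sanity check (pure illustration, no superbench API involved):

```python
# Hedged sketch: sanity-check units before concluding the fabric is slow.
# IB NDR is 400 Gb/s (bits); NCCL tools typically report GB/s (bytes).

def gbps_to_GBps(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8.0

ndr_GBps = gbps_to_GBps(400.0)      # a 400 Gb/s link tops out at 50 GB/s payload
print(ndr_GBps)                     # -> 50.0

# If the reported "64" is GB/s, it already exceeds one 400 Gb/s link
# (plausible with multiple rail-optimized NICs). If it is Gb/s, the run
# reaches only 64/400 = 16% of line rate and is genuinely slow.
print(64 / 400)                     # -> 0.16
```

So the first question is which unit superbench is printing; the debugging path differs completely between the two readings.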
-
### Bug description
Training freezes when using `ddp` on a Slurm cluster (`dp` runs as expected). The dataset is loaded via torchdata from an S3 bucket. Similar behaviour also arises when using webda…
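A common cause of this pattern (freeze under `ddp` while `dp` works) is ranks drawing unequal numbers of samples from a streaming source, so some ranks block in a gradient allreduce that others never enter. A minimal sketch of sharding truncated to equal lengths; `shard_equal` is a hypothetical helper, not a torchdata API:

```python
# Hedged sketch: round-robin shard a key list across ranks, truncated so
# every rank sees the same number of samples and thus the same step count.
# `shard_equal` is a hypothetical helper, not part of torchdata.

def shard_equal(keys: list[str], rank: int, world_size: int) -> list[str]:
    """Assign keys[rank::world_size], dropped to the shortest shard length."""
    per_rank = len(keys) // world_size   # drop the remainder entirely
    return keys[rank::world_size][:per_rank]

keys = [f"s3://bucket/sample-{i}" for i in range(10)]   # illustrative keys
shards = [shard_equal(keys, r, 4) for r in range(4)]
print([len(s) for s in shards])   # -> [2, 2, 2, 2]
```

With naive `keys[rank::world_size]` sharding, rank 0 here would get 3 samples while ranks 1–3 get 2, which is exactly the mismatch that makes DDP's collectives hang.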
-
### 🐛 Describe the bug
When I run the [examples/language/gpt/gemini/run_gemini.sh](https://github.com/hpcaitech/ColossalAI/blob/main/examples/language/gpt/gemini/run_gemini.sh) script based on the official…
-
Hi NCCL team,
Looks like there is no official build of v2.5.7-1 for download at https://developer.nvidia.com/nccl/nccl-download. Do you plan to add one?
-
Hello team,
I noticed you have been updating most of the recent tagged releases with the `-aws` suffix. I think I can assist if you need help testing on other libfabric providers.
We have some …
-
Hi there,
I want to ask about the performance comparison between the int32 and fp16 datatypes when using the allreduce API. I am not sure whether it is normal, but the int32 latency is almost 6x larger than…
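From message size alone, the naive expectation is only a ~2x gap: an int32 element is twice as wide as an fp16 element, so an allreduce of the same element count moves twice the bytes. A quick sketch of that baseline (pure arithmetic, no NCCL calls):

```python
# Hedged sketch: baseline expectation from payload size alone. An int32
# allreduce of N elements moves 2x the bytes of an fp16 allreduce of N
# elements, so ~2x latency is the naive bound; a 6x gap suggests a cause
# beyond data volume (protocol/algorithm switch, reduction kernel cost,
# message-size regime), which is worth profiling.

BYTES_PER_ELEM = {"fp16": 2, "int32": 4}

def payload_bytes(dtype: str, n_elems: int) -> int:
    """Bytes of user data in an allreduce of n_elems elements of dtype."""
    return BYTES_PER_ELEM[dtype] * n_elems

n = 1 << 20  # 1M elements
ratio = payload_bytes("int32", n) / payload_bytes("fp16", n)
print(ratio)   # -> 2.0
```

If the 6x holds across a range of sizes (not just small messages, where latency rather than bandwidth dominates), comparing the two runs with NCCL_DEBUG=INFO to see whether the algorithm/protocol selection differs would be a reasonable next step.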