-
Hi, has anyone debugged a multi-node, cross-network nvlink_5GPU.xml topology file? I'd like to see how NCCL_GP actually performs in that case.
-
### 🐛 Describe the bug
Hi,
I encountered an NCCL error when using PyTorch version 2.1.0 with multiple GPUs.
When I downgraded PyTorch to 2.0.1, the error disappeared.
## Code
export NCCL_D…
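The export line above is cut off, so the exact setting is unknown; as a rough sketch, a multi-GPU repro with NCCL debug logging enabled (an assumed setting, not the original command) typically looks like this, launched with `torchrun --nproc_per_node=<num_gpus> repro.py`:
```python
# Hypothetical minimal repro sketch for the multi-GPU NCCL error.
import os

# NCCL_DEBUG=INFO surfaces NCCL's transport/topology decisions in the log;
# this is an assumed setting, since the original export line is truncated.
os.environ.setdefault("NCCL_DEBUG", "INFO")

import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.ones(1024, device=f"cuda:{local_rank}")
    dist.all_reduce(x)  # the collective that reportedly fails on 2.1.0
    print(f"rank {dist.get_rank()}: sum = {x[0].item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```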
-
### 🐛 Describe the bug
Hi! I am encountering the following error when using `torch.distributed.all_reduce` on bfloat16 tensors of a certain size using NCCL: `RuntimeError: CUDA error: misaligned ad…
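The error message and the tensor size are truncated in the report, but a minimal sketch of the kind of call that triggers it could look like the following; the element count here is only a placeholder, launched with `torchrun --nproc_per_node=2 repro_bf16.py`:
```python
# Hypothetical sketch of an all_reduce on a bfloat16 tensor.
import os
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

# The problematic size is elided in the report ("of a certain size");
# 12345 is just an illustrative value.
t = torch.randn(12345, dtype=torch.bfloat16, device=f"cuda:{local_rank}")
dist.all_reduce(t)  # reported to raise "CUDA error: misaligned ad..."
dist.destroy_process_group()
```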
-
### Your current environment
```
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC vers…
-
I use the latest code branch to build the Docker container.
I have successfully converted the original weights to TRT and built the model,
but when I use the command below to test my model:
NCCL_DEB…
-
![fig](https://github.com/user-attachments/assets/80398e7f-975b-4de1-9c9b-ff85633a5d77)
In code/overall/LLM_deepspeed.yaml, train_batch_size and eval_batch_size are both set to 1.
NCCL error for single GPU, do…
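Since the report mentions an NCCL error even with a single GPU, a hedged sanity check (independent of the DeepSpeed config above; addresses and sizes here are assumptions) is to initialize a one-rank NCCL process group and run a trivial all_reduce:
```python
# Hypothetical single-GPU sanity check: if NCCL already errors with
# world_size=1, the problem is independent of the DeepSpeed config.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)
torch.cuda.set_device(0)

x = torch.ones(8, device="cuda:0")
dist.all_reduce(x)  # a no-op reduction on a single rank
print("single-GPU NCCL all_reduce ok:", x.tolist())
dist.destroy_process_group()
```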
-
Hi developer,
I have built the SHARP environment, and the SHARP plugin has been loaded successfully.
When the function **sharp_coll_comm_init** is run, it returns an error, so in the end NCCL falls back to the P2P NET path.
…
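As a rough sketch of how the CollNet/SHARP path is usually exercised and checked, one can force CollNet on and raise NCCL's log verbosity, then watch whether the init log reports CollNet being used or a fallback to the plain network transport; the environment variables below are standard NCCL knobs, but their values and the script layout are assumptions for illustration, launched with `torchrun` across the nodes:
```python
# Hypothetical check for the SHARP/CollNet path.
import os

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")    # request the CollNet (SHARP) transport
os.environ.setdefault("NCCL_DEBUG", "INFO")          # log transport selection
os.environ.setdefault("NCCL_DEBUG_SUBSYS", "INIT,NET")

import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

x = torch.ones(1 << 20, device=f"cuda:{local_rank}")
dist.all_reduce(x)
# If sharp_coll_comm_init fails, the INFO log shows NCCL falling back to the
# regular network (P2P NET) transport instead of CollNet.
dist.destroy_process_group()
```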
-
**Describe the bug**
Megatron-LM is not compatible with transformer-engine 1.13.
in transformer-engine:
https://github.com/NVIDIA/TransformerEngine/blob/2643ba1df43397cc84c9da5fe719a66d87ad9a0a/tr…
-
### Describe the bug
I run the training but get this error
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'n…
-
**Describe the bug**
When trying to set up the conda environment, it fails to install the nccl package.
```
(base) PS D:\OpenChatKit> conda env create -f environment.yml
Collecting package …