-
## 🐛 Bug
I'm running a training job on 2 nodes in SageMaker, using torchrun to launch. The training dataset is a CombinedStreamingDataset configured with `train_weight_factors = [0.8,0.07,0.0…
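For reference, a weighted dataset mix like this is typically built along the lines of the sketch below with litdata's `CombinedStreamingDataset`; the dataset paths and weight values are placeholders, and the exact constructor arguments are an assumption about the reporter's setup rather than something taken from the issue.
```python
# Hedged sketch: assumes litdata's CombinedStreamingDataset takes a `weights`
# argument playing the role of the reporter's `train_weight_factors`.
# All paths and weight values are placeholders.
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

train_weight_factors = [0.8, 0.15, 0.05]  # hypothetical weights for illustration

datasets = [
    StreamingDataset(input_dir="s3://my-bucket/dataset_a"),  # placeholder URIs
    StreamingDataset(input_dir="s3://my-bucket/dataset_b"),
    StreamingDataset(input_dir="s3://my-bucket/dataset_c"),
]

# Samples are drawn from each sub-dataset in proportion to its weight factor.
train_dataset = CombinedStreamingDataset(datasets=datasets, weights=train_weight_factors)
train_loader = StreamingDataLoader(train_dataset, batch_size=8, num_workers=4)
```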
-
Hi! When I try to run the Python [script](https://github.com/pytorch/PiPPy/blob/main/examples/llama/pippy_llama.py) for LLM inference with pipeline parallelism on a single server with multiple GPUs, it turned…
-
### Your current environment
1.
torch 2.3.0+cu118
vllm 0.4.3+cu118
2.
[root@master1 v2]# pip show torch
Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python …
-
### Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```
### Model Input Dumps
model = LLM("DeepSeek-Coder-V2-Lite-Bas…
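For context, the model was presumably loaded through vLLM's offline `LLM` entry point, roughly as in the sketch below; the model identifier is a generic placeholder (the original one is truncated above), and `trust_remote_code=True` is an assumption commonly needed for DeepSeek checkpoints.
```python
# Hedged sketch of vLLM offline inference; the model path, prompt, and sampling
# values are placeholders, not taken from the original report.
from vllm import LLM, SamplingParams

llm = LLM(model="<model-checkpoint-path>", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["def quicksort(arr):"], sampling_params)

for output in outputs:
    # Each RequestOutput holds one or more completions; print the first one.
    print(output.outputs[0].text)
```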
-
### Your current environment
The output of `python collect_env.py`
```text
root@newllm201:/workspace# vim collect.py
root@newllm201:/workspace# python3 collect.py
Collecting environment info…
-
#!/bin/bash
# Launch configuration: single worker, single GPU, no model parallelism.
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1
MP_SIZE=1
# Resolve the directory containing this script and its parent (the project root).
script_path=$(realpath "$0")
script_dir=$(dirname "$script_path")
main_dir=$(dirname "$script_dir")
MODEL_TYPE="XrayGLM"
MODEL_ARGS="--ma…
-
### Feature request
Expand `AcceleratorConfig` and the corresponding transformers `Trainer` arguments so that transformers users can access the full feature set of accelerate through the config arguments supported by…
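As a point of reference, the sketch below shows roughly how the existing `accelerator_config` argument is passed to `TrainingArguments` today; the specific field names are my assumption about what the current `AcceleratorConfig` dataclass exposes, not an exhaustive list.
```python
# Hedged sketch of the current accelerator_config plumbing in transformers.
# Field names below are assumed; only a small subset of accelerate's options
# is configurable this way today, which is what this feature request targets.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",  # placeholder output directory
    accelerator_config={
        "split_batches": True,
        "dispatch_batches": False,
        "even_batches": True,
    },
)
```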
-
CI test **linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag** is consistently_failing. Recent failures:
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d1c2-1479-41d6-bf43…
-
### Your current environment
Using:
* vllm 0.4.1
* nccl 2.18.1
* pytorch 2.2.1
### 🐛 Describe the bug
During inference I sometimes get this error:
```bash
(RayWorkerWrapper pid=2376582…
-
I am using the Nsight Systems tool to observe the behavior of allreduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command w…
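A hedged sketch of how such a profiling run is typically launched is below; the binary path, report name, and size-sweep flags are placeholders rather than values taken from the report.
```python
# Hedged sketch: enables NCCL_P2P_USE_CUDA_MEMCPY and wraps the nccl-tests
# all-reduce benchmark with `nsys profile`. Paths and flag values are placeholders.
import os
import subprocess

env = dict(os.environ, NCCL_P2P_USE_CUDA_MEMCPY="1")

subprocess.run(
    [
        "nsys", "profile", "-o", "allreduce_memcpy_report",
        "./build/all_reduce_perf",   # placeholder path to the nccl-tests binary
        "-b", "8",                   # minimum message size (bytes)
        "-e", "128M",                # maximum message size
        "-f", "2",                   # multiplication factor between sizes
        "-g", "8",                   # number of GPUs
    ],
    env=env,
    check=True,
)
```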