-
## 🐛 Bug
I'm running a training job on 2 nodes in SageMaker, using torchrun to launch. The training dataset is a CombinedStreamingDataset configured with `train_weight_factors = [0.8,0.07,0.0…
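For reference, a weighted dataset mix like this is typically built along the lines of the sketch below with litdata's `CombinedStreamingDataset`; the dataset paths and weight values are placeholders, and the exact constructor arguments are an assumption about the reporter's setup rather than something taken from the issue.
```python
# Hedged sketch: assumes litdata's CombinedStreamingDataset takes a `weights`
# argument playing the role of the reporter's `train_weight_factors`.
# All paths and weight values are placeholders.
from litdata import CombinedStreamingDataset, StreamingDataLoader, StreamingDataset

train_weight_factors = [0.8, 0.15, 0.05]  # hypothetical weights for illustration

datasets = [
    StreamingDataset(input_dir="s3://my-bucket/dataset_a"),  # placeholder URIs
    StreamingDataset(input_dir="s3://my-bucket/dataset_b"),
    StreamingDataset(input_dir="s3://my-bucket/dataset_c"),
]

# Samples are drawn from each sub-dataset in proportion to its weight factor.
train_dataset = CombinedStreamingDataset(datasets=datasets, weights=train_weight_factors)
train_loader = StreamingDataLoader(train_dataset, batch_size=8, num_workers=4)
```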
-
Hi! When I try to run the Python [script](https://github.com/pytorch/PiPPy/blob/main/examples/llama/pippy_llama.py) for LLM inference with pipeline parallelism on a single server with multiple GPUs, it turned…
-
### Your current environment
1.
torch 2.3.0+cu118
vllm 0.4.3+cu118
2.
[root@master1 v2]# pip show torch
Name: torch
Version: 2.3.0+cu118
Summary: Tensors and Dynamic neural networks in Python …
-
### Your current environment
The output of `python collect_env.py`
```text
Your output of `python collect_env.py` here
```
### Model Input Dumps
model = LLM("DeepSeek-Coder-V2-Lite-Bas…
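For context, the model was presumably loaded through vLLM's offline `LLM` entry point, roughly as in the sketch below; the model identifier is a generic placeholder (the original one is truncated above), and `trust_remote_code=True` is an assumption commonly needed for DeepSeek checkpoints.
```python
# Hedged sketch of vLLM offline inference; the model path, prompt, and sampling
# values are placeholders, not taken from the original report.
from vllm import LLM, SamplingParams

llm = LLM(model="<model-checkpoint-path>", trust_remote_code=True)

sampling_params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["def quicksort(arr):"], sampling_params)

for output in outputs:
    # Each RequestOutput holds one or more completions; print the first one.
    print(output.outputs[0].text)
```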
-
### Your current environment
The output of `python collect_env.py`
```text
root@newllm201:/workspace# vim collect.py
root@newllm201:/workspace# python3 collect.py
Collecting environment info…
-
#!/bin/bash
# Launch configuration: single worker, single GPU, no model parallelism.
NUM_WORKERS=1
NUM_GPUS_PER_WORKER=1
MP_SIZE=1
# Resolve the directory containing this script and its parent (the project root).
script_path=$(realpath "$0")
script_dir=$(dirname "$script_path")
main_dir=$(dirname "$script_dir")
MODEL_TYPE="XrayGLM"
MODEL_ARGS="--ma…
-
### Feature request
Expand `AcceleratorConfig` and the corresponding transformers `Trainer` arguments so that transformers users can access the full feature set of accelerate through the config arguments supported by…
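As a point of reference, the sketch below shows roughly how the existing `accelerator_config` argument is passed to `TrainingArguments` today; the specific field names are my assumption about what the current `AcceleratorConfig` dataclass exposes, not an exhaustive list.
```python
# Hedged sketch of the current accelerator_config plumbing in transformers.
# Field names below are assumed; only a small subset of accelerate's options
# is configurable this way today, which is what this feature request targets.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",  # placeholder output directory
    accelerator_config={
        "split_batches": True,
        "dispatch_batches": False,
        "even_batches": True,
    },
)
```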
-
CI test **linux://python/ray/dag:tests/experimental/test_mocked_nccl_dag** is consistently_failing. Recent failures:
- https://buildkite.com/ray-project/postmerge/builds/6696#0192d1c2-1479-41d6-bf43…
-
### Your current environment
Using:
* vllm 0.4.1
* nccl 2.18.1
* pytorch 2.2.1
### 🐛 Describe the bug
During inference I sometimes get this error:
```bash
(RayWorkerWrapper pid=2376582…
-
I am using the Nsight Systems tool to observe the behavior of allreduce_perf on a server with 8 H800 GPUs. I found that when NCCL_P2P_USE_CUDA_MEMCPY is enabled, the nsys profile command w…
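A hedged sketch of how such a profiling run is typically launched is below; the binary path, report name, and size-sweep flags are placeholders rather than values taken from the report.
```python
# Hedged sketch: enables NCCL_P2P_USE_CUDA_MEMCPY and wraps the nccl-tests
# all-reduce benchmark with `nsys profile`. Paths and flag values are placeholders.
import os
import subprocess

env = dict(os.environ, NCCL_P2P_USE_CUDA_MEMCPY="1")

subprocess.run(
    [
        "nsys", "profile", "-o", "allreduce_memcpy_report",
        "./build/all_reduce_perf",   # placeholder path to the nccl-tests binary
        "-b", "8",                   # minimum message size (bytes)
        "-e", "128M",                # maximum message size
        "-f", "2",                   # multiplication factor between sizes
        "-g", "8",                   # number of GPUs
    ],
    env=env,
    check=True,
)
```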