-
afol-apiserver-72b-1 | (RayWorkerVllm pid=3779) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16487777, OpType=ALLREDUCE, NumelIn=195911680, Nume…
-
When training with Megatron-LM and flash-ckpt, checkpoints cannot be saved successfully when `pipeline parallel` is set. It seems that not all checkpoint shards are saved to memory.
`Skip persisting the checkpoint of step 60 because the cached …
-
Is something wrong with my LLaMA-Factory version, or did I not install it correctly?
The full error output is as follows:
```
(langchain) zeng@zeng:~/llm/medical-chatbot$ sh run_training.sh
04/29/2024 15:43:19 - WARNING - llmtuner.hparams.parser - We recommend enable mixed p…
-
### Your current environment
```
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC vers…
-
Here is the Google Colab link I used for fine-tuning:
https://colab.research.google.com/drive/1kiALBR1UarPobiftZmiHfwFyk7hTCDnV?usp=sharing
When I fine-tune the LLM-embed for tool retriev…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ub…
-
### 🚀 The feature, motivation and pitch
I need a way to specify exactly which GPUs vLLM should use when multiple GPUs are available. Currently, it automatically occupies all available GPUs (https://do…
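
Until such an option exists, a common workaround (not vLLM-specific) is to restrict which devices the process can see via `CUDA_VISIBLE_DEVICES` before CUDA is initialized. A minimal sketch, with a placeholder model name and an assumed choice of physical GPUs:

```python
import os

# Assumption for illustration: expose only physical GPUs 2 and 3 to this
# process. This must be set before torch/vllm initialize CUDA; the visible
# devices are then renumbered as cuda:0 and cuda:1 inside the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from vllm import LLM

# Placeholder model; tensor_parallel_size matches the two visible GPUs.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
```

The same effect can be achieved from the shell, e.g. `CUDA_VISIBLE_DEVICES=2,3 python serve.py`, which avoids touching the code at all.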
-
### Your current environment
```text
The output of `python collect_env.py`
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
### Problem Description
I installed rPD and followed the tracing example in the README.md, but the run aborts (fails):
root@tw024:/ws/Try_rPD# runTracer.sh python matmult_gpu.py
Creating empty rpd: tra…
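
For reference, a sketch of what a `matmult_gpu.py`-style workload might contain (this is an assumption for reproduction purposes; the actual script from the example may differ):

```python
import torch

# Hypothetical stand-in for matmult_gpu.py: a simple GPU matmul workload
# for runTracer.sh to capture.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
torch.cuda.synchronize()
print(c.sum().item())
```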
-
### Your current environment
Using latest available docker image: vllm/vllm-openai:v0.5.0.post1
### 🐛 Describe the bug
I am getting "Internal Server Error" as the response when calling the /v1/embedd…
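
For context, a minimal request sketch showing the kind of call involved; the host, port, and model name are assumptions, and the request body follows the standard OpenAI-compatible embeddings shape:

```python
import requests

# Assumed server address and model name; adjust to the deployed container.
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "intfloat/e5-mistral-7b-instruct", "input": "hello world"},
)
# Per the report, this returns HTTP 500 with "Internal Server Error".
print(resp.status_code, resp.text)
```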