-
afol-apiserver-72b-1 | (RayWorkerVllm pid=3779) [E ProcessGroupNCCL.cpp:475] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=16487777, OpType=ALLREDUCE, NumelIn=195911680, Nume…
-
When training with Megatron-LM and flash-ckpt, checkpoints cannot be saved successfully when `pipeline parallel` is set. It seems that not all checkpoint shards are saved to memory.
`Skip persisting the checkpoint of step 60 because the cached …
-
Is something wrong with my LLaMA-Factory version, or did I not install it correctly?
The full error output is as follows:
```
(langchain) zeng@zeng:~/llm/medical-chatbot$ sh run_training.sh
04/29/2024 15:43:19 - WARNING - llmtuner.hparams.parser - We recommend enable mixed p…
-
### Your current environment
```
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC vers…
-
Here is the Google Colab link I used for fine-tuning:
https://colab.research.google.com/drive/1kiALBR1UarPobiftZmiHfwFyk7hTCDnV?usp=sharing
When I fine-tune the LLM-embed for tool retriev…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ub…
-
### 🚀 The feature, motivation and pitch
I need a way to specify exactly which GPUs vLLM should use when multiple GPUs are available. Currently, it automatically occupies all available GPUs (https://do…
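
Until such an option exists, a common workaround (not vLLM-specific) is to restrict which devices the process can see via `CUDA_VISIBLE_DEVICES` before CUDA is initialized. A minimal sketch, with a placeholder model name and an assumed choice of physical GPUs:

```python
import os

# Assumption for illustration: expose only physical GPUs 2 and 3 to this
# process. This must be set before torch/vllm initialize CUDA; the visible
# devices are then renumbered as cuda:0 and cuda:1 inside the process.
os.environ["CUDA_VISIBLE_DEVICES"] = "2,3"

from vllm import LLM

# Placeholder model; tensor_parallel_size matches the two visible GPUs.
llm = LLM(model="facebook/opt-125m", tensor_parallel_size=2)
```

The same effect can be achieved from the shell, e.g. `CUDA_VISIBLE_DEVICES=2,3 python serve.py`, which avoids touching the code at all.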
-
### Your current environment
```text
The output of `python collect_env.py`
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
### Problem Description
I installed rPD and followed the tracing example in the README.md, but the run aborts (fails):
root@tw024:/ws/Try_rPD# runTracer.sh python matmult_gpu.py
Creating empty rpd: tra…
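
For reference, a sketch of what a `matmult_gpu.py`-style workload might contain (this is an assumption for reproduction purposes; the actual script from the example may differ):

```python
import torch

# Hypothetical stand-in for matmult_gpu.py: a simple GPU matmul workload
# for runTracer.sh to capture.
a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = a @ b
torch.cuda.synchronize()
print(c.sum().item())
```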
-
### Your current environment
Using latest available docker image: vllm/vllm-openai:v0.5.0.post1
### 🐛 Describe the bug
I am getting "Internal Server Error" as the response when calling the /v1/embedd…
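
For context, a minimal request sketch showing the kind of call involved; the host, port, and model name are assumptions, and the request body follows the standard OpenAI-compatible embeddings shape:

```python
import requests

# Assumed server address and model name; adjust to the deployed container.
resp = requests.post(
    "http://localhost:8000/v1/embeddings",
    json={"model": "intfloat/e5-mistral-7b-instruct", "input": "hello world"},
)
# Per the report, this returns HTTP 500 with "Internal Server Error".
print(resp.status_code, resp.text)
```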