-
The code is as follows:
```
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
max_model_len, tp_size = 131072, 1
model_name = "/models/codegeex4-all-9b"
tokenizer = AutoTokenizer.from_pr…
-
I have updated to the latest version and used the “spawn” method,
`export VLLM_WORKER_MULTIPROC_METHOD=spawn`
but the error still persists. Could you please help me?
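For reference, here is a minimal, self-contained sketch of the offline-inference pattern described above, using the model path and settings from the truncated snippet. The prompt and sampling values are placeholders, and the `if __name__ == "__main__":` guard is the usual requirement when the worker multiprocessing method is `spawn`; this is a sketch under those assumptions, not the reporter's exact script.
```python
# Hypothetical minimal repro: offline generation with vLLM using the "spawn"
# worker method. Prompt and sampling values are placeholders.
import os

# Must be set before vLLM creates its worker processes.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams


def main():
    model_name = "/models/codegeex4-all-9b"  # local path from the report
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

    llm = LLM(
        model=model_name,
        tensor_parallel_size=1,
        max_model_len=131072,
        trust_remote_code=True,
    )

    messages = [{"role": "user", "content": "Write a hello-world in Python."}]
    prompt = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )

    outputs = llm.generate([prompt], SamplingParams(temperature=0.2, max_tokens=256))
    print(outputs[0].outputs[0].text)


# With spawn, module-level code re-runs in every worker process,
# so keep engine construction under the main guard.
if __name__ == "__main__":
    main()
```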
-
### Your current environment
The output of `python collect_env.py`
```
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTor…
-
I ran into the following issue:
INFO 09-22 21:48:03 api_server.py:495] vLLM API server version 0.6.1
INFO 09-22 21:48:03 api_server.py:496] args: Namespace(host='0.0.0.0', port=40116, uvicorn_log_level='…
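The log above comes from vLLM's OpenAI-compatible API server (host `0.0.0.0`, port `40116`). As a hedged, client-side smoke test against that endpoint, something like the sketch below should work; the model name is a placeholder (it must match the served model name or model path), and `api_key="EMPTY"` is just a dummy value for a server started without authentication.
```python
# Illustrative client check against the server in the log above; the model
# name is a placeholder, not taken from the report.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:40116/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="placeholder-model-name",  # must match the served model name
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```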
-
### Your current environment
conda nccl v2.21.5.1
### 🐛 Describe the bug
I have 4 GPUs: three 3090s and one 2080 Ti 22G.
I'm trying to load cat llama 70b 5.0bpw EXL2 with Aphrodite. If I don't disable …
-
Hi, thanks for your awesome work!
I'm trying to implement https://github.com/SafeAILab/EAGLE with high-performance kernels. I read [this blog](https://flashinfer.ai/2024/02/02/introduce-flashinfer.…
-
```
http://static.electroteque.org.s3.amazonaws.com/download/apple-osmf.zip
Here is the refactored code as a library now with a working example of the m3u8
parsing and multi bitrate setup. I'm not s…
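The refactored library referenced above is an OSMF plugin, but the playlist side of "m3u8 parsing and multi bitrate setup" is language-agnostic. Below is a hedged Python sketch of reading the variant streams from an HLS master playlist; the URL is a placeholder, not the linked archive, and this is only an illustration of the parsing step, not the plugin's code.
```python
# Illustrative sketch: parse variant streams (#EXT-X-STREAM-INF entries)
# from an HLS master playlist to see the available bitrates.
import re
import urllib.request


def parse_master_playlist(text):
    """Return (bandwidth, uri) pairs from a master playlist, sorted by bandwidth."""
    variants = []
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.startswith("#EXT-X-STREAM-INF"):
            match = re.search(r"BANDWIDTH=(\d+)", line)
            bandwidth = int(match.group(1)) if match else 0
            # The variant's URI is the next non-empty, non-comment line.
            for uri in lines[i + 1:]:
                if uri and not uri.startswith("#"):
                    variants.append((bandwidth, uri))
                    break
    return sorted(variants)


if __name__ == "__main__":
    url = "https://example.com/master.m3u8"  # placeholder master playlist URL
    with urllib.request.urlopen(url) as resp:
        playlist = resp.read().decode("utf-8")
    for bandwidth, uri in parse_master_playlist(playlist):
        print(f"{bandwidth:>9} bps -> {uri}")
```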
-
The Qwen2 EAGLE model has now been uploaded to the HF repo. I can't wait to test its performance.
However, it seems that EAGLE's inference framework doesn't support Qwen2; when will it officially be support…
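As an alternative while waiting for official support, vLLM's speculative decoding path can take an EAGLE-style draft model. The sketch below shows how that is typically wired up; both model names are placeholders, whether the Qwen2 EAGLE head is accepted depends on version support (which is what this issue is asking about), and exact flags may differ between vLLM releases.
```python
# Hedged sketch, not a confirmed configuration: EAGLE-style draft head used
# as the speculative model in vLLM. Model names are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",                         # target model (placeholder)
    speculative_model="your-org/EAGLE-Qwen2-7B-Instruct",   # draft head (placeholder)
    num_speculative_tokens=4,
    # Depending on the vLLM version, use_v2_block_manager=True may also be required.
)

outputs = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(outputs[0].outputs[0].text)
```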