-
What are some of the intended use cases for the 0.5B model?
There are not many other similarly sized models, nor is there much hype around them, though the general audience seems to love th…
-
### Your current environment
[root@localhost wangjianqiang]# python -m vllm.entrypoints.openai.api_server --model /root/wangjianqiang/deepseek-moe/deepseek-coder-33b-base/ --tensor-parallel-size 8 …
-
Pull this branch of vLLM: https://github.com/fyabc/vllm/tree/add_qwen2_vl_new
Environment: Ubuntu + Python 3.10
Error message:
python -m vllm.entrypoints.openai.api_server --served-model-name Qwen2-VL-7B-Instruct --model /home/…
-
I want to develop some features based on SGLang to improve the performance of SRT.
1. A new ControllerMulti scheduler that can more accurately identify the resource utilization of each instance a…
-
- [ ] [At the Intersection of LLMs and Kernels - Research Roundup](https://charlesfrye.github.io/programming/2023/11/10/llms-systems.html)
# At the Intersection of LLMs and Kernels - Research Roundup…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A…
-
I am starting this issue to do more thorough benchmarking than the [notebooks](/notebooks) used in the repo.
What should we measure?
1. Time for generation
2. Max GPU VRAM
3. Accuracy
Hardw…
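For the first two metrics above, a minimal sketch of how they could be measured is shown below. The `benchmark` helper and its parameters are hypothetical, not taken from the repo's notebooks; peak GPU VRAM would additionally be read via `torch.cuda.max_memory_allocated()` after `torch.cuda.reset_peak_memory_stats()` when running on CUDA.

```python
import time

def benchmark(fn, *args, warmup=1, iters=3, **kwargs):
    """Hypothetical timing helper: run `fn` a few times and report the best wall-clock time.

    For GPU workloads, call torch.cuda.synchronize() before each timestamp and
    read torch.cuda.max_memory_allocated() afterwards for peak VRAM.
    """
    # Warm-up runs to exclude one-time costs (compilation, cache population).
    for _ in range(warmup):
        fn(*args, **kwargs)
    times = []
    out = None
    for _ in range(iters):
        t0 = time.perf_counter()
        out = fn(*args, **kwargs)
        times.append(time.perf_counter() - t0)
    # Minimum over iterations is a common low-noise latency estimate.
    return out, min(times)
```

Accuracy would be measured separately against a fixed evaluation set, since it depends on the task rather than the runtime.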
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
### Motivation
KV cache hit rate is probably the biggest performance factor for me, and I recently read:
https://research.character.ai/optimizing-inference/
> To solve this problem, we deve…
-
Thanks for your great work! Could you please add our paper "Parallel Speculative Decoding with Adaptive Draft Length" to your resources? I have attached a link to our blog and codebase below for yo…