-
### Your current environment
```
(vllm-gptq) root@k8s-master01:/workspace/home/lich/QuIP-for-all# pip3 list | grep aphrodite
aphrodite-engine 0.5.3 /workspace/home/lich/aphrodite-eng…
-
### Your current environment
The output of `python collect_env.py`
```text
PyTorch version: 2.4.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N…
-
### Proposal to improve performance
Test the new Medusa speculative sampling feature with [vllm v0.5.2](vllm-openai:v0.5.2).
After using Medusa speculative sampling, the performance dropped significantl…
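A minimal sketch of how Medusa speculative decoding might be enabled in vLLM v0.5.x for this kind of test; the model path, Medusa checkpoint, and tuning values are placeholders (the report does not name them), and exact argument names may differ between releases:

```python
from vllm import LLM, SamplingParams

# Placeholders: the original report does not name the target model or the Medusa head checkpoint.
llm = LLM(
    model="/path/to/target-model",              # base model being accelerated (placeholder)
    speculative_model="/path/to/medusa-heads",  # Medusa head checkpoint (placeholder)
    num_speculative_tokens=5,                   # tokens proposed per step; tune per workload
    use_v2_block_manager=True,                  # speculative decoding in v0.5.x expects the v2 block manager
    gpu_memory_utilization=0.9,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Explain speculative decoding in one paragraph."], params)
print(outputs[0].outputs[0].text)
```

Comparing throughput of this configuration against the same `LLM` without the `speculative_*` arguments would isolate the regression the report describes.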
-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.3 LTS (x86_64)
GCC …
-
### Error
```
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[6], line 1
----…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
I would like to use features such as the Multi-Instance Support offered by the tensorrt-llm backend. In the documentation, I can see that multiple models are served using modes such as Leader mode and …
-
### OS
Linux
### GPU Library
CUDA 12.x
### Python version
3.11
### Describe the bug
When running exllamav2's inference_speculative.py example with Llama 3.1 8B 2.25bpw as the draft model and 70B 4.5bpw a…
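For context, a rough sketch of the speculative setup that example follows; model directories and sequence lengths here are placeholders, and constructor arguments may vary across exllamav2 versions:

```python
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2DynamicGenerator

# Target model (placeholder path for a 70B 4.5bpw EXL2 quant)
config = ExLlamaV2Config("/models/llama-3.1-70b-4.5bpw-exl2")
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, max_seq_len=8192, lazy=True)
model.load_autosplit(cache, progress=True)
tokenizer = ExLlamaV2Tokenizer(config)

# Draft model (placeholder path for an 8B 2.25bpw EXL2 quant)
draft_config = ExLlamaV2Config("/models/llama-3.1-8b-2.25bpw-exl2")
draft_model = ExLlamaV2(draft_config)
draft_cache = ExLlamaV2Cache(draft_model, max_seq_len=8192, lazy=True)
draft_model.load_autosplit(draft_cache, progress=True)

# Dynamic generator with speculative decoding: the draft model proposes tokens,
# the target model verifies them.
generator = ExLlamaV2DynamicGenerator(
    model=model,
    cache=cache,
    draft_model=draft_model,
    draft_cache=draft_cache,
    tokenizer=tokenizer,
)

print(generator.generate(prompt="Once upon a time,", max_new_tokens=200))
```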
-
### System Info
Hardware: L20
Version: 0.11.0.dev20240625
Model: Bloom7b1
### Who can help?
@ncomly-nvidia @byshiue
I have obtained the Medusa head for Bloom according to the official M…
-
# 🐛 Bug
from vllm import LLM, SamplingParams
llm = LLM(model=model_dir, enforce_eager=True)
then
```
File d:\my\env\python3.10.10\lib\site-packages\xformers\ops\fmha\_triton\splitk_kernels.…