-
To the best of my knowledge, speculative decoding does not change the decoding result when using greedy decoding. However, I noticed that the rouge2 metrics of 'base' and 'essg' may be different in th…
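To illustrate why greedy decoding should be unchanged: under the standard greedy-verification scheme, the target model re-scores every drafted token and keeps a draft token only if it equals the target's own argmax, so the emitted sequence is token-for-token identical to plain greedy decoding. The sketch below is a toy model of that scheme (the two "model" functions are hypothetical stand-ins, not vLLM's implementation):

```python
# Toy sketch of greedy speculative decoding. Both "models" are
# deterministic next-token functions over a token prefix (hypothetical,
# for illustration only).

def target_next(prefix):
    # Hypothetical target model: next token = (sum of prefix) % 5
    return sum(prefix) % 5

def draft_next(prefix):
    # Hypothetical cheaper draft model: agrees with the target most of the time
    return sum(prefix) % 5 if len(prefix) % 3 else (sum(prefix) + 1) % 5

def plain_greedy(prompt, steps):
    out = list(prompt)
    for _ in range(steps):
        out.append(target_next(out))
    return out

def speculative_greedy(prompt, steps, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + steps:
        # Draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Target verifies each position greedily; accept while they match.
        accepted = []
        for i, tok in enumerate(draft):
            t = target_next(out + draft[:i])
            if tok == t:
                accepted.append(tok)
            else:
                accepted.append(t)  # take the target's token and stop
                break
        out.extend(accepted)
    return out[:len(prompt) + steps]

# Greedy verification reproduces plain greedy decoding exactly.
p = [1, 2, 3]
assert speculative_greedy(p, 10) == plain_greedy(p, 10)
```

So if the `rouge2` numbers differ, the divergence is more likely caused by something outside the verification rule itself (sampling settings, batching, or numerical nondeterminism) than by speculative decoding per se.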
-
# Progress
- [x] Implement TPU executor that works on a single TPU chip (without tensor parallelism) #5292
- [ ] Support tensor parallelism for multiple chips in the same host #5871
- [ ] Suppo…
-
Code location: https://github.com/TabbyML/tabby/blob/main/crates/llama-cpp-bindings/src/engine.cc
Reference: https://github.com/ggerganov/llama.cpp/blob/master/examples/speculative/speculative.cpp#L4…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
Are there plans, or an existing way, to support a left-padding KV attention mask? I believe right padding can be supported with the `mha_fwd_kvcache` API via the `seqlens_k_` pointer, but will there be a similar optio…
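For context, the difference between the two paddings amounts to where the valid keys sit in a fixed-size KV buffer. A minimal sketch of the two key-padding masks (the name `seqlens_k` here just echoes the question; this is not flash-attention's actual API):

```python
# Key-padding masks for a padded KV cache of width max_len, given the
# valid KV length of each sequence. Illustrative only.
import numpy as np

max_len = 6
seqlens_k = np.array([3, 5])        # valid KV length per sequence
pos = np.arange(max_len)

# Right padding: valid keys occupy positions [0, len) -> mask = pos < len
right_mask = pos[None, :] < seqlens_k[:, None]

# Left padding: valid keys occupy positions [max_len - len, max_len)
left_mask = pos[None, :] >= (max_len - seqlens_k)[:, None]

print(right_mask.astype(int))
# [[1 1 1 0 0 0]
#  [1 1 1 1 1 0]]
print(left_mask.astype(int))
# [[0 0 0 1 1 1]
#  [0 1 1 1 1 1]]
```

With a per-sequence length pointer only the right-padded case falls out directly; left padding would additionally need a per-sequence start offset (or an equivalent shift of the cache).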
-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x86…
-
When executing the script `examples/offline_inference_with_prefix.py`, it calls `context_attention_fwd` from `vllm.model_executor.layers.triton_kernel.prefix_prefill`, which triggers the following er…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS…
-
### Your current environment
**Note: this is my host environment; the engine itself is running on the latest official Docker image.**
```text
Collecting environment information...
PyTorch version: N/A
Is debug …
-
Hi!
I don't quite understand how this project works. I guess my main question is: what is a draft model?
For example, I would like to speed up the inference of OwlVit (https://huggingface.…