-
To the best of my knowledge, speculative decoding does not change the decoding result when using greedy decoding. However, I noticed that the rouge2 metrics of 'base' and 'essg' may be different in th…
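To illustrate why greedy decoding should be unchanged: under the standard greedy-verification scheme, the target model re-scores every drafted token and keeps a draft token only if it equals the target's own argmax, so the emitted sequence is token-for-token identical to plain greedy decoding. The sketch below is a toy model of that scheme (the two "model" functions are hypothetical stand-ins, not vLLM's implementation):

```python
# Toy sketch of greedy speculative decoding. Both "models" are
# deterministic next-token functions over a token prefix (hypothetical,
# for illustration only).

def target_next(prefix):
    # Hypothetical target model: next token = (sum of prefix) % 5
    return sum(prefix) % 5

def draft_next(prefix):
    # Hypothetical cheaper draft model: agrees with the target most of the time
    return sum(prefix) % 5 if len(prefix) % 3 else (sum(prefix) + 1) % 5

def plain_greedy(prompt, steps):
    out = list(prompt)
    for _ in range(steps):
        out.append(target_next(out))
    return out

def speculative_greedy(prompt, steps, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + steps:
        # Draft proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Target verifies each position greedily; accept while they match.
        accepted = []
        for i, tok in enumerate(draft):
            t = target_next(out + draft[:i])
            if tok == t:
                accepted.append(tok)
            else:
                accepted.append(t)  # take the target's token and stop
                break
        out.extend(accepted)
    return out[:len(prompt) + steps]

# Greedy verification reproduces plain greedy decoding exactly.
p = [1, 2, 3]
assert speculative_greedy(p, 10) == plain_greedy(p, 10)
```

So if the `rouge2` numbers differ, the divergence is more likely caused by something outside the verification rule itself (sampling settings, batching, or numerical nondeterminism) than by speculative decoding per se.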
-
# Progress
- [x] Implement TPU executor that works on a single TPU chip (without tensor parallelism) #5292
- [ ] Support tensor parallelism for multiple chips in the same host #5871
- [ ] Suppo…
-
Code location: https://github.com/TabbyML/tabby/blob/main/crates/llama-cpp-bindings/src/engine.cc
Reference: https://github.com/ggerganov/llama.cpp/blob/master/examples/speculative/speculative.cpp#L4…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
Are there plans, or an existing way, to support a left-padding KV attention mask? I believe right padding can be supported with the `mha_fwd_kvcache` API via the `seqlens_k_` pointer, but will there be a similar optio…
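For context, the difference between the two paddings amounts to where the valid keys sit in a fixed-size KV buffer. A minimal sketch of the two key-padding masks (the name `seqlens_k` here just echoes the question; this is not flash-attention's actual API):

```python
# Key-padding masks for a padded KV cache of width max_len, given the
# valid KV length of each sequence. Illustrative only.
import numpy as np

max_len = 6
seqlens_k = np.array([3, 5])        # valid KV length per sequence
pos = np.arange(max_len)

# Right padding: valid keys occupy positions [0, len) -> mask = pos < len
right_mask = pos[None, :] < seqlens_k[:, None]

# Left padding: valid keys occupy positions [max_len - len, max_len)
left_mask = pos[None, :] >= (max_len - seqlens_k)[:, None]

print(right_mask.astype(int))
# [[1 1 1 0 0 0]
#  [1 1 1 1 1 0]]
print(left_mask.astype(int))
# [[0 0 0 1 1 1]
#  [0 1 1 1 1 1]]
```

With a per-sequence length pointer only the right-padded case falls out directly; left padding would additionally need a per-sequence start offset (or an equivalent shift of the cache).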
-
### Your current environment
```text
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 11 (bullseye) (x86…
-
When executing the script `examples/offline_inference_with_prefix.py`, it calls `context_attention_fwd` from `vllm.model_executor.layers.triton_kernel.prefix_prefill`, which triggers the following er…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS…
-
### Your current environment
**Note: this is my host environment; the engine itself is running on the latest official Docker image.**
```text
Collecting environment information...
PyTorch version: N/A
Is debug …
-
Hi!
I don't quite understand how this project works. I guess my main question is: what is a draft model?
For example, I would like to speed up the inference of OwlVit (https://huggingface.…