-
### Feature request
I would like to request [llama.cpp](https://github.com/ggerganov/llama.cpp) as a new model backend in the transformers library.
### Motivation
llama.cpp offers:
1) Exce…
-
### Feature request
Allow passing a 2D attention mask in `model.forward`.
### Motivation
With this feature, it would be much easier to avoid cross-context contamination during pretraining and super…
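A minimal sketch of the kind of mask this would enable for packed documents; the document lengths and the commented `model(...)` call are illustrative assumptions, not an existing transformers API:
```python
import torch

# Two documents of lengths 3 and 5 packed into a single sequence of length 8.
doc_lengths = [3, 5]
seq_len = sum(doc_lengths)

# Block-diagonal causal mask: tokens attend only to earlier tokens in their own
# document, so packed samples cannot contaminate each other's context.
mask = torch.zeros(seq_len, seq_len, dtype=torch.long)
start = 0
for length in doc_lengths:
    mask[start:start + length, start:start + length] = torch.tril(
        torch.ones(length, length, dtype=torch.long)
    )
    start += length

# Hypothetical call: today `attention_mask` is a (batch, seq_len) vector; the
# request is for `model.forward` to also accept a (batch, seq_len, seq_len)
# matrix like this one.
# outputs = model(input_ids, attention_mask=mask.unsqueeze(0))
print(mask)
```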
-
### System Info
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Driver Version: 470.199.02
CUDA Version: 12.4
GPU: A800 (1 GPU for the qwen-14b-chat model, 1 GPU for the qwen-0.5b-chat model)
### Who can help?
@k…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
### Checked other resources
- [X] I added a very descriptive title to this issue.
- [X] I searched the LangChain documentation with the integrated search.
- [X] I used the GitHub search to find a sim…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
WARNING 08-27 11:01:10 cuda.py:22] You are using a deprecated `pynvml` package.…
-
Start vLLM with the following command (the version described in the README), fixing the number of blocks at 2048 with a block size of 16 each:
```bash
vllm serve /hestia/model/Qwen2-VL-7B-Instruct-AWQ --quantization awq --num-gpu-blocks-override 2048 --port 8002 --served-model-…
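# Rough capacity check for the values above, assuming vLLM's usual block-based
# KV-cache accounting (not output from the command): 2048 blocks of 16 tokens
# each cap the paged KV cache at 2048 * 16 token slots shared by all sequences.
echo $((2048 * 16))   # 32768 token slots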
-
- [ ] [Guide to choosing quants and engines : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/comment/kprbduc/)
# Guide to choosing quants and engines : r/LocalLLaMA
**DESCRIPTIO…
-
The following program encodes that same ASCII string using a naive approach and using actual `UTF8.encode()`. The naive approach is about 3 times faster. Could UTF8 be optimized to provide better pe…
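The program itself is cut off above, so here is a rough Python analogue of the comparison being described: a naive per-character loop over a pure-ASCII string versus the runtime's built-in UTF-8 encoder. The test string and iteration count are assumptions, and the reported 3x figure refers to the reporter's own runtime, not to this sketch.
```python
import timeit

# Pure-ASCII test string, standing in for the one used in the original (truncated) program.
text = "hello, world! " * 1000

def naive_encode(s: str) -> bytes:
    # Naive approach: every code point is ASCII, so each maps to a single byte.
    out = bytearray(len(s))
    for i, ch in enumerate(s):
        out[i] = ord(ch)
    return bytes(out)

def builtin_encode(s: str) -> bytes:
    # The library encoder, analogous to UTF8.encode() in the original report.
    return s.encode("utf-8")

assert naive_encode(text) == builtin_encode(text)

print("naive  :", timeit.timeit(lambda: naive_encode(text), number=1000))
print("builtin:", timeit.timeit(lambda: builtin_encode(text), number=1000))
```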
-
### Your current environment
vllm=0.6.3
### Model Input Dumps
You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can …