-
### Feature request
I would like to request [llama.cpp](https://github.com/ggerganov/llama.cpp) as a new model backend in the transformers library.
### Motivation
llama.cpp offers:
1) Exce…
-
### Feature request
Allow passing a 2D attention mask in `model.forward`.
### Motivation
With this feature, it would be much easier to avoid cross-context contamination during pretraining and super…
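A minimal sketch of the kind of mask this would enable for packed documents; the document lengths and the commented `model(...)` call are illustrative assumptions, not an existing transformers API:
```python
import torch

# Two documents of lengths 3 and 5 packed into a single sequence of length 8.
doc_lengths = [3, 5]
seq_len = sum(doc_lengths)

# Block-diagonal causal mask: tokens attend only to earlier tokens in their own
# document, so packed samples cannot contaminate each other's context.
mask = torch.zeros(seq_len, seq_len, dtype=torch.long)
start = 0
for length in doc_lengths:
    mask[start:start + length, start:start + length] = torch.tril(
        torch.ones(length, length, dtype=torch.long)
    )
    start += length

# Hypothetical call: today `attention_mask` is a (batch, seq_len) vector; the
# request is for `model.forward` to also accept a (batch, seq_len, seq_len)
# matrix like this one.
# outputs = model(input_ids, attention_mask=mask.unsqueeze(0))
print(mask)
```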
-
### System Info
[TensorRT-LLM] TensorRT-LLM version: 0.11.0
Driver Version: 470.199.02
CUDA Version: 12.4
GPU: A800 (1 GPU for the qwen-14b-chat model, 1 GPU for the qwen-0.5b-chat model)
### Who can help?
@k…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
### Checked other resources
- [X] I added a very descriptive title to this issue.
- [X] I searched the LangChain documentation with the integrated search.
- [X] I used the GitHub search to find a sim…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
WARNING 08-27 11:01:10 cuda.py:22] You are using a deprecated `pynvml` package.…
-
Start vLLM with the following command (the version described in the README), fixing the number of blocks at 2048 with a block size of 16 each:
```bash
vllm serve /hestia/model/Qwen2-VL-7B-Instruct-AWQ --quantization awq --num-gpu-blocks-override 2048 --port 8002 --served-model-…
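# Rough capacity check for the values above, assuming vLLM's usual block-based
# KV-cache accounting (not output from the command): 2048 blocks of 16 tokens
# each cap the paged KV cache at 2048 * 16 token slots shared by all sequences.
echo $((2048 * 16))   # 32768 token slots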
-
- [ ] [Guide to choosing quants and engines : r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/comments/1anb2fz/comment/kprbduc/)
# Guide to choosing quants and engines : r/LocalLLaMA
**DESCRIPTIO…
-
The following program encodes that same ASCII string using a naive approach and using actual `UTF8.encode()`. The naive approach is about 3 times faster. Could UTF8 be optimized to provide better pe…
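The program itself is cut off above, so here is a rough Python analogue of the comparison being described: a naive per-character loop over a pure-ASCII string versus the runtime's built-in UTF-8 encoder. The test string and iteration count are assumptions, and the reported 3x figure refers to the reporter's own runtime, not to this sketch.
```python
import timeit

# Pure-ASCII test string, standing in for the one used in the original (truncated) program.
text = "hello, world! " * 1000

def naive_encode(s: str) -> bytes:
    # Naive approach: every code point is ASCII, so each maps to a single byte.
    out = bytearray(len(s))
    for i, ch in enumerate(s):
        out[i] = ord(ch)
    return bytes(out)

def builtin_encode(s: str) -> bytes:
    # The library encoder, analogous to UTF8.encode() in the original report.
    return s.encode("utf-8")

assert naive_encode(text) == builtin_encode(text)

print("naive  :", timeit.timeit(lambda: naive_encode(text), number=1000))
print("builtin:", timeit.timeit(lambda: builtin_encode(text), number=1000))
```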
-
### Your current environment
vllm=0.6.3
### Model Input Dumps
You are using a model of type qwen2_vl to instantiate a model of type . This is not supported for all configurations of models and can …