-
### System Info
- GPU: NVIDIA A30
- TensorRT-LLM: commit [32ed92e](https://github.com/chiendb97/TensorRT-LLM/commit/32ed92e4491baf2d54682a21d247e1948cca996e)
- NVIDIA driver: 535.86.10
- Ubuntu 22.04…
-
### Proposal to improve performance
I use lm-evaluation-harness to test vLLM accuracy.
1. Without speculative decoding enabled, I got the results below:
num_concurrent=1
![image](https://github.com/user-atta…
-
## ❓ General Questions
What is the meaning behind `draft_count`, `accept_count`, and `spec_draft_length`? Thank you in advance!
-
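Hello, and first of all thank you very much for this excellent open-source project! After installing the dependencies and mathlib following the installation instructions, I ran quick_start.py but did not get the expected result: the NN model produces correct output, but the Lean 4 verification has a problem.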
python quick_start.py
Special tokens have been added in the vocabulary, make sure the associated…
-
### Your current environment
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
…
-
### OS
Linux
### GPU Library
CUDA 12.x
### Python version
3.12
### Describe the bug
When a model is loaded inline, it doesn't respect the parameters set in config.yml, such as when loading a mo…
-
I followed the doc: https://github.com/intel-analytics/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_cpu_docker_quickstart.md
My ECS instance is an Aliyun ecs.c8i.24xlarge (https://help.aliyun.com/z…
-
Gibberish is not produced on the previous version with the same request.
### Your current environment
The output of `python collect_env.py`
```plaintext
Collecting environment informatio…
-
### System Info
Hi,
I generated the TensorRT-LLM engine for a Llama-based model and see that the performance is much worse than vLLM.
I did the following:
- compile model with tensorrt llm c…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…