-
### Priority
Undecided
### OS type
Ubuntu
### Hardware type
Xeon-SPR
### Installation method
- [X] Pull docker images from hub.docker.com
- [ ] Build docker images from source
### Deploy metho…
-
### System Info
Built tensorrtllm_backend from source using dockerfile/Dockerfile.trt_llm_backend
tensorrt_llm 0.13.0.dev2024081300
tritonserver 2.48.0
triton image: 24.07
Cuda 12.5
### Wh…
-
### System Info
Hello TensorRT-LLM team! 👋 I'm facing an issue where the inference output does not contain the expected "Singapore" text. Below are the details of my setup and steps to reproduce the …
-
`top` reports 100% single-core CPU usage when inferring LLMs both with exllamav2 and llama.cpp.
Digging in with `perf`, it seems this load is coming from `libhsa-runtime64`, specifically from a sing…
-
Hello,
Similar to #3, I've tried reproducing the `demo.py` benchmark on an H100 and an A6000, and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned this is du…
-
Could anyone advise whether it is possible to run inference with OVIS 1.6 on a single 4090 GPU? After loading the model, it appears to consume approximately 20 GB of VRAM. I attempted an inference, b…
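A quick back-of-envelope check of the headroom involved (assuming the RTX 4090's 24 GB of VRAM, and that the reported ~20 GB is weights alone; the helper name is illustrative, not from any library):

```python
# Assumption: RTX 4090 has 24 GB of VRAM; ~20 GB reported is model weights.
# Whatever remains must hold activations and the KV cache during inference.
def vram_headroom_gb(total_gb: float, model_gb: float) -> float:
    """GB remaining after the model weights are resident."""
    return total_gb - model_gb

print(vram_headroom_gb(24.0, 20.0))  # 4.0 GB left for everything else
```

With only ~4 GB of headroom, even a modest context length can exhaust memory, which would be consistent with the failure described.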
-
Hello:
I noticed that the yaml contains an `llm_model` entry, which your documentation says refers to the llm_model checkpoint, but `/data/llava-v1.5-7b` appears to be a folder. Judging from the code, both the tokenizer and the LLM checkpoint seem to be required:
self.llm_tokenizer = LlamaTokenizer.from_pretrained(llm_model, use_fast=False, truncati…
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as o…
-
```
llm = LLM('/app/models/tensorrt_llm', skip_tokenizer_init=True)
sampling_params = SamplingParams(end_id=2, return_context_logits=True, max_new_tokens=1)
results = llm.generate([[32, 12, 24, 54, 6, …
```
-
**Describe the bug**
When generating responses with a local LLM, cortex-cpp still appears to use the CPU.
https://discord.com/channels/1107178041848909847/1149558035971321886/1253148982188838954
**To …