-
### Your current environment
When testing vLLM I noticed that sometimes, when a client makes a request and then terminates abnormally, the request is still shown as running on the vLLM se…
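For context, a minimal way to reproduce the scenario described above might look like the sketch below: open a streaming completion request against a local vLLM OpenAI-compatible server and abandon the connection mid-stream. The endpoint, port, and model name are assumptions for illustration, not details from the original report.

```python
# Hypothetical reproduction sketch: start a streaming request against a local
# vLLM OpenAI-compatible server, then drop the connection before the stream
# finishes. Endpoint, port, and model name are assumptions.
import requests

payload = {
    "model": "facebook/opt-125m",        # any model the server has loaded
    "prompt": "Write a long story about",
    "max_tokens": 2048,
    "stream": True,
}

with requests.post(
    "http://localhost:8000/v1/completions",
    json=payload,
    stream=True,
    timeout=60,
) as resp:
    # Read only the first chunk, then leave the `with` block, which closes the
    # underlying socket and simulates an abnormal client termination.
    next(resp.iter_lines())

# After this point the client is gone; per the report, the request can still
# show up as running on the server side.
```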
-
### Description
Hi everyone!
I tried to reproduce the code from https://github.com/triton-inference-server/fastertransformer_backend/blob/dev/t5_gptj_blog/notebooks/GPT-J_and_T5_inference.i…
-
Hello,
I am using a lot of ensemble models in production, and the biggest pain point I have is that in TensorRT it is impossible to index a tensor when the index is itself an input.
Hence to bypass thi…
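One general workaround people often reach for in this situation, sketched below under the assumption that the indexing can be expressed as a row lookup, is to replace Python-style indexing with an explicit gather, which exports to ONNX `Gather` and maps onto TensorRT's gather layer. The module, shapes, and file name are illustrative, not taken from the original post.

```python
# Sketch: replace tensor[index] (with index supplied at runtime) by an explicit
# gather, which exports to ONNX Gather and is supported by TensorRT.
import torch


class GatherByInput(torch.nn.Module):
    def forward(self, table: torch.Tensor, index: torch.Tensor) -> torch.Tensor:
        # table: (N, D) lookup tensor
        # index: (B,)   integer indices provided as a separate network input
        return torch.index_select(table, dim=0, index=index)


model = GatherByInput()
table = torch.randn(10, 4)
index = torch.tensor([1, 3, 7])

torch.onnx.export(
    model,
    (table, index),
    "gather_by_input.onnx",
    input_names=["table", "index"],
    output_names=["rows"],
    dynamic_axes={"index": {0: "batch"}, "rows": {0: "batch"}},
)
```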
-
**Describe the bug**
I get the error *Cannot load `gptq` weight for GPTQ -> Marlin repacking, make sure the model is already quantized* when I run inference on the GPTQ-quantized model DeepSeekCoderV2 with T…
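A workaround that is sometimes suggested for this class of error, shown as a sketch below, is to pin the quantization method so vLLM uses the plain GPTQ kernels instead of attempting the GPTQ -> Marlin repacking. The checkpoint id is a placeholder assumption, and whether this sidesteps the error for this particular model is not confirmed by the original report.

```python
# Sketch: force vLLM to use the plain GPTQ kernels rather than repacking the
# weights for Marlin. The model id below is a hypothetical placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Instruct-GPTQ",  # placeholder checkpoint id
    quantization="gptq",       # pin the method so no GPTQ -> Marlin repacking is attempted
    trust_remote_code=True,
)

outputs = llm.generate(
    ["def quicksort(arr):"],
    SamplingParams(max_tokens=128, temperature=0.0),
)
print(outputs[0].outputs[0].text)
```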
-
gpu-rest-engine-master$ nvidia-docker run --name=server --net=host --rm inference_server
2018/09/18 02:31:30 Initializing TensorRT classifiers
I am just trying to get the TensorRT server started a…
-
Here are the details of a major change that I wish to implement in the library.
Currently, meteorite takes in HTTP requests, passes the data into the callback and sends the HTTP response back to the…
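For readers unfamiliar with the flow described above, a rough pseudo-implementation of it is sketched below. The names and structure are entirely hypothetical and do not reflect meteorite's actual API; the sketch only illustrates "receive HTTP request, hand the data to a callback, write the callback's result back as the response."

```python
# Entirely hypothetical sketch of the described flow: receive an HTTP request,
# pass the body to a user-supplied callback, and send its return value back
# as the HTTP response.
from http.server import BaseHTTPRequestHandler, HTTPServer


def make_handler(callback):
    class Handler(BaseHTTPRequestHandler):
        def do_POST(self):
            body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
            response = callback(body)  # user callback produces the response bytes
            self.send_response(200)
            self.send_header("Content-Length", str(len(response)))
            self.end_headers()
            self.wfile.write(response)
    return Handler


def echo(data: bytes) -> bytes:
    return data


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), make_handler(echo)).serve_forever()
```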
-
## Use Case
Onboarding Seldon Core as a possible inference server makes sense alongside MLflow, Kubeflow, and Grafana/Prometheus. This combination can either run standalone or e.g. in combination…
-
**Describe the bug**
When a model fails to register because of a network error, re-registering it makes `sllm-server` report that the model is already registered, and the model cannot be removed using `sllm-cli …
-
### Priority
P2-High
### OS type
Ubuntu
### Hardware type
Xeon-SPR
### Installation method
- [X] Pull docker images from hub.docker.com
- [ ] Build docker images from source
#…
-
Hi, during request streaming it would be helpful to have a flag indicating the end of generation. Can you help with this feature request?
I believe that means returning the bool flag from https://github.…
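To make the request concrete, a minimal sketch of what such a flag could look like on the wire is shown below. The field names, chunk framing, and token source are assumptions for illustration, not the project's actual streaming API.

```python
# Hypothetical sketch of a streaming response where every chunk carries a
# boolean "finished" flag, so the client knows when generation has ended.
import json
from typing import Iterable, Iterator


def stream_chunks(tokens: Iterable[str]) -> Iterator[str]:
    tokens = list(tokens)
    for i, token in enumerate(tokens):
        yield json.dumps({
            "text": token,
            "finished": i == len(tokens) - 1,  # True only on the last chunk
        }) + "\n"


if __name__ == "__main__":
    for line in stream_chunks(["Hello", " world", "!"]):
        print(line, end="")
```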