-
# Inappropriate inferences will be used when calculating the forecast
## Summary
Inappropriate inferences will be used when calculating the forecast due to not saving filtered resu…
-
Since the ingressroute (https://github.com/triton-inference-server/server/blob/main/deploy/k8s-onprem/templates/ingressroute.yaml) has been deployed as an LB to balance requests across all Triton pods. H…
-
I am attempting to use FlexFlow to compare its inference speed against vLLM, but FlexFlow appears to be an order of magnitude slower than vLLM, and I've been running into many errors. Testing on a Linux ser…
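For reference, here is a minimal sketch of the vLLM side of such a throughput comparison; the model id, prompt set, and sampling settings are placeholders, not taken from the original report:

```
import time
from vllm import LLM, SamplingParams

# Illustrative benchmark of the vLLM side only; the model id and sampling
# settings below are assumptions, not values from the original report.
prompts = ["Explain how a transformer decoder works."] * 32
sampling = SamplingParams(temperature=0.0, max_tokens=128)

llm = LLM(model="tiiuae/falcon-7b-instruct")  # any Hugging Face model id
start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

generated_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated_tokens / elapsed:.1f} generated tokens/s")
```

Running the same prompt set and generation length through both engines is what makes an "order of magnitude slower" claim comparable.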
-
Summary
I would like to propose the addition of constrained decoding support. This feature would allow the output sequence to be constrained by a Finite State Machine (FSM) or Context-Free Grammar (C…
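To make the request concrete, here is a toy sketch of FSM-constrained decoding via per-step logit masking. The vocabulary, FSM, and `fake_logits` stand-in are all illustrative; in practice the FSM would be compiled from a regex or grammar over the real tokenizer's vocabulary (as libraries such as Outlines do), and the mask would be applied to the model's logits before sampling.

```
import math
import random

# Toy FSM that only accepts "yes" or "no" followed by <eos>. Everything here
# is a stand-in to show where the mask is applied during decoding.
VOCAB = ["yes", "no", "maybe", "<eos>"]
FSM = {
    "start": {"yes": "answered", "no": "answered"},
    "answered": {"<eos>": "done"},
}

def fake_logits(_prefix):
    # Stand-in for the model's next-token scores.
    return [random.uniform(-1.0, 1.0) for _ in VOCAB]

def constrained_decode():
    state, prefix = "start", []
    while state != "done":
        logits = fake_logits(prefix)
        allowed = FSM[state]
        # Mask every token the FSM does not allow from the current state.
        masked = [l if tok in allowed else -math.inf
                  for tok, l in zip(VOCAB, logits)]
        next_tok = VOCAB[masked.index(max(masked))]
        prefix.append(next_tok)
        state = allowed[next_tok]
    return prefix

print(constrained_decode())  # e.g. ['yes', '<eos>'] -- never 'maybe'
```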
-
It would be nice if we could configure the base URL; then people could use offline models via [ollama](https://ollama.com/) or similar tools.
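For illustration, a minimal sketch of what a configurable base URL enables, assuming ollama's OpenAI-compatible endpoint at `http://localhost:11434/v1` and a locally pulled model named `llama3` (both assumptions, not part of the original request):

```
from openai import OpenAI

# Point an OpenAI-compatible client at a local ollama instance instead of the
# hosted API. The URL and model name are assumptions: ollama exposes an
# OpenAI-compatible endpoint under /v1, and "llama3" must already be pulled.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Say hello from an offline model."}],
)
print(resp.choices[0].message.content)
```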
-
I want to deploy a few open source models with the chat UI. I started a simple model with:
```
model=tiiuae/falcon-7b-instruct
volume=$PWD/data # share a volume with the Docker container to avoid…
```
-
Hi,
I am a member of the DeepFaune team and I saw that you are using our model and that you converted it to OpenVINO.
Do you have any figures for the speed-up it offers?
-
### Feature request/question
Expose an ENV variable/flag in `lorax-server` and `lorax-launcher` to set the base path of the adapter during inference.
We tried a workaround by setting HUGGINGFACE_HUB_C…
-
The current `Cluster` deployment only allows inference servers to be deployed on GPUs [see here](https://github.com/fmperf-project/fmperf/blob/b7ae68125724d3c63563fd84eebba7eee347e27f/fmperf/Cluster.py#L13…
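As a sketch of what the requested flexibility might look like, here is a hypothetical device-agnostic resource helper; the function and field names are illustrative and are not fmperf's actual API:

```
# Hypothetical sketch of a device-agnostic deployment helper; the function and
# field names are illustrative, not fmperf's actual API.
def resource_spec(device: str = "gpu", count: int = 1) -> dict:
    """Build container resource limits for a GPU or CPU-only inference server."""
    if device == "gpu":
        return {"limits": {"nvidia.com/gpu": count}}
    if device == "cpu":
        # CPU-only inference servers just request cores and memory.
        return {"limits": {"cpu": str(4 * count), "memory": "16Gi"}}
    raise ValueError(f"unsupported device: {device!r}")

print(resource_spec("cpu"))
```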
-
I'm using the nvcr.io/nvidia/tritonserver:23.10-py3 container for my inference workload, via the C++ gRPC API. There are several models in the container: a YOLOv8-like architecture in TensorRT plus a few TorchScript model…
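The report uses the C++ gRPC API; for illustration only, here is an equivalent single request through Triton's Python gRPC client, with the model name, tensor names, and shapes as placeholders:

```
import numpy as np
import tritonclient.grpc as grpcclient

# Minimal single-inference request against a Triton server; the model name,
# tensor names, and shapes are placeholders, not taken from the original setup.
client = grpcclient.InferenceServerClient(url="localhost:8001")

image = np.random.rand(1, 3, 640, 640).astype(np.float32)
inp = grpcclient.InferInput("images", list(image.shape), "FP32")
inp.set_data_from_numpy(image)
out = grpcclient.InferRequestedOutput("output0")

result = client.infer(model_name="yolov8_trt", inputs=[inp], outputs=[out])
print(result.as_numpy("output0").shape)
```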