-
As per the MXNet inference doc, the main dispatcher thread is single-threaded. https://cwiki.apache.org/confluence/display/MXNET/Parallel+Inference+in+MXNet
**How does MXNet Model Server handle multipl…
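Since the dispatcher itself is single-threaded, one common way to keep several requests moving is to give each worker its own model instance. Below is a minimal sketch of that pattern using a Gluon model-zoo network and plain `multiprocessing` queues; it is an illustration of the idea only, not MXNet Model Server internals.

```python
# Hedged sketch (not MXNet Model Server code): each worker process owns its
# own model, so a single-threaded dispatcher in any one engine instance does
# not serialize all inference.
import multiprocessing as mp

import mxnet as mx
import numpy as np
from mxnet.gluon.model_zoo import vision


def worker(requests, results):
    ctx = mx.cpu()
    net = vision.resnet18_v1(pretrained=True, ctx=ctx)  # one model per process
    net.hybridize(static_alloc=True)
    while True:
        req_id, batch = requests.get()
        if req_id is None:  # shutdown sentinel
            break
        out = net(mx.nd.array(batch, ctx=ctx))
        results.put((req_id, out.asnumpy()))


if __name__ == "__main__":
    requests, results = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(requests, results)) for _ in range(4)]
    for w in workers:
        w.start()
    for i in range(8):  # pretend these are incoming inference requests
        requests.put((i, np.zeros((1, 3, 224, 224), dtype="float32")))
    for _ in range(8):
        print("request", results.get()[0], "done")
    for _ in workers:
        requests.put((None, None))
    for w in workers:
        w.join()
```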
-
### System Info
I tried the following systems, both with the same exception:
- ghcr.io/huggingface/text-generation-inference:sha-6aebf44 locally with Docker on an NVIDIA RTX 3600
- ghcr.io/huggingface…
-
Hello, I have launched opt-125M inference and send requests to the server with Locust, but whatever I configure for max_batch_size, the InferenceEngine always runs with batch_size = 1. How can I use the dynam…
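One thing worth checking before blaming max_batch_size: dynamic batching can only group requests that are in flight at the same time, so a single sequential client will always produce batch_size = 1. Below is a minimal sketch of driving the endpoint concurrently; the URL and payload are placeholders, not the server's actual API.

```python
# Hedged sketch: fire many requests concurrently so the server's dynamic
# batcher has more than one pending request to group into a batch.
import concurrent.futures

import requests

URL = "http://localhost:8080/v1/completions"      # placeholder endpoint
PAYLOAD = {"prompt": "hello", "max_tokens": 16}   # placeholder request body


def send(_):
    return requests.post(URL, json=PAYLOAD, timeout=60).status_code


with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    codes = list(pool.map(send, range(256)))

print(codes.count(200), "successful requests")
```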
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [ ] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
paddle-serving-app 0.9.0
paddle-serving-client 0.9.0
paddle-serving-server-gpu 0.9.0.post112
paddlepaddle-gpu 2.6.0.post112
```
Traceback (most recent call last):
File "/mnt…
```
-
When I run models_server.py on AWS, I get OSError: [Errno 99] Cannot assign requested address.
How can I deploy the service on the cloud server? I have downloaded all the models on the cloud instance.
And if I set config…
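For reference, Errno 99 usually means the process is trying to bind to an IP address the machine does not own; on a cloud instance the public IP is typically not assigned to any local interface, so binding to 0.0.0.0 (all interfaces) is the usual workaround. A small reproduction with plain sockets (the public IP below is a documentation example, not a real instance address):

```python
# Binding to an address the host does not own fails with Errno 99 on Linux,
# while 0.0.0.0 (listen on all interfaces) succeeds.
import socket


def try_bind(host, port=8080):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        print(f"bound to {host}:{port}")
    except OSError as e:
        print(f"bind to {host}:{port} failed: {e}")
    finally:
        s.close()


try_bind("0.0.0.0")        # all interfaces: works
try_bind("203.0.113.10")   # example public IP not assigned to this host: Errno 99
```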
-
"I have tried to run it locally with an M-Series mac but the image is crashing as soon as I perform a request.
Tested against an Ollama model served locally as well as a Granite model served on MaaS"
…
-
Hi,
I'm using MLServer with KServe, and found that their gRPC proto descriptors collide:
```
File ~/.cache/pypoetry/virtualenvs/example-mlflow-lZ2hGP5g-py3.10/lib/python3.10/…
```
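For context, this kind of collision typically happens because protobuf refuses to register two descriptor files that share a file name but differ in content in the same pool, and both MLServer and KServe ship generated stubs for the v2 inference protocol. A hedged reproduction of the mechanism with a standalone pool (not MLServer or KServe code; the file name and packages below are made up):

```python
# Registering two FileDescriptorProtos with the same name but different
# contents raises; the exact exception type/message depends on whether the
# pure-Python or C++ protobuf implementation is active.
from google.protobuf import descriptor_pb2, descriptor_pool

pool = descriptor_pool.DescriptorPool()

first = descriptor_pb2.FileDescriptorProto(
    name="grpc_predict_v2.proto", package="inference")
pool.Add(first)

second = descriptor_pb2.FileDescriptorProto(
    name="grpc_predict_v2.proto", package="inference.v2")  # same name, different content
try:
    pool.Add(second)
except Exception as e:  # implementation-dependent exception type
    print(type(e).__name__, e)
```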
-
Is warmup supported for the `tensorrtllm_backend`? If so, it would be nice to have an example of how to upload LoRA adapters as a warmup step.
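Triton itself configures warmup through the `model_warmup` field in `config.pbtxt`, with data files read from a `warmup/` directory inside the model directory; whether `tensorrtllm_backend` honors it for LoRA tensors is exactly what this issue asks. Below is a hedged sketch of what such an entry might look like, assuming the deployed `tensorrt_llm` model exposes `input_ids`, `request_output_len`, `lora_task_id`, `lora_weights`, and `lora_config` inputs; take the real names, dtypes, and dims from your own config.pbtxt.

```
model_warmup [
  {
    name: "lora_adapter_warmup"
    batch_size: 1
    count: 1
    inputs {
      key: "input_ids"
      value { data_type: TYPE_INT32 dims: [ 8 ] zero_data: true }
    }
    inputs {
      key: "request_output_len"
      # zero_data sends 0 here; switch to an input_data_file with a small
      # real value if the backend rejects a zero output length
      value { data_type: TYPE_INT32 dims: [ 1 ] zero_data: true }
    }
    inputs {
      key: "lora_task_id"
      value { data_type: TYPE_UINT64 dims: [ 1 ] input_data_file: "lora_task_id" }
    }
    inputs {
      key: "lora_weights"
      # placeholder shape; raw adapter weights serialized into
      # <model_dir>/warmup/lora_weights
      value { data_type: TYPE_FP16 dims: [ 128, 1024 ] input_data_file: "lora_weights" }
    }
    inputs {
      key: "lora_config"
      value { data_type: TYPE_INT32 dims: [ 128, 3 ] input_data_file: "lora_config" }
    }
  }
]
```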