-
### Priority
Undecided
### OS type
Ubuntu
### Hardware type
Gaudi2
### Installation method
- [X] Pull docker images from hub.docker.com
- [ ] Build docker images from source
### Deploy method
…
-
## Motivation
WasmEdge is a lightweight inference runtime for AI and LLM applications. The [LlamaEdge project](https://github.com/LlamaEdge) has developed an [OpenAI-compatible API server](https://gi…
-
Hi, can someone please help me build and use the ONNX Runtime Server on Windows with gRPC and HTTP support?
I have made a C++ API which takes an image as input and uses an ONNX model for i…
-
Model launch command: python -m vllm.entrypoints.openai.api_server --served-model-name qwen2-7b-instruct --model /app/Qwen2-7B-Instruct --gpu-memory-utilization 0.9
Evaluation command: swift eval --eval_url http://127.0.0.1:8000/…
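Since the server above exposes an OpenAI-compatible API, a client can talk to it with a plain HTTP POST. A minimal sketch of the request body (the model name matches `--served-model-name` from the launch command; the prompt and sampling parameters are illustrative assumptions):

```python
import json

def build_chat_request(prompt: str, model: str = "qwen2-7b-instruct") -> str:
    """Build the JSON body for an OpenAI-compatible /v1/chat/completions call.

    This only constructs the payload; POST it to
    http://127.0.0.1:8000/v1/chat/completions with any HTTP client.
    """
    payload = {
        "model": model,  # must match --served-model-name on the server
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 128,      # illustrative values
        "temperature": 0.0,
    }
    return json.dumps(payload)

body = build_chat_request("Hello!")
```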
-
### Feature request
I would like to be able to use Guidance or other libraries that support constrained output with HF endpoints.
Reference: [A guidance language for controlling large language m…
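The core idea behind such constrained-decoding libraries can be illustrated with a toy, self-contained sketch (the function and names below are hypothetical, not part of any library's API): at each step, only vocabulary tokens that keep the partial output on track toward an allowed completion are permitted, similar to a guidance-style `select` constraint.

```python
def allowed_next_tokens(prefix: str, vocab: list[str], choices: list[str]) -> list[str]:
    """Toy constrained decoding: return the vocab tokens that, appended to
    `prefix`, still leave it a prefix of at least one allowed completion."""
    ok = []
    for tok in vocab:
        candidate = prefix + tok
        if any(choice.startswith(candidate) for choice in choices):
            ok.append(tok)
    return ok

# With choices restricted to "yes"/"no", tokens leading elsewhere are masked out.
vocab = ["yes", "no", "maybe", "y", "n"]
result = allowed_next_tokens("", vocab, choices=["yes", "no"])
```

A real implementation applies this mask to the model's logits before sampling; this sketch only shows the filtering logic.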
-
## Is your feature request related to a problem? Please describe.
NVIDIA's Triton inference server provides a feature that lets the user load models on multiple GPUs for inference (NVIDIA…
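For reference, the multi-GPU placement mentioned here is configured in Triton through the `instance_group` field of a model's `config.pbtxt`. A minimal sketch (instance count and GPU indices are illustrative):

```
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
```

This asks Triton to run two instances of the model, placed on GPUs 0 and 1.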
-
Hello! First of all, great job with this inference engine! Thanks a lot for your work!
Here's my issue: I have run vLLM with both a Mistral instruct model and its AWQ-quantized version. I've quant…
-
Using FlashInfer to start vLLM reported an error when enabling --quantization gptq --kv-cache-dtype fp8_e5m2.
Start command:
python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 78…
-
I have followed the instructions at https://github.com/ELS-RD/transformer-deploy/#feature-extraction--dense-embeddings to convert a sentence-transformers model (https://huggingface.co/sentence-transfo…
-
Hey all, I have a quick question, is onnxruntime-genai ([https://onnxruntime.ai/docs/genai/api/python.html](https://onnxruntime.ai/docs/genai/api/python.html)) supported in Triton Inference Server's O…