-
https://docs.google.com/presentation/d/1jxj9zjeRRu1BJSf8tzaWoQVr5h7HZbOUsSh39Rcwv80/edit#slide=id.g10be2c57ddf_7_3
Aha! Link: https://nvaiinfa.aha.io/features/MERLIN-672
-
-
### Proposal to improve performance
Improve bitsandbytes quantization inference speed
### Report of performance regression
I'm testing llama-3.2-1b on a toy dataset. For offline inference using the…
-
**Problem Statement**
The SDK currently requires users to create specific object types (like EndpointCoreConfigInput, AiGatewayConfig, RateLimit, EndpointTag) when e.g. creating a serving endpoint (s…
-
Hi,
Thank you for your awesome repository, it helps me so much on my personal project :100: :+1:
I created this issue just to share my code for serving a model with Tensorflow Serving's gRPC. H…
-
Provide Pros, Cons, and final recommendation(s)
https://docs.bentoml.org/en/latest/
-
SQLFlow extends the SQL syntax to describe the end-to-end machine learning pipeline.
The end-to-end solution includes the model serving. The data transformation logic is consistent between training a…
-
### System Info
Meets requirements.txt; NVIDIA GeForce GPU.
### Information
- [X] The official example scripts
- [ ] My own modified scripts
### 🐛 Describe the bug
I'm using the remote::vllm s…
-
**What happened**:
When submitting a job following the README example, the pod stays Pending whenever nvidia.com/gpu is set to a value greater than 1.
With nvidia.com/gpu set to 1, the pod schedules normally.
**What you expected to happen**:
The pod should schedule normally when nvidia.com/gpu is greater than 1.
**How to reproduce it (as minim…
-
### 🚀 The feature, motivation and pitch
This paper might be of interest: https://arxiv.org/pdf/2305.05920.pdf
This paper improves inference efficiency by determining the priority of each inference…