-
**Problem Description**
I have different Ollama endpoints and I would like to choose from them. Right now I can only configure one. I run smaller models locally and larger models on an inference server.…
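Until a multi-endpoint setting exists, a minimal workaround is to keep a small map of named endpoints and pick one per request. This is only a sketch against Ollama's `/api/generate` REST endpoint; the endpoint names, hostnames, and models below are placeholders, not part of the request above.

```python
import requests

# Hypothetical endpoint map: small models locally, large models on an inference server.
OLLAMA_ENDPOINTS = {
    "local": "http://localhost:11434",
    "server": "http://inference-box:11434",  # placeholder hostname
}

def generate(endpoint: str, model: str, prompt: str) -> str:
    """Send a non-streaming generate request to the chosen Ollama endpoint."""
    base_url = OLLAMA_ENDPOINTS[endpoint]
    resp = requests.post(
        f"{base_url}/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]

# Small model locally, large model on the remote server.
print(generate("local", "llama3.2:3b", "Say hello."))
print(generate("server", "llama3.1:70b", "Say hello."))
```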
-
### Motivation
Is there any endpoint within the API server where we are able to pull metrics like Running Requests, Waiting Requests, Swapped Requests, GPU Cache Usage, CPU Cache Usage, Latency…
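For reference, vLLM's OpenAI-compatible server already exposes Prometheus-format metrics at `/metrics`, including gauges for running/waiting/swapped requests and GPU/CPU KV-cache usage (exact metric names vary across versions). A rough scraping sketch, assuming the server listens on localhost:8000:

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed server address; adjust to your deployment.
METRICS_URL = "http://localhost:8000/metrics"

# Metric names as seen in recent vLLM releases; they may differ in yours.
WANTED = {
    "vllm:num_requests_running",
    "vllm:num_requests_waiting",
    "vllm:num_requests_swapped",
    "vllm:gpu_cache_usage_perc",
    "vllm:cpu_cache_usage_perc",
}

text = requests.get(METRICS_URL, timeout=10).text
for family in text_string_to_metric_families(text):
    if family.name in WANTED:
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)
```

Latency histograms (e.g. time to first token, end-to-end request latency) are exported from the same endpoint in recent versions.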
-
### Misc discussion on performance
Hi all, I'm having trouble maximizing the performance of batch inference for big models on vLLM 0.6.3
(Llama 3.1 70B, 405B, Mistral Large)
My command…
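Since the command above is cut off, the following is only a generic offline batch-inference sketch with vLLM's `LLM` API, showing the knobs that usually matter for throughput on large models (tensor parallelism, batched-token budget, memory utilization); the model name and sizes are placeholders, not the poster's actual setup.

```python
from vllm import LLM, SamplingParams

# Placeholder settings: set tensor_parallel_size to your GPU count and adjust
# max_model_len / max_num_batched_tokens to your workload.
llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    max_model_len=8192,
    max_num_batched_tokens=8192,  # larger budgets generally raise batch throughput
)

sampling = SamplingParams(temperature=0.7, max_tokens=256)

prompts = [f"Summarize document {i}." for i in range(1000)]
# vLLM schedules and batches these internally; no manual batching is needed.
outputs = llm.generate(prompts, sampling)
for out in outputs[:3]:
    print(out.outputs[0].text)
```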
-
## Willingness to contribute
- [ ] Yes. I can contribute this feature independently.
- [x] Yes. I would be willing to contribute this feature with guidance from the MLflow community.
- [ ] No. I ca…
-
**Describe the bug**
When we start `serve_reward_model.py` and run annotation, the server goes down during processing. It crashes on specific samples that have a long context.
[error.lo…
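Without knowing the server internals, one client-side workaround is to pre-truncate long samples before sending them for annotation so they stay under the reward model's context window. A generic sketch with a Hugging Face tokenizer; the model name and token limit are placeholders:

```python
from transformers import AutoTokenizer

# Placeholder model name and context limit; use the reward model's actual values.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
MAX_TOKENS = 4096

def truncate_sample(text: str) -> str:
    """Keep only the first MAX_TOKENS tokens of a sample."""
    ids = tokenizer(text, truncation=True, max_length=MAX_TOKENS)["input_ids"]
    return tokenizer.decode(ids, skip_special_tokens=True)

samples = ["...very long annotation context...", "short sample"]
safe_samples = [truncate_sample(s) for s in samples]
```

This only works around the crash; the long-context samples themselves should still be investigated.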
-
The same video works fine on Windows, but errors out on Ubuntu:
ERROR:root:An error occurred: choose a window size 400 that is [2, 160] | 0/24 [00:00
-
I get the following error when starting the inference server:
TypeError: Invalid type for device_requests param: expected list but found
Any pointers would be appreciated.
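This message typically comes from the Docker SDK for Python: `device_requests` must be a list of `docker.types.DeviceRequest` objects, not a single object or a dict. A minimal sketch of the expected call shape (the image name is a placeholder):

```python
import docker
from docker.types import DeviceRequest

client = docker.from_env()

# device_requests must be a *list*; passing a bare DeviceRequest (or a dict)
# triggers the "expected list but found ..." TypeError.
container = client.containers.run(
    "my-inference-server:latest",  # placeholder image
    device_requests=[DeviceRequest(count=-1, capabilities=[["gpu"]])],
    detach=True,
)
print(container.id)
```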
-
### What happened?
Within the worker, we map to the predibase base URL. It says https://api.app.predibase.com but it should be https://serving.app.predibase.com
Also, the model and usage are retur…
-
spec-infer works well for batch sizes (1, 2, 4, 8, 16), but when I change the batch size to 32 it fails with "stack smashing detected".
```
+ ncpus=16
+ ngpus=1
+ fsize=30000
+ zsize=60000
+ max_se…
-
I was trying to run the DLRMv2 benchmark of MLPerf Inference on an ARM server using the instructions [here](https://docs.mlcommons.org/inference/benchmarks/recommendation/dlrm-v2/#__tabbed_15_1).
…