-
### Search before asking
- [X] I had searched in the [issues](https://github.com/ray-project/kuberay/issues) and found no similar feature requirement.
/cc Bytedancer @Basasuya @Yicheng-Lu-llll
…
-
Hey, this project seems really interesting. There is currently hardly any competitor to ChatGPT's advanced voice mode, but this seems to be going in the same direction.
Currently the device being used is `cuda`, ca…
-
### Description
The cohere rerank implementation allows configuring fields that probably don't apply. The implementation leverages the common settings here: https://github.com/elastic/elasticsearch/b…
-
### Description
A customer is interested in using the Elasticsearch inference API with text generation models on Hugging Face, whereas as of 8.15 we are limited to supporting only `text_embedding`
-
0x416
Medium
# Lack of error handling when making a blockless API call
## Summary
Lack of error handling when making a blockless API call
## Vulnerability Detail
Error handling when making blockless…
-
### 🚀 The feature, motivation and pitch
I launched an LLM service with vLLM, and I use the AsyncOpenAI client for high-throughput output, like this:
```python
async def async_llm_infer_sampling(prompt, a…
```
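The fan-out pattern behind this can be sketched with `asyncio.gather`. This is an illustration, not the issue author's code: `batch_complete` is a hypothetical helper name, and `client` stands in for any async client exposing an awaitable `completions.create(...)`, e.g. an `openai.AsyncOpenAI` instance pointed at the vLLM server's OpenAI-compatible endpoint.

```python
import asyncio

async def batch_complete(client, model, prompts, max_concurrency=64):
    """Send many prompts concurrently through an async client.

    `client` is assumed to expose `completions.create(model=..., prompt=...)`
    as an awaitable (e.g. openai.AsyncOpenAI against a vLLM server).
    A semaphore caps in-flight requests so the server isn't flooded.
    """
    sem = asyncio.Semaphore(max_concurrency)

    async def one(prompt):
        async with sem:
            return await client.completions.create(model=model, prompt=prompt)

    # asyncio.gather returns results in the same order as the inputs
    return await asyncio.gather(*(one(p) for p in prompts))
```

Concurrency (not threads) is what gives the throughput here: while one request waits on the server, the event loop issues the others.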
-
While we support batched inference like other constrained decoding libraries, the current implementation can be parallelized further. In particular, we can mask logits in batch and run several `kbnf` …
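One way the batched masking step could look (a NumPy sketch under assumed shapes, not the library's actual code; how the `kbnf` engine produces the per-sequence allowed-token sets is out of scope here):

```python
import numpy as np

def mask_logits_batch(logits, allowed_token_ids):
    """Mask a (batch, vocab) logits array in one vectorized pass.

    allowed_token_ids: one iterable per sequence, giving the token ids the
    grammar engine permits next for that sequence. Disallowed positions are
    set to -inf so softmax assigns them zero probability mass.
    """
    mask = np.zeros(logits.shape, dtype=bool)
    for row, allowed in enumerate(allowed_token_ids):
        mask[row, list(allowed)] = True
    return np.where(mask, logits, -np.inf)
```

The per-row loop only builds the boolean mask; the expensive part, writing -inf across the whole (batch, vocab) array, happens in a single vectorized `np.where`.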
-
I want to perform inference on quantized LLAMA (W8A16) on ARM-v9 (with SVE) using oneDNN. The LLAMA weights are per-group quantized.
Based on my understanding, I need to prepack the weights to redu…
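For reference, per-group symmetric W8A16 weight quantization itself can be sketched in plain NumPy, independent of oneDNN; the group size is an assumption, and the SVE-friendly prepacking/layout step would be handled separately (e.g. by oneDNN reorder primitives):

```python
import numpy as np

def quantize_per_group(w, group_size=64):
    """Symmetric int8 per-group quantization of a (rows, cols) float weight.

    Each group of `group_size` consecutive values along a row shares one
    float scale, so dequantization is w ≈ q * scale (W8A16: int8 weights,
    higher-precision activations).
    """
    rows, cols = w.shape
    assert cols % group_size == 0
    g = w.reshape(rows, cols // group_size, group_size)
    scale = np.abs(g).max(axis=-1, keepdims=True) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # guard all-zero groups
    q = np.clip(np.round(g / scale), -127, 127).astype(np.int8)
    return q.reshape(rows, cols), scale.squeeze(-1)

def dequantize_per_group(q, scale, group_size=64):
    rows, cols = q.shape
    g = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (g * scale[..., None]).reshape(rows, cols)
```

The round-trip error per element is bounded by half a scale step, which is why smaller groups (tighter scales) trade memory for accuracy.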
-
**What would you like to be added**:
Add the group size as an env var
**Why is this needed**:
In most cases of multi-host inference the group size is needed, e.g. by vLLM.
I suggest using LWS_G…
-
### OpenVINO Version
2021.2.1.0
### Operating System
Windows System
### Device used for inference
CPU
### OpenVINO installation
Build from source
### Programming Language
C++
### Hardware Ar…