-
Hi, your DistServe paper is really good and insightful; thanks a lot for the ideas and implementations! Recently I got an idea for exploring further into the domain of prefill-decode disaggregation, b…
-
### Motivation
This library, https://github.com/mit-han-lab/qserve, introduces the W4A8KV4 quantization method, called QoQ in the paper (https://arxiv.org/abs/2405.04532), which **delivers performance g…
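For intuition, here is a toy sketch of the W4 part of a W4A8KV4 scheme (per-group symmetric 4-bit weight quantization). This is only my own illustration, not QServe's kernel: the actual QoQ additionally uses progressive group quantization with int8 intermediates and SmoothAttention, which this omits.

```python
import torch

def quantize_w4_per_group(weight: torch.Tensor, group_size: int = 128):
    """Toy per-group symmetric 4-bit weight quantization (the W4 in W4A8KV4).

    Illustration only: QServe's QoQ additionally uses progressive group
    quantization (int8 intermediates) and SmoothAttention, omitted here.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group so values map onto the signed 4-bit range [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scale), -8, 7)
    return q.to(torch.int8), scale  # int4 values stored in int8 containers

# Dequantize with q.float() * scale to recover an approximation of `weight`.
```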
-
Here are some suggestions for frameworks for self-hosted serving of LLMs and related tasks.
# Embeddings from OpenAI CLIP
Jina
https://github.com/jina-ai/clip-as-service (Apache)
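For reference, a minimal usage sketch with the `clip_client` package that clip-as-service ships; the server address here is just a placeholder:

```python
from clip_client import Client

# Connect to a running clip-as-service server (address is a placeholder).
c = Client('grpc://0.0.0.0:51000')

# Texts and image URLs are encoded into the same CLIP embedding space.
vectors = c.encode(['a photo of a cat', 'https://picsum.photos/200'])
print(vectors.shape)  # (2, embedding_dim), returned as a numpy array
```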
# Text-embeddings:
My o…
-
### System Info
## Description
I am building the DJL-Serving TensorRT-LLM LMI inference container from scratch and deploying it on SageMaker endpoints for the Zephyr-7B model. Unfortunately, I run i…
-
In the guide it says:
> Building from source code is necessary if you want the best performance
https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html
I have a custom s…
-
### Is there an existing issue for the same bug?
- [X] I have checked the existing issues.
### Branch name
main
### Commit ID
83803a7
### Other environment information
```Markdown…
-
### Required prerequisites
- [X] I have searched the [Issue Tracker](https://github.com/camel-ai/camel/issues) and [Discussions](https://github.com/camel-ai/camel/discussions) that this hasn't alre…
-
Hello,
I've encountered an issue where the request launcher does not allow the next requests to be sent until all requests specified by `num_concurrent_requests` have finished.
This behavior see…
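For comparison, here is a minimal sketch of the dispatch behavior I would expect instead: a semaphore frees a slot as soon as any single request finishes, so `num_concurrent_requests` stays in flight continuously. This is my own illustration, not the launcher's actual code; `send_fn` is a placeholder for whatever coroutine issues a request.

```python
import asyncio

async def run_with_constant_concurrency(requests, num_concurrent_requests, send_fn):
    """Keep num_concurrent_requests in flight at all times.

    A semaphore releases a slot as soon as any single request finishes,
    instead of waiting for the whole batch to drain before dispatching more.
    `send_fn` is a placeholder for the coroutine that actually issues a request.
    """
    sem = asyncio.Semaphore(num_concurrent_requests)

    async def worker(req):
        async with sem:
            return await send_fn(req)

    return await asyncio.gather(*(worker(r) for r in requests))
```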
-
Hi,
Could you please provide a guide on integrating the DeepSpeed approach for multi-GPU Intel Flex 140 model inference, serving the model with a FastAPI and uvicorn setup?
model id: 'meta-llama/Llama-2-7…
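In case it helps frame the question, below is a minimal sketch of what I have in mind, assuming DeepSpeed's `init_inference` tensor-parallel path and a Hugging Face model. The full model id, the Intel XPU specifics (e.g. intel_extension_for_pytorch), and the multi-rank serving details are assumptions this sketch glosses over.

```python
# Minimal sketch, not a verified Intel Flex 140 recipe. Launch with e.g.
# `deepspeed --num_gpus 2 app.py`; note every rank would run uvicorn, so a
# real setup must bind the port on rank 0 only and broadcast requests.
import os
import torch
import deepspeed
from fastapi import FastAPI
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # assumed full id (truncated above)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16)

# Shard the model across the visible devices (2 for a dual-GPU Flex 140 card).
engine = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": int(os.getenv("WORLD_SIZE", "2"))},
    dtype=torch.float16,
)

app = FastAPI()

@app.post("/generate")
def generate(prompt: str, max_new_tokens: int = 64):
    inputs = tokenizer(prompt, return_tensors="pt").to(engine.module.device)
    output = engine.module.generate(**inputs, max_new_tokens=max_new_tokens)
    return {"text": tokenizer.decode(output[0], skip_special_tokens=True)}
```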
-
Hi there,
I am wondering what hardware Ray uses for serving in this llmperf leaderboard. Is it CPU or GPU? If it is GPU, which model is it?
Thanks,
Fizzbb