-
- [x] I have searched the [issues](https://github.com/seata/seata/issues) of this repository and believe that this is not a duplicate.
### Ⅰ. Issue Description
- `org.apache.seata:seata-mock…
-
Hello,
Similar to #3, I've tried reproducing the `demo.py` benchmark on an H100 and an A6000, and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned this is du…
-
I have 2 nodes, each with a 16GB GPU, and I want to run the llama-2-13b-hf model across these 2 nodes with 1 replica.
cat /job/hostfile:
```
deepspeed-mii-inference-worker-0 slots=1
deepspeed-mii-in…
```
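For reference, a deployment along these lines would typically be launched with `mii.serve`; this is only a sketch, and the deployment name is a placeholder rather than anything from this issue:

```python
# Hedged sketch: serve llama-2-13b with DeepSpeed-MII, sharding the model
# across the 2 GPUs (tensor_parallel=2) with a single replica.
# "llama-13b" is a hypothetical deployment name.
import mii

mii.serve(
    "meta-llama/Llama-2-13b-hf",
    deployment_name="llama-13b",
    tensor_parallel=2,
    replica_num=1,
)
```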
-
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.…
-
**Description**
I deployed Triton Inference Server on Kubernetes (GKE). To balance the load, I created a Service of type `LoadBalancer`. As a client, I'm using the Python HTTP client. I was expecting all the …
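For context, the client setup described above would look roughly like this; a sketch only, with a placeholder hostname rather than the reporter's actual service address:

```python
# Hedged sketch of the Python HTTP client pointed at the LoadBalancer's
# external address ("triton-lb.example.com" is a placeholder hostname).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton-lb.example.com:8000")
print(client.is_server_live())
```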
-
### System Info
Docker image: ghcr.io/huggingface/text-generation-inference:2.2.0-rocm
Hardware: AMD MI250
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [x] An officially suppo…
-
### What is the issue?
Not sure if this is a bug, damaged hardware, or a driver issue, but I thought I would report it just in case.
Ollama sees 23.7GB available on each card when it detects them, bu…
-
A few options to explore:
1. NVIDIA NeMo, TensorRT-LLM, Triton
- NeMo
  Run [this Generative AI example](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/models/Gemma) to build LoRA wi…
-
**Description**
I'm running Triton Inference Server with the vLLM backend as a container on Kubernetes.
I followed the [Triton metrics documentatio…
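For reference, Triton serves Prometheus metrics on port 8002 at `/metrics` by default; a minimal scrape looks like the sketch below (the hostname is a placeholder):

```python
# Hedged sketch: fetch Triton's Prometheus metrics endpoint
# ("triton.example.com" is a placeholder for the pod/service address).
import requests

resp = requests.get("http://triton.example.com:8002/metrics")
print(resp.text[:400])  # show the first few metric lines
```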
-
**Describe the package you'd like added**
`llama.cpp` has become a popular inference server for LLMs. Additionally, `llama-cpp-python` is commonly used to connect from Python to `llama.cpp`.
- `l…
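For illustration, connecting from Python via `llama-cpp-python` typically takes only a few lines; the model path and prompt here are placeholders:

```python
# Hedged sketch of basic llama-cpp-python usage; the GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")
out = llm("Q: What does llama.cpp do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```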