-
- [x] I have searched the [issues](https://github.com/seata/seata/issues) of this repository and believe that this is not a duplicate.
### Ⅰ. Issue Description
- `org.apache.seata:seata-mock…
-
Hello,
Similar to #3, I've tried reproducing the `demo.py` benchmark on an H100 and an A6000, and I'm also seeing no speedup on these platforms at lower precisions.
It was mentioned this is du…
-
I have 2 nodes, each with a 16GB GPU, and I want to run the llama-2-13b-hf model across these 2 nodes with 1 replica.
cat /job/hostfile:
```
deepspeed-mii-inference-worker-0 slots=1
deepspeed-mii-in…
```
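For reference, a deployment along these lines would typically be launched with `mii.serve`; this is only a sketch, and the deployment name is a placeholder rather than anything from this issue:

```python
# Hedged sketch: serve llama-2-13b with DeepSpeed-MII, sharding the model
# across the 2 GPUs (tensor_parallel=2) with a single replica.
# "llama-13b" is a hypothetical deployment name.
import mii

mii.serve(
    "meta-llama/Llama-2-13b-hf",
    deployment_name="llama-13b",
    tensor_parallel=2,
    replica_num=1,
)
```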
-
### Prerequisites
- [X] I am running the latest code. Mention the version if possible as well.
- [X] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.…
-
**Description**
I deployed Triton Inference Server on Kubernetes (GKE). To balance the load, I created a Service of type `LoadBalancer`. As a client, I'm using the Python HTTP client. I was expecting all the …
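For context, the client setup described above would look roughly like this; a sketch only, with a placeholder hostname rather than the reporter's actual service address:

```python
# Hedged sketch of the Python HTTP client pointed at the LoadBalancer's
# external address ("triton-lb.example.com" is a placeholder hostname).
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="triton-lb.example.com:8000")
print(client.is_server_live())
```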
-
### System Info
Docker image: ghcr.io/huggingface/text-generation-inference:2.2.0-rocm
Hardware: AMD MI250
### Information
- [X] Docker
- [ ] The CLI directly
### Tasks
- [x] An officially suppo…
-
### What is the issue?
Not sure if this is a bug, damaged hardware, or a driver issue, but I thought I would report it just in case.
Ollama sees 23.7GB available on each card when it detects them, bu…
-
A few options to explore:
1. NVIDIA NeMo, TensorRT-LLM, Triton
- NeMo
  Run [this Generative AI example](https://github.com/NVIDIA/GenerativeAIExamples/tree/main/models/Gemma) to build LoRA wi…
-
**Description**
I'm running Triton Inference Server with the vLLM backend as a container on Kubernetes.
I followed the [Triton metrics documentatio…
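For reference, Triton serves Prometheus metrics on port 8002 at `/metrics` by default; a minimal scrape looks like the sketch below (the hostname is a placeholder):

```python
# Hedged sketch: fetch Triton's Prometheus metrics endpoint
# ("triton.example.com" is a placeholder for the pod/service address).
import requests

resp = requests.get("http://triton.example.com:8002/metrics")
print(resp.text[:400])  # show the first few metric lines
```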
-
**Describe the package you'd like added**
`llama.cpp` has become a popular inference server for LLMs. Additionally, `llama-cpp-python` is commonly used to connect from Python to `llama.cpp`.
- `l…
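For illustration, connecting from Python via `llama-cpp-python` typically takes only a few lines; the model path and prompt here are placeholders:

```python
# Hedged sketch of basic llama-cpp-python usage; the GGUF path is a placeholder.
from llama_cpp import Llama

llm = Llama(model_path="./models/llama-2-7b.Q4_K_M.gguf")
out = llm("Q: What does llama.cpp do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```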