-
## Describe the bug
Download https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.2/mistralrs-server-aarch64-apple-darwin.tar.xz
Use a tool like [asitop](https://github.com/tlkh/asito…
-
When I run glm-4-9b-chat-Q5_K_M.gguf on a CUDA 12 machine, the API server starts successfully. However, when I send a question, the API server crashes.
The command I used to start the …
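Since the start command is truncated above, the following is only a reproduction sketch: it assumes an OpenAI-compatible `/v1/chat/completions` endpoint on `localhost:8080` and the model name `glm-4-9b-chat`, none of which are stated in the report.
```python
# Reproduction sketch. Assumptions (not from the report): an
# OpenAI-compatible API at localhost:8080 and model name "glm-4-9b-chat".
import requests

resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "glm-4-9b-chat",
        "messages": [{"role": "user", "content": "Hello, can you introduce yourself?"}],
    },
    timeout=60,
)
print(resp.status_code)
print(resp.json())
```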
-
```yaml
services:
  tabby:
    restart: always
    image: tabbyml/tabby
    entrypoint: /opt/tabby/bin/tabby-cpu
    command: serve --model StarCoder-1B --chat-model Qwen2-1.5B-Instruct
    vol…
```
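Once the container above is up, a minimal smoke test could look like the sketch below; the port `8080` and the `/v1/health` and `/v1/completions` routes are assumptions based on Tabby's public REST API, not details from this excerpt (the compose file is truncated before any port mapping).
```python
# Smoke-test sketch for the Tabby container above. Port 8080 and the
# /v1/health and /v1/completions routes are assumptions, not from the report.
import requests

base = "http://localhost:8080"

# Health check: reports the loaded models and device.
print(requests.get(f"{base}/v1/health", timeout=10).json())

# One code-completion request against the configured completion model.
resp = requests.post(
    f"{base}/v1/completions",
    json={"language": "python", "segments": {"prefix": "def fib(n):\n    "}},
    timeout=60,
)
print(resp.json())
```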
-
### Summary
I'm trying to run llama-api-server with Llama-3-8B in the wasm-sandboxer of Kuasar, but I got some errors.
### Current State
_No response_
### Expected State
_No response_
### …
-
**Describe the bug**
I'm noticing the error below with our Tabby deployment; it looks like a memory error. I don't have any additional logs, since we've modified the logs to mask input/output information, th…
-
**Please describe the feature you want**
Related: https://github.com/TabbyML/tabby/issues/2652
This would allow a local deployment to use less VRAM / compute.
**Additional co…
-
Hi there,
First, thank you for Unsloth, it's great!
I've fine-tuned llama-3-8b-Instruct-bnb-4bit and pushed it to the HF Hub. When I try to deploy it using [HF Inference Endpoints](https://huggingfa…
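One way to rule out a broken upload before debugging the endpoint is to load the pushed checkpoint locally with Unsloth. A minimal sketch, where `your-user/llama-3-8b-finetune` is a hypothetical placeholder for the real repo id:
```python
# Sketch: verify the pushed checkpoint loads before deploying it.
# "your-user/llama-3-8b-finetune" is a hypothetical placeholder repo id.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="your-user/llama-3-8b-finetune",
    max_seq_length=2048,
    load_in_4bit=True,  # matches the bnb-4bit base model
)
FastLanguageModel.for_inference(model)  # switch the model to inference mode
```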
-
When I add llama_cpp-rs to my Cargo.toml, llama.cpp seems to be pinned to an older version. I'm trying to use Phi-3 128k in a project and I'm unable to, because the [PR that was merged into llama.cpp](h…
-
### System Info
I am trying to run TGI on Docker using 8 GPUs with 16 GB each (in-house server). Docker works fine when using a single GPU.
My server crashes when using all GPUs. Is there any other wa…
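For context (not from this report): TGI shards a model across GPUs through the launcher's `--num-shard` flag, and multi-GPU runs generally need extra shared memory for NCCL. A sketch using the Docker SDK for Python, with a placeholder model id:
```python
# Sketch: launch TGI sharded across all visible GPUs via the Docker SDK.
# "meta-llama/Meta-Llama-3-8B" is only a placeholder model id.
import docker

client = docker.from_env()
container = client.containers.run(
    "ghcr.io/huggingface/text-generation-inference:latest",
    command=["--model-id", "meta-llama/Meta-Llama-3-8B", "--num-shard", "8"],
    device_requests=[docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])],
    shm_size="1g",          # NCCL needs shared memory for inter-GPU transfers
    ports={"80/tcp": 8080},
    detach=True,
)
print(container.id)
```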
-
Hello, I'm not sure if multi-GPU is supported yet. I didn't find any parameters for tensor parallelism, and the "num_device_layers" parameter seems not to work. Please let me know if it is supported or there are plans to…