josiahbryan opened this issue 2 hours ago
We are using upstream llama.cpp server by default:
https://github.com/ggerganov/llama.cpp/tree/master/examples/server
and it does say on that page:
looking for feedback and contributors
But there's also a `--runtime` flag; the intent is switchable servers. vllm is one we plan to support in the future, but it is currently only implemented in `--nocontainer` mode, so one must set up vllm themselves.
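Since the backend is the upstream llama.cpp server, one way to sanity-check what is actually answering on the port is its `GET /health` endpoint (documented in the llama.cpp server README). A minimal probe, assuming a server was started with something like `ramalama serve -p 8080 <model>` (the port here is illustrative):

```python
import urllib.error
import urllib.request

def server_is_up(base_url: str, timeout: float = 2.0) -> bool:
    """Probe the llama.cpp server's GET /health endpoint."""
    try:
        with urllib.request.urlopen(base_url + "/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Nothing listening, connection refused, or timed out.
        return False

print(server_is_up("http://localhost:8080"))
```

If this returns `False`, nothing llama.cpp-compatible is listening on that port.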
Good to know! Thank you for sharing that. It would be really helpful to have a link to the API documentation on the serve page, or mentioned in the README, because it was not at all obvious that I needed to go look for the llama.cpp REST documentation when I assumed it would be just like the ollama API, haha.
Thank you for straightening me out on this one; you can close this issue. I really appreciate it.
Care to open a PR to make this point in the README.md and potentially in the ramalama-serve.1.md file?
I haven't had time to fork and make a PR, but here's the patch, hope this helps:
From 17ba92537e15f98e51f85f992df8365afd938ecd Mon Sep 17 00:00:00 2001
From: Josiah Bryan <josiahbryan@gmail.com>
Date: Wed, 9 Oct 2024 09:47:06 -0500
Subject: [PATCH] docs: Added links to llama.cpp REST API documentation and
fixed a spelling error
---
docs/ramalama-serve.1.md | 20 +++++++++++++++++++-
docs/ramalama.1.md | 14 +++++++-------
2 files changed, 26 insertions(+), 8 deletions(-)
diff --git a/docs/ramalama-serve.1.md b/docs/ramalama-serve.1.md
index 20f49de..cd9862b 100644
--- a/docs/ramalama-serve.1.md
+++ b/docs/ramalama-serve.1.md
@@ -1,38 +1,53 @@
% ramalama-serve 1
## NAME
+
ramalama\-serve - serve REST API on specified AI Model
## SYNOPSIS
-**ramalama serve** [*options*] *model*
+
+**ramalama serve** [*options*] *model*
## DESCRIPTION
+
Serve specified AI Model as a chat bot. RamaLama pulls specified AI Model from
registry if it does not exist in local storage.
+## REST API ENDPOINTS
+
+Under the hood, `ramalama-serve` uses the `LLaMA.cpp` HTTP server by default.
+
+For REST API endpoint documentation, see: [https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints)
+
## OPTIONS
#### **--detach**, **-d**
+
Run the container in the background and print the new container ID.
The default is TRUE. The --nocontainer option forces this option to False.
Use the `ramalama stop` command to stop the container running the served ramalama Model.
#### **--generate**=quadlet
+
Generate specified configuration format for running the AI Model as a service
#### **--help**, **-h**
+
show this help message and exit
#### **--name**, **-n**
+
Name of the container to run the Model in.
#### **--port**, **-p**
+
port for AI Model server to listen on
## EXAMPLES
Run two AI Models at the same time, notice that they are running within Podman Containers.
+
$ ramalama serve -p 8080 --name mymodel ollama://tiny-llm:latest
09b0e0d26ed28a8418fb5cd0da641376a08c435063317e89cf8f5336baf35cfa
@@ -47,6 +62,7 @@ CONTAINER ID IMAGE COMMAND CREATED
Generate a quadlet for running the AI Model service
+
$ ramalama serve --name MyGraniteServer --generate=quadlet granite > $HOME/.config/containers/systemd/MyGraniteServer.container
$ cat $HOME/.config/containers/systemd/MyGraniteServer.container
@@ -85,7 +101,9 @@ CONTAINER ID IMAGE COMMAND CREATED
## SEE ALSO
+
**[ramalama(1)](ramalama.1.md)**, **[ramalama-stop(1)](ramalama-stop.1.md)**, **quadlet(1)**, **systemctl(1)**, **podman-ps(1)**
## HISTORY
+
Aug 2024, Originally compiled by Dan Walsh <dwalsh@redhat.com>
diff --git a/docs/ramalama.1.md b/docs/ramalama.1.md
index 382d9ba..10deb3b 100644
--- a/docs/ramalama.1.md
+++ b/docs/ramalama.1.md
@@ -16,12 +16,11 @@ AI Model for your systems setup. This eliminates the need for the user to
configure the system for AI themselves. After the initialization, RamaLama
will run the AI Models within a container based on the OCI image.
-RamaLama first pulls AI Models from model registires. It then start a chatbot
-or a service as a rest API from a simple single command. Models are treated similarly
-to the way that Podman or Docker treat container images.
+RamaLama first pulls AI Models from model registries. It then starts a chatbot
+or a service as a REST API (using llama.cpp's server) from a single command.
+Models are treated similarly to the way that Podman or Docker treat container images.
-RamaLama supports multiple AI model registries types called transports.
-Supported transports:
+RamaLama supports multiple AI model registry types, called transports. Supported transports:
## TRANSPORTS
@@ -107,14 +106,15 @@ store AI Models in the specified directory (default rootless: `$HOME/.local/shar
| [ramalama-push(1)](ramalama-push.1.md) | push AI Models from local storage to remote registries |
| [ramalama-rm(1)](ramalama-rm.1.md) | remove AI Models from local storage |
| [ramalama-run(1)](ramalama-run.1.md) | run specified AI Model as a chatbot |
-| [ramalama-serve(1)](ramalama-serve.1.md) | serve REST API on specified AI Model |
+| [ramalama-serve(1)](ramalama-serve.1.md) | serve REST API on specified AI Model using `llama.cpp` |
| [ramalama-stop(1)](ramalama-stop.1.md) | stop named container that is running AI Model |
| [ramalama-version(1)](ramalama-version.1.md) | display version of RamaLama
## CONFIGURATION FILES
## SEE ALSO
-**[podman(1)](https://github.com/containers/podman/blob/main/docs/podman.1.md)**
+- **[podman(1)](https://github.com/containers/podman/blob/main/docs/podman.1.md)**
+- **[llama.cpp API endpoints](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints)**
## HISTORY
Aug 2024, Originally compiled by Dan Walsh <dwalsh@redhat.com>
--
2.39.3 (Apple Git-146)
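To make the patch's point concrete: the endpoints served by `ramalama serve` are llama.cpp's, e.g. `POST /completion` for raw completions (per the llama.cpp server README). A sketch of a client request; the port, prompt, and `n_predict` value are illustrative, and the request is only built here, not sent:

```python
import json
import urllib.request

def completion_request(base_url: str, prompt: str, n_predict: int = 32) -> urllib.request.Request:
    """Build a POST to llama.cpp's /completion endpoint (not sent here)."""
    body = json.dumps({"prompt": prompt, "n_predict": n_predict}).encode()
    return urllib.request.Request(
        base_url + "/completion",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = completion_request("http://localhost:8080", "Say hello.")
print(req.full_url)  # http://localhost:8080/completion
# urllib.request.urlopen(req) would send it to a running server.
```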
This was billed as "ollama compatible", but when I run
ramalama serve -p 11434 llama3.2
my client code that works with ollama does NOT work (posting to /api/chat returns 404, and I see the POST hit ramalama in the console as well). Where's the API documentation for the actual API served by ramalama? 🙏