containers / ramalama

The goal of RamaLama is to make working with AI boring.

No API Documentation #265

Open josiahbryan opened 2 hours ago

josiahbryan commented 2 hours ago

This was billed as "ollama compatible", but when I run `ramalama serve -p 11434 llama3.2`, my client code that works with ollama does NOT work (posting to /api/chat returns 404, and I can see the POST hit ramalama in the console as well).

Where's the documentation for the actual API served by ramalama? 🙏
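
For reference, this is the kind of Ollama-style call that 404s here (an illustrative sketch; the port and model name come from the serve command above, the payload is just an example):

```
# Ollama-style chat request that currently returns 404 against ramalama's server
curl http://localhost:11434/api/chat \
  -H 'Content-Type: application/json' \
  -d '{"model": "llama3.2", "messages": [{"role": "user", "content": "hello"}]}'
```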

ericcurtin commented 2 hours ago

We are using the upstream llama.cpp server by default:

https://github.com/ggerganov/llama.cpp/tree/master/examples/server

and it does say on that page:

looking for feedback and contributors

But there's also a "--runtime" flag, where the intent is switchable servers. vllm is one that we will support in the future, but it is only implemented in --nocontainer mode right now, so you have to set up vllm yourself.
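
For anyone hitting the same 404: the llama.cpp server exposes its own endpoints (documented at the link above) rather than Ollama's /api/chat. A minimal sketch against its OpenAI-compatible chat endpoint, assuming the server started with `ramalama serve -p 11434 llama3.2` is listening locally:

```
# llama.cpp server's OpenAI-compatible chat endpoint (see the README linked above)
curl http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{"messages": [{"role": "user", "content": "hello"}]}'
```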

josiahbryan commented 2 hours ago

Good to know! Thank you for sharing that. It would be really helpful to have a link to the API documentation on the serve page or in the README, because it was not at all obvious that I needed to go look for the llama.cpp REST documentation when I thought it would be just like the ollama API, haha.

Thank you for straightening me out on this one; you can close this issue. I really appreciate it.

rhatdan commented 1 hour ago

Care to open a PR to make this point in the README.md and potentially in the ramalama-serve.1.md file?

josiahbryan commented 1 hour ago

I haven't had time to fork and open a PR, but here's the patch; hope this helps:


From 17ba92537e15f98e51f85f992df8365afd938ecd Mon Sep 17 00:00:00 2001
From: Josiah Bryan <josiahbryan@gmail.com>
Date: Wed, 9 Oct 2024 09:47:06 -0500
Subject: [PATCH] docs: Added links to llama.cpp REST API documentation and
 fixed a spelling error

---
 docs/ramalama-serve.1.md | 20 +++++++++++++++++++-
 docs/ramalama.1.md       | 14 +++++++-------
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/docs/ramalama-serve.1.md b/docs/ramalama-serve.1.md
index 20f49de..cd9862b 100644
--- a/docs/ramalama-serve.1.md
+++ b/docs/ramalama-serve.1.md
@@ -1,38 +1,53 @@
 % ramalama-serve 1

 ## NAME
+
 ramalama\-serve - serve REST API on specified AI Model

 ## SYNOPSIS
-**ramalama serve** [*options*] *model*
+
+**ramalama serve** [*options*] _model_

 ## DESCRIPTION
+
 Serve specified AI Model as a chat bot. RamaLama pulls specified AI Model from
 registry if it does not exist in local storage.

+## REST API ENDPOINTS
+
+Under the hood, `ramalama-serve` uses the `llama.cpp` HTTP server by default.
+
+For REST API endpoint documentation, see: [https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints)
+
 ## OPTIONS

 #### **--detach**, **-d**
+
 Run the container in the background and print the new container ID.
 The default is TRUE. The --nocontainer option forces this option to False.

 Use the `ramalama stop` command to stop the container running the served ramalama Model.

 #### **--generate**=quadlet
+
 Generate specified configuration format for running the AI Model as a service

 #### **--help**, **-h**
+
 show this help message and exit

 #### **--name**, **-n**
+
 Name of the container to run the Model in.

 #### **--port**, **-p**
+
 port for AI Model server to listen on

 ## EXAMPLES

 Run two AI Models at the same time, notice that they are running within Podman Containers.
+

$ ramalama serve -p 8080 --name mymodel ollama://tiny-llm:latest
09b0e0d26ed28a8418fb5cd0da641376a08c435063317e89cf8f5336baf35cfa

@@ -47,6 +62,7 @@ CONTAINER ID IMAGE COMMAND CREATED


 Generate a quadlet for running the AI Model service
+

$ ramalama serve --name MyGraniteServer --generate=quadlet granite > $HOME/.config/containers/systemd/MyGraniteServer.container
$ cat $HOME/.config/containers/systemd/MyGraniteServer.container

@@ -85,7 +101,9 @@ CONTAINER ID IMAGE COMMAND CREATED


 ## SEE ALSO
+
 **[ramalama(1)](ramalama.1.md)**, **[ramalama-stop(1)](ramalama-stop.1.md)**, **quadlet(1)**, **systemctl(1)**, **podman-ps(1)**

 ## HISTORY
+
 Aug 2024, Originally compiled by Dan Walsh <dwalsh@redhat.com>
diff --git a/docs/ramalama.1.md b/docs/ramalama.1.md
index 382d9ba..10deb3b 100644
--- a/docs/ramalama.1.md
+++ b/docs/ramalama.1.md
@@ -16,12 +16,11 @@ AI Model for your systems setup. This eliminates the need for the user to
 configure the system for AI themselves. After the initialization, RamaLama
 will run the AI Models within a container based on the OCI image.

-RamaLama first pulls AI Models from model registires. It then start a chatbot
-or a service as a rest API from a simple single command. Models are treated similarly
-to the way that Podman or Docker treat container images.
+RamaLama first pulls AI Models from model registries. It then starts a chatbot
+or a service as a REST API (using llama.cpp's server) from a simple single command.
+Models are treated similarly to the way that Podman or Docker treat container images.

-RamaLama supports multiple AI model registries types called transports.
-Supported transports:
+RamaLama supports multiple AI model registry types called transports. Supported transports:

 ## TRANSPORTS
@@ -107,14 +106,15 @@ store AI Models in the specified directory (default rootless: `$HOME/.local/shar
 | [ramalama-push(1)](ramalama-push.1.md)            | push AI Models from local storage to remote registries     |
 | [ramalama-rm(1)](ramalama-rm.1.md)                | remove AI Models from local storage                         |
 | [ramalama-run(1)](ramalama-run.1.md)              | run specified AI Model as a chatbot                        |
-| [ramalama-serve(1)](ramalama-serve.1.md)          | serve REST API on specified AI Model                       |
+| [ramalama-serve(1)](ramalama-serve.1.md)          | serve REST API on specified AI Model using `llama.cpp`     |
 | [ramalama-stop(1)](ramalama-stop.1.md)            | stop named container that is running AI Model              |
 | [ramalama-version(1)](ramalama-version.1.md)      | display version of RamaLama
 ## CONFIGURATION FILES

 ## SEE ALSO
-**[podman(1)](https://github.com/containers/podman/blob/main/docs/podman.1.md)**
+- **[podman(1)](https://github.com/containers/podman/blob/main/docs/podman.1.md)**
+- **[llama.cpp API endpoints](https://github.com/ggerganov/llama.cpp/blob/master/examples/server/README.md#api-endpoints)**

 ## HISTORY
 Aug 2024, Originally compiled by Dan Walsh <dwalsh@redhat.com>
-- 
2.39.3 (Apple Git-146)
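
If it's easier than waiting for a PR, a mail-formatted patch like the one above should apply with git am (a sketch; the filename below is just an example, save the message above to a file first):

```
# save the patch above to a file, then apply it with authorship preserved
git am 0001-docs-add-llama.cpp-rest-api-links.patch
```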