Mozilla-Ocho / llamafile

Distribute and run LLMs with a single file.
https://llamafile.ai

How to talk to llamafile's OpenAI API endpoint #53

Closed Maxamed closed 11 months ago

Maxamed commented 11 months ago

How do I connect to it using the API? I've installed it and it works great, but I want to talk to it programmatically via the API.

jart commented 11 months ago

Here's an example of how you can talk to the OpenAI Completions API provided by your llamafile server.

Note: Due to a bug in the most recent 0.2.1 release, this example will currently only work if you build llamafile-server at HEAD. You can do that by downloading the cosmocc compiler and putting it on your $PATH as discussed in the README. Then run:

make -j8

That builds the following program, which you'd then run:

o//llama.cpp/server/server -m ~/weights/llava-v1.5-7b-Q4_K.gguf

You now have a llamafile server running on localhost port 8080. You can now use its completions API. Here is the quickstart tutorial example that OpenAI provides at https://platform.openai.com/docs/quickstart

curl http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "gpt-3.5-turbo",
       "messages": [
         {
           "role": "system",
           "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
         },
         {
           "role": "user",
           "content": "Compose a poem that explains the concept of recursion in programming."
         }
       ]
     }'

You could put that in a shell script for example, and see something like the following:

jart@studio:~$ ~/scratch/completions-client.sh
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"In programming, recursion is a tool divine,\nA way to solve problems, both big and fine.\nIt's like a loop, but with a twist,\nA function that calls itself, and never quits.\n\nIt starts with an initial case,\nA base to build upon, a place to begin.\nThen, it calls itself, and adds a new case,\nUntil the problem is solved, or the stack is too vast.\n\nRecursion is powerful, and can be quite deep,\nBut with care and practice, it can be a treat.\nIt's a tool that can solve problems, both small and large,\nAnd it's a concept that's worth mastering, to make your code go far.\n\nSo if you're stuck on a problem, and you can't seem to find a way,\nTry recursion, and see what it can do today.\nIt might just be the tool you need, to solve your problem and win,\nWith recursion, you'll be programming, like a true machine.","role":"assistant"}}],"created":1701666907,"id":"chatcmpl-sho1yAtTvl32sAymCUMZvPIYwvm6C1hf","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":227,"prompt_tokens":76,"total_tokens":303}}

You now have your response JSON. It's not very readable on the shell. The assumption is that you'd be using your programming language of choice, e.g. Python, with its appropriate HTTP and JSON libraries (or some high-level OpenAI client library veneer) to do the actual talking to the server.
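
For example, here's a minimal Python sketch using the openai client library (version 1.x) pointed at the local server. The model name and API key are placeholders; this assumes the local server doesn't validate them.

# Minimal sketch: talk to a local llamafile server with the openai Python
# client (openai>=1.0). Assumes the server is listening on localhost:8080
# as started above. Model name and API key are placeholders, assumed to be
# ignored by the local server.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # point the client at llamafile
    api_key="sk-no-key-required",         # placeholder; no real key needed
)

completion = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are a poetic assistant, skilled in explaining "
                    "complex programming concepts with creative flair."},
        {"role": "user",
         "content": "Compose a poem that explains the concept of recursion "
                    "in programming."},
    ],
)

print(completion.choices[0].message.content)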

This concludes the tutorial. Thanks for reaching out, and enjoy llamafile!

Maxamed commented 11 months ago

Thank you for the prompt reply.

I've tried the above, but the server crashes:

{"timestamp":1701667611,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080} all slots are idle and system prompt is empty, clear the KV cache {"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50501,"status":200,"method":"GET","path":"/","params":{}} {"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50502,"status":200,"method":"GET","path":"/completion.js","params":{}} {"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50503,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}} {"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50501,"status":200,"method":"GET","path":"/index.js","params":{}} llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Users/jj/Projects/llamafile/llava-v1.5-7b-q4-server.llamafile 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0) [1] 82671 abort ./llava-v1.5-7b-q4-server.llamafile

mneedham commented 11 months ago

I tried this as well, using Mistral weights:

o/llama.cpp/server/server \
  -m mistral-7b-instruct-v0.1.Q4_K_M.gguf

Available slots:
 -> Slot 0 - max context: 512

llama server listening at http://127.0.0.1:8080

loading weights...
{"timestamp":1701671568,"level":"INFO","function":"main","line":3045,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62091,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62092,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}

If I try to run the cURL request:

curl http://localhost:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "model": "gpt-3.5-turbo",
       "messages": [
         {
           "role": "system",
           "content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
         },
         {
           "role": "user",
           "content": "Compose a poem that explains the concept of recursion in programming."
         }
       ]
     }'

It sort of hangs there, not getting any response. If I go over to the llamafile server tab I see the following output:

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255

Maxamed commented 11 months ago

this works:

curl http://127.0.0.1:8080/v1/chat/completions \
     -H "Content-Type: application/json" \
     -d '{
       "stop": null,
       "messages": [
         { "role": "user", "content": "tell me history of canada" }
       ]
     }'
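
For doing the same thing programmatically, here's a rough Python equivalent of that working request using the requests library (assuming the server from this thread is still listening on 127.0.0.1:8080):

# Rough sketch of the working curl call above, from Python.
import requests

resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "stop": None,  # mirrors the "stop": null that made the curl call work
        "messages": [
            {"role": "user", "content": "tell me history of canada"},
        ],
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])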

Maxamed commented 11 months ago

The remaining issue is a CORS error if you want to interact with it programmatically.

jart commented 11 months ago

@Maxamed As mentioned earlier, that eos crash will happen unless you build from source right now. What exactly is the issue with CORS? As far as I know, the server always sends Access-Control-Allow-Origin: *.

basujindal commented 11 months ago

I am also getting a CORS error if I send a fetch request to http://127.0.0.1:8080/completition from some other domain (using the DevTools console tab). Any idea how I can resolve the error? Thanks!

jart commented 11 months ago

If you can tell me what header we need to add to the server to fix the CORS problem, then I'm happy to add it to the codebase. Thanks!
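
For reference, a cross-origin fetch that sends a JSON body triggers an OPTIONS preflight, so beyond Access-Control-Allow-Origin the server typically also has to answer that preflight with Access-Control-Allow-Methods and Access-Control-Allow-Headers. Here's a minimal Python sketch of that exchange; this is not llamafile's actual code, just an illustration of the headers a browser expects:

# Hypothetical standalone example showing the CORS headers a browser
# preflight needs; it does not reflect llamafile's server implementation.
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def _cors_headers(self):
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Access-Control-Allow-Methods", "GET, POST, OPTIONS")
        self.send_header("Access-Control-Allow-Headers", "Content-Type, Authorization")

    def do_OPTIONS(self):  # the preflight request the browser sends first
        self.send_response(204)
        self._cors_headers()
        self.end_headers()

    def do_POST(self):  # the actual cross-origin request
        body = b'{"ok": true}'
        self.send_response(200)
        self._cors_headers()
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

HTTPServer(("127.0.0.1", 8081), Handler).serve_forever()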

hiepxanh commented 9 months ago

slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255

The context shift issue @mneedham saw originates from this: https://github.com/ggerganov/llama.cpp/issues/3969

There is nothing to fix on the llamafile side. I just ran into this error today.