Maxamed closed this issue 11 months ago
Here's an example of how you can talk to the OpenAI Completions API provided by your llamafile server.
Note: Due to a bug in the most recent 0.2.1 release, this example currently only works if you build llamafile-server at HEAD. You can do that by downloading the cosmocc compiler and putting it on your $PATH, as discussed in the README. Then run:
make -j8
That builds the following program, which you'd then run:
o//llama.cpp/server/server -m ~/weights/llava-v1.5-7b-Q4_K.gguf
You now have a llamafile server running on localhost port 8080. You can now use its completions API. Here is the quickstart tutorial example that OpenAI provides at https://platform.openai.com/docs/quickstart
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
},
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'
You could put that in a shell script for example, and see something like the following:
jart@studio:~$ ~/scratch/completions-client.sh
{"choices":[{"finish_reason":"stop","index":0,"message":{"content":"In programming, recursion is a tool divine,\nA way to solve problems, both big and fine.\nIt's like a loop, but with a twist,\nA function that calls itself, and never quits.\n\nIt starts with an initial case,\nA base to build upon, a place to begin.\nThen, it calls itself, and adds a new case,\nUntil the problem is solved, or the stack is too vast.\n\nRecursion is powerful, and can be quite deep,\nBut with care and practice, it can be a treat.\nIt's a tool that can solve problems, both small and large,\nAnd it's a concept that's worth mastering, to make your code go far.\n\nSo if you're stuck on a problem, and you can't seem to find a way,\nTry recursion, and see what it can do today.\nIt might just be the tool you need, to solve your problem and win,\nWith recursion, you'll be programming, like a true machine.","role":"assistant"}}],"created":1701666907,"id":"chatcmpl-sho1yAtTvl32sAymCUMZvPIYwvm6C1hf","model":"gpt-3.5-turbo-0613","object":"chat.completion","usage":{"completion_tokens":227,"prompt_tokens":76,"total_tokens":303}}
You now have your response JSON. It's not very readable on the shell. It's assumed you'd be using your programming language of choice, e.g. Python, and use its appropriate http and json libraries (or some high-level openai client library veneer) to do the actual talking to the server.
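For the Python route mentioned above, here is a minimal sketch using only the standard library. It assumes the server is listening at the same address as the curl example; the helper names (`build_payload`, `extract_reply`, `chat`) are made up for illustration and are not part of llamafile or any OpenAI client library.

```python
import json
import urllib.request

# Assumed endpoint: the same address the curl example above talks to.
API_URL = "http://localhost:8080/v1/chat/completions"


def build_payload(system_prompt, user_prompt, model="gpt-3.5-turbo"):
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    }


def extract_reply(response):
    """Pull the assistant's text out of a chat.completion response dict."""
    return response["choices"][0]["message"]["content"]


def chat(system_prompt, user_prompt, url=API_URL):
    """POST the payload to the server and return the assistant's reply."""
    body = json.dumps(build_payload(system_prompt, user_prompt)).encode("utf-8")
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Blocks until the server has generated the full response.
    with urllib.request.urlopen(request) as resp:
        return extract_reply(json.load(resp))
```

With the server running, something like `print(chat("You are a poetic assistant.", "Compose a poem that explains recursion."))` should print the poem text rather than the raw JSON.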
This concludes the tutorial. Thanks for reaching out, and enjoy llamafile!
Thank you for prompt reply.
I've tried the above, but the server crashes:
{"timestamp":1701667611,"level":"INFO","function":"main","line":3039,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50501,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50502,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50503,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
{"timestamp":1701667611,"level":"INFO","function":"log_server_request","line":2591,"message":"request","remote_addr":"127.0.0.1","remote_port":50501,"status":200,"method":"GET","path":"/index.js","params":{}}
llama.cpp/server/json.h:21313: assert(it != m_value.object->end()) failed (cosmoaddr2line /Users/jj/Projects/llamafile/llava-v1.5-7b-q4-server.llamafile 1000000fe3c 1000001547c 100000162e8 10000042748 1000004ffdc 10000050cb0 1000005124c 100000172dc 1000001b370 10000181e78 1000019d3d0)
[1] 82671 abort ./llava-v1.5-7b-q4-server.llamafile
I tried this as well, using Mistral weights:
o/llama.cpp/server/server \
-m mistral-7b-instruct-v0.1.Q4_K_M.gguf
Available slots:
-> Slot 0 - max context: 512
llama server listening at http://127.0.0.1:8080
loading weights...
{"timestamp":1701671568,"level":"INFO","function":"main","line":3045,"message":"HTTP server listening","hostname":"127.0.0.1","port":8080}
all slots are idle and system prompt is empty, clear the KV cache
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62091,"status":200,"method":"GET","path":"/completion.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62090,"status":200,"method":"GET","path":"/index.js","params":{}}
{"timestamp":1701671568,"level":"INFO","function":"log_server_request","line":2592,"message":"request","remote_addr":"127.0.0.1","remote_port":62092,"status":200,"method":"GET","path":"/json-schema-to-grammar.mjs","params":{}}
If I try to run the cURL request:
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-3.5-turbo",
"messages": [
{
"role": "system",
"content": "You are a poetic assistant, skilled in explaining complex programming concepts with creative flair."
},
{
"role": "user",
"content": "Compose a poem that explains the concept of recursion in programming."
}
]
}'
It sort of hangs there, not getting any response. If I go over to the llamafile server tab I see the following output:
slot 0 is processing [task id: 0]
slot 0 : kv cache rm - [0, end)
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
slot 0: context shift - n_keep = 0, n_left = 510, n_discard = 255
This works:
curl http://127.0.0.1:8080/v1/chat/completions -H "Content-Type: application/json" -d '{ "stop": null, "messages": [{ "role": "user", "content": "tell me history of canada" }] }'
The remaining issue is CORS, if you want to interact with it programmatically.
@Maxamed As mentioned earlier, that eos crash will happen unless you build from source right now. What exactly is the issue with CORS? As far as I know, the server always sends Access-Control-Allow-Origin: *.
I am also getting a CORS error if I send a fetch request to http://127.0.0.1:8080/completition from some other domain (using the Dev tools console tab). Any idea how I can resolve the error? Thanks!
If you can tell me what header we need to add to the server to fix the CORS problem, then I'm happy to add it to the codebase. Thanks!
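One possible way to narrow this down, offered as a sketch rather than a diagnosis: a browser fetch that sends `Content-Type: application/json` is not a "simple" request under the CORS rules, so the browser first sends an OPTIONS preflight and expects the response to carry both `Access-Control-Allow-Origin` and an `Access-Control-Allow-Headers` value that covers `content-type`. Whether the llamafile server answers that preflight is not established in this thread. The hypothetical helper below only inspects a dictionary of response headers (copied from the dev tools network tab, say) and reports which of the two is missing.

```python
def missing_cors_headers(response_headers, origin, requested_headers=("content-type",)):
    """Return the names of CORS response headers that look missing or insufficient.

    response_headers: dict of header name -> value from the preflight response.
    origin: the page's origin, e.g. "http://example.com".
    requested_headers: non-safelisted request headers the page wants to send.
    """
    # Header names are case-insensitive, so normalise them first.
    headers = {name.lower(): value for name, value in response_headers.items()}
    problems = []

    # The origin must be allowed explicitly or via the wildcard.
    allow_origin = headers.get("access-control-allow-origin", "")
    if allow_origin not in ("*", origin):
        problems.append("Access-Control-Allow-Origin")

    # Every requested header must be listed, unless the wildcard is present.
    allowed = {
        item.strip().lower()
        for item in headers.get("access-control-allow-headers", "").split(",")
    }
    wanted = {name.lower() for name in requested_headers}
    if "*" not in allowed and not wanted <= allowed:
        problems.append("Access-Control-Allow-Headers")

    return problems
```

If the server only sends `Access-Control-Allow-Origin: *`, this returns `["Access-Control-Allow-Headers"]`, which would be consistent with curl working while a browser fetch with a JSON content type fails.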
@mneedham The context shift issue originates from this: https://github.com/ggerganov/llama.cpp/issues/3969
There is nothing to do on the llamafile side. I just got this error today.
How do I connect to it using the API? I've installed it and it works great, but I want to connect to it using the API.