eth-sri / lmql

A language for constraint-guided and efficient LLM programming.
https://lmql.ai
Apache License 2.0

Override OpenAI API base with llama.cpp mock server #209

Open spyderman4g63 opened 1 year ago

spyderman4g63 commented 1 year ago

I have a local server running an OpenAI-compatible API. I simply want all requests that normally go to api.openai.com:443 to go to localhost:8000 instead.

I did see that you should be able to override models for Azure. I was hoping to use that, but it still seems to make calls to openai.com.

import lmql

@lmql.query
async def test():
    '''lmql
    argmax "Hello [WHO]" from my_model
    '''

# intended to route all requests to the local server instead of api.openai.com
my_model = lmql.model(
    "openai/gpt-3.5-turbo",
    api_base="http://localhost:8000"
)

items = await test()  # run from an async context (e.g. a notebook)
items[0]

I still see it getting errors from OpenAI:

Failed with Cannot connect to host api.openai.com:443 ssl:default [nodename nor servname provided, or not known]
OpenAI API: Underlying stream of OpenAI complete() call failed with error <class 'aiohttp.client_exceptions.ClientConnectorError'> Cannot connect to host api.openai.com:443 ssl:default [nodename nor servname provided, or not known] Retrying... (attempt: 0)

Is there a way to override the OpenAI URL?

lbeurerkellner commented 12 months ago

Hi there :) api_base is reserved for Azure OpenAI configuration only. To change the general endpoint, you can just specify endpoint=<ENDPOINT>. This should probably be aligned so that api_base can also be used for non-Azure endpoints; thanks for raising this.
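
For the snippet above, that would look roughly like this (untested sketch; note that, as discussed further down in this thread, the endpoint is expected to include the full completions path, not just the host):

import lmql

# sketch: route the OpenAI backend to the local server via `endpoint`
# (full chat-completions path included; the bare host alone will 404, see below)
my_model = lmql.model(
    "openai/gpt-3.5-turbo",
    endpoint="http://localhost:8000/v1/chat/completions"
)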

What mock server implementation are you using? In my experience, true OpenAI API compliance is rare, so there may be other issues, as LMQL assumes e.g. working batching and logit_bias support. Let me know how it goes.

spyderman4g63 commented 12 months ago

I'm using llama.cpp's server. I wasn't sure if I could pass all of the params I'm using to lmql's server; I couldn't find any documentation on it. I use this command to host a version of Llama 70B locally:

export N_GQA=8 && python3 -m llama_cpp.server --model /Users/jward/Projects/llama.cpp/models/llama-2-70b-orca-200k.Q5_K_M.gguf --use_mlock True --n_gpu_layers 1

(llama-cpp-python has a bug with n_gqa, so I have to set it via the env var.)

In general, I'm able to use OpenAI's Python library if I override openai.api_base.
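
For reference, with the pre-1.0 openai package that override looks roughly like this (sketch; the model name and prompt are just placeholders):

import openai

# sketch, assuming the pre-1.0 openai package: point the client at the
# local llama.cpp server instead of api.openai.com
openai.api_key = "fakekey"
openai.api_base = "http://localhost:8000/v1"

completion = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Say 'this is a test'"}],
)
print(completion["choices"][0]["message"]["content"])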

spyderman4g63 commented 12 months ago

This is probably outside the scope of this issue, but I do see some activity with this code:

import lmql
import os
os.environ['OPENAI_API_KEY'] = 'fakekey'

@lmql.query
async def test():
    '''lmql
    argmax 
        "Say 'this is a test':[RESPONSE]" 
    from 
        lmql.model("gpt-4", endpoint="http://localhost:8000")
    '''

items = await test()
items[0]

The server log shows it trying to post to /v1 and getting a 404, which is the response I expect.

INFO:     ::1:52619 - "POST /v1 HTTP/1.1" 404 Not Found

OpenAI's Python library doesn't try to call /v1; it hits /v1/chat/completions and works fine.

INFO:     ::1:52768 - "POST /v1/chat/completions HTTP/1.1" 200 OK

lbeurerkellner commented 12 months ago

Yes, the endpoint parameter expects the full path of the resource to hit for completions, so try appending the required /v1/chat/completions. For your lmql serve-model command, you have to prepend the llama.cpp: prefix, otherwise LMQL will try to load your model via transformers. See also https://docs.lmql.ai/en/stable/language/llama.cpp.html#model-server.

spyderman4g63 commented 12 months ago

I think I was able to get the model to serve using the command below, though it doesn't log any output:

lmql serve-model llama.cpp:/Users/jward/Projects/llama.cpp/models/llama-2-13b-chat.Q8_0.gguf --use_mlock True --n_gpu_layers 1 

It looks like the tokenizer isn't correct. Is there a way to set that?

AssertionError: Cannot set dclib tokenizer to hf-huggyllama/llama-7b because it is already set to tiktoken-gpt2 (cannot use multiple tokenizers in the same process for now)

lbeurerkellner commented 11 months ago

The latest main now finally supports mixing tokenizers in the same process. I am not sure, however, how this will work with the OpenAI endpoint parameter; I think we hard-code the GPT tokenizers there. I have never seen an alternative implementation of the OpenAI API that actually implements logit_bias, so this never came up before.

Did your lmql serve-model command end up working? Note that there is a --verbose option. I have seen issues like this before when GPU support was not compiled correctly. Can you test it, e.g. with llama-cpp-python directly?
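
For example, something along these lines (rough sketch; model path and parameters taken from your earlier commands):

from llama_cpp import Llama

# rough sketch: load the model directly with llama-cpp-python to verify that
# it loads correctly and that layers are actually offloaded to the GPU
llm = Llama(
    model_path="/Users/jward/Projects/llama.cpp/models/llama-2-13b-chat.Q8_0.gguf",
    n_gpu_layers=1,
    use_mlock=True,
    verbose=True,  # prints backend and offloading info on load
)
out = llm("Q: What is the capital of France? A:", max_tokens=16)
print(out["choices"][0]["text"])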

spyderman4g63 commented 11 months ago

I appreciate your patience with me as I jumped between a few topics. This command ended up working for me:

lmql serve-model llama.cpp:/Users/jward/Projects/llama.cpp/models/llama-2-70b-orca-200k.Q5_K_M.gguf --use_mlock True --n_gpu_layers 1 --n_gqa 8 --n_ctx 4096

Verbose output is also very helpful. Thanks again.

tranhoangnguyen03 commented 11 months ago

I'm running into the same issue.

import os
import lmql

# set the API type based on whether you want to use a completion or chat endpoint
os.environ['OPENAI_API_TYPE'] = 'azure'
os.environ['OPENAI_API_BASE'] = "https://8b78-34-125-163-134.ngrok-free.app"
os.environ['OPENAI_API_KEY'] = "fake_key"

@lmql.query(model='v1')
async def chain_of_thought(question):
    '''lmql
    # Q&A prompt template
    "Q: {question}\n"
    "A: Let's think step by step.\n"
    "[REASONING]"
    "Thus, the answer is:[ANSWER]."

    # return just the ANSWER to the caller
    return ANSWER
    '''

res = await chain_of_thought('Today is the 12th of June, what day was it 1 week ago?')
print(res)

This fails with:

TokenizerNotAvailableError: Failed to locate a suitable tokenizer implementation for 'v1' (Make sure your current environment provides a tokenizer backend like 'transformers', 'tiktoken' or 'llama.cpp' for this model)

If I switch to model='gpt-4' then the llama_cpp server outputs this:

INFO:     35.230.48.2:0 - "POST /openai/deployments/gpt-4/completions?api-version=2023-05-15 HTTP/1.1" 404 Not Found

lbeurerkellner commented 11 months ago

@tranhoangnguyen03 The first error you get here indicates that LMQL cannot automatically derive a tokenizer from the model name v1. You can fix this by using an lmql.model("v1", tokenizer=<tokenizer name>) object as the model instead.

tranhoangnguyen03 commented 11 months ago

I tried:

@lmql.query(model=
    lmql.model("v1", 
        tokenizer='HuggingFaceH4/zephyr-7b-alpha', 
        api_type="azure",
        api_base="https://932d-34-141-210-25.ngrok-free.app"
    )
)

And got this:

RuntimeError: LMTP client encountered an error: Exception Server disconnected attempting to communicate with lmtp endpoint: http://localhost:8080/. Please check that the endpoint is correct and the server is running.

Then I tried:

@lmql.query(model=
    lmql.model("v1", 
        tokenizer='HuggingFaceH4/zephyr-7b-alpha', 
        endpoint="https://932d-34-141-210-25.ngrok-free.app"
    )
)

And got this error:

RuntimeError: LMTP client encountered an error: Exception 403, message='Invalid response status', url=URL('https://932d-34-141-210-25.ngrok-free.app/') attempting to communicate with lmtp endpoint: https://932d-34-141-210-25.ngrok-free.app/. Please check that the endpoint is correct and the server is running.

Am I using the wrong kwarg here?

lbeurerkellner commented 11 months ago

Ah yes, you have to use openai/v1 so that LMQL treats your model as an OpenAI model. Without this prefix, it will attempt to load v1 as a HuggingFace model.
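
Untested sketch of what that could look like with your setup (the endpoint includes the full completions path, as discussed above):

import lmql

m = lmql.model(
    "openai/v1",
    tokenizer="HuggingFaceH4/zephyr-7b-alpha",
    endpoint="https://932d-34-141-210-25.ngrok-free.app/v1/completions"
)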

In general, are you also trying to use a llama.cpp-based OpenAI mock endpoint? With what model are you trying this? Let me know, so I can try to reproduce your setup here.

tranhoangnguyen03 commented 11 months ago

That is correct. I'm running a zephyr-7b-alpha.Q6_K.gguf model on Google Colab, which I tunnel to a public ngrok URL.

Here's an image showing the API endpoints (screenshot not reproduced here).

Here's an example Curl call:

curl -X 'POST' \
  'https://932d-34-141-210-25.ngrok-free.app/v1/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "\n\n### Instructions:\nWhat is the capital of France?\n\n### Response:\n",
  "stop": [
    "\n",
    "###"
  ]
}'

lbeurerkellner commented 11 months ago

I managed to get the server connection to work, via

argmax(verbose=True)
    "[[INST]]Say 'this is a test':[[/INST]]\n[RESPONSE]" 
from
    lmql.model("openai/v1", tokenizer='gpt2', endpoint="<host>/2600/v1/completions")
where 
    len(TOKENS(RESPONSE)) < 120 and STOPS_BEFORE(RESPONSE, "[INST]") and not "\n" in RESPONSE

You don't have to use the Azure API configuration; you can just specify the endpoint (which includes the /v1/completions suffix). Unfortunately, however, this does not work as intended, since LMQL relies on the echo parameter available in the official OpenAI API but not in the mock implementation llama.cpp provides. At least from the logs I can see that llama.cpp does not respect this parameter, i.e. it does not echo the prompt tokens.

This means LMQL does not support the mock implementation llama.cpp provides for now, because it is not fully compliant. Hopefully this can be fixed on their end; as far as I can tell from skimming the code, it does seem to implement logit_bias properly, which is typically the harder thing to get right with these kinds of mock APIs. Maybe experimenting some more with echo and then creating an issue over there would be a good way to resolve this.
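
A quick way to probe this (rough sketch using requests against your ngrok URL):

import requests

# sketch: check whether the llama.cpp server honours the `echo` parameter,
# i.e. whether the returned completion text starts with the prompt itself
resp = requests.post(
    "https://932d-34-141-210-25.ngrok-free.app/v1/completions",
    json={"prompt": "Hello", "max_tokens": 1, "echo": True},
    timeout=60,
)
print(resp.json()["choices"][0]["text"])  # should start with "Hello" if echo is respected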

Workaround: Until then, I encourage you to use LMQL's official llama.cpp backend, which can also be served via Colab and then accessed locally. In my experiments this works seamlessly using

lmql.model("llama.cpp:/home/luca/repos/models/zephyr-7b-alpha.Q6_K.gguf", endpoint="<HOST>:<PORT>", tokenizer="HuggingFaceH4/zephyr-7b-alpha")

and

lmql serve-model llama.cpp:/home/luca/repos/models/zephyr-7b-alpha.Q6_K.gguf --n_gpu_layers 30 --host <HOST> --port <PORT>

If you can't launch this via the command line, you can also use lmql.serve; see this snippet for details.

tranhoangnguyen03 commented 11 months ago

@lbeurerkellner I've read over the Language Model Transport Protocol (LMTP) documentation, and it seems to me that the server is designed to work with a client deployed locally on the same machine? Does that mean there's no support for an external model endpoint at the moment?

lbeurerkellner commented 11 months ago

LMTP is typically used with the client and server on different machines (e.g. the server being some beefy GPU machine and the client a laptop). Note, however, that LMTP does not implement authentication, so you will want to protect the communication with e.g. an SSH tunnel.

ebudmada commented 7 months ago

Did anyone manage to make this work, i.e. the llama.cpp server with LMQL? llama-cpp-python is full of bugs; using the llama.cpp server would solve a lot of problems.

Thank you