ShishirPatil / gorilla

Gorilla: Training and Evaluating LLMs for Function Calls (Tool Calls)
https://gorilla.cs.berkeley.edu/
Apache License 2.0

[feature] Guidance on Self-Hosting API Endpoints #146

Closed · kcolemangt closed this issue 7 months ago

kcolemangt commented 9 months ago

Is the feature request related to a problem? Self-hosting Gorilla's API endpoints for greater control and independence, rather than relying on Gorilla's hosted services.

Describe the solution you'd like: Guidance or documentation on how to independently set up and manage Gorilla's API endpoints in the OpenFunctions examples.

Additional context: This request stems from discussions in Issue #144. Insights from @ShishirPatil and @rgbkrk would be particularly valuable.

ShishirPatil commented 9 months ago

Thanks @kcolemangt for raising this. I just updated the README; basically, the prompt below is what we used to train the model, so using it should work. Let me know if you run into any issues.

import json

def get_prompt(user_query, functions=[]):
    # No functions: plain question prompt.
    if len(functions) == 0:
        return f"USER: <<question>> {user_query}\nASSISTANT: "
    # With functions: serialize the specs and append them after the question.
    functions_string = json.dumps(functions)
    return f"USER: <<question>> {user_query} <<function>> {functions_string}\nASSISTANT: "

I'll keep this open in case you run into any issues.

kcolemangt commented 9 months ago

Thank you, @ShishirPatil. This information helps, but could you share the full process for running the API server locally?

With that, we could update the example to use localhost instead of this line:

openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"

coolrazor007 commented 9 months ago

@kcolemangt with the OpenAI Python package you just have to point it at the OpenAI-compatible API of your local model. For instance, if you are trying this on your local computer, run Ollama with LiteLLM. Then edit the line you mentioned: openai.api_base = "http://localhost"

Edit "localhost" to whatever LiteLLM tells you to use

t-dettling commented 9 months ago

@kcolemangt with the OpenAI Python package you just have to point it at the OpenAI-compatible API of your local model. For instance, if you are trying this on your local computer, run Ollama with LiteLLM. Then edit the line you mentioned: openai.api_base = "http://localhost"

Edit "localhost" to whatever LiteLLM tells you to use

I think what @kcolemangt is looking for is how to run a server-like endpoint, not just something for local development. If that's not it, then I am curious how they are running the endpoint at http://luigi.millennium.berkeley.edu:8000/v1.

Ollama and LiteLLM are cool for development, but they are not really as good for server deployment. I tried using a model compiled for MLC (https://llm.mlc.ai/docs/deploy/rest.html#install-mlc-chat-package) and running it with the completions-compatible REST API. It does work, but it does not look like it supports passing the functions over the API; it just makes up a function based on the user input.

For example, when I run this code:

import openai

def get_gorilla_response(prompt="Call me an Uber ride type \"Plus\" in Berkeley at zipcode 94704 in 10 minutes",
                         model="gorilla-openfunctions-v0", functions=[]):
    openai.api_key = "EMPTY"  # key is unused by the local server
    openai.api_base = "http://192.168.30.27:8000/v1"  # local MLC REST endpoint
    try:
        completion = openai.ChatCompletion.create(
            model="gorilla-openfunctions-v1-q4f32_0",  # note: hardcoded; the model argument above is unused
            temperature=0.0,
            messages=[{"role": "user", "content": prompt}],
            functions=functions,
        )
        return completion.choices[0].message.content
    except Exception as e:
        print(e, model, prompt)

query = "Call me an Uber ride type \"Plus\" in Berkeley at zipcode 94704 in 10 minutes"
functions = [
    {
        "name": "Uber Carpool",
        "api_name": "uber.ride",
        "description": "Find suitable ride for customers given the location, type of ride, and the amount of time the customer is willing to wait as parameters",
        "parameters": [{"name": "loc", "description": "location of the starting place of the uber ride"},
                       {"name": "type", "enum": ["plus", "comfort", "black"],
                        "description": "types of uber ride user is ordering"},
                       {"name": "time", "description": "the amount of time in minutes the customer is willing to wait"}]
    }
]

resp = get_gorilla_response(query, functions=functions)
print(resp)
print("Done!")

The output is always: call_me_uber_ride_type("Plus", Berkeley, zipcode=94704, duration=10). So if MLC is not going to work, I was wondering if there is another recommended way to spin up an API server on a dedicated GPU server that I can self-host, like the one at Berkeley?

ChristianWeyer commented 8 months ago

Thank you, @ShishirPatil. This information helps, but could you share the full process for running the API server locally?

With that, we could update the example to use localhost instead of this line:

openai.api_base = "http://luigi.millennium.berkeley.edu:8000/v1"

I am also looking for this: running it locally as a drop-in replacement for OpenAI Function Calling in existing applications. @ShishirPatil

ramanv0 commented 7 months ago

I've submitted a PR with a guide on self-hosting the OpenFunctions model. It includes instructions and example code for setting up a local server using FastAPI and uvicorn, and for configuring a client to interact with this server using the OpenAI package. I would appreciate a review and any feedback!
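
For anyone who wants the rough shape of such a server before the PR lands, here is a minimal sketch (not the PR's code). It assumes the model is loaded with Hugging Face transformers under the gorilla-llm/gorilla-openfunctions-v0 name and exposes only the small slice of the OpenAI chat-completions response that the client snippets above actually read.

import json

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
# Model name and loading options are assumptions; adjust to your checkpoint and hardware.
generator = pipeline("text-generation", model="gorilla-llm/gorilla-openfunctions-v0")

class ChatRequest(BaseModel):
    model: str
    messages: list
    functions: list = []
    temperature: float = 0.0

def get_prompt(user_query, functions=[]):
    # Same prompt format the model was trained with (see above).
    if len(functions) == 0:
        return f"USER: <<question>> {user_query}\nASSISTANT: "
    return f"USER: <<question>> {user_query} <<function>> {json.dumps(functions)}\nASSISTANT: "

@app.post("/v1/chat/completions")
def chat_completions(req: ChatRequest):
    prompt = get_prompt(req.messages[-1]["content"], req.functions)
    output = generator(prompt, max_new_tokens=128, return_full_text=False)[0]["generated_text"]
    # Return just enough of the OpenAI response shape for completion.choices[0].message.content.
    return {"choices": [{"message": {"role": "assistant", "content": output}}]}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)

With a server like this running, the client only needs openai.api_base pointed at http://<host>:8000/v1, as in the snippets above.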

ramanv0 commented 7 months ago

Additionally, I created this Colab notebook to quickly test the example server in action. To run the server, I used an A100 GPU, which provided good response latency. To access the server running on the Colab instance remotely from my local machine, I used ngrok to tunnel the server ports from the Colab instance to public URLs. If you want to try out the server, you can get an ngrok auth-token here.
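
For reference, a rough sketch of the tunneling step using pyngrok (an assumption; the notebook may drive the ngrok CLI directly), with 8000 as the assumed server port:

from pyngrok import ngrok

ngrok.set_auth_token("YOUR_NGROK_AUTH_TOKEN")  # placeholder; paste your own token
tunnel = ngrok.connect(8000, "http")           # forward the port the API server listens on
print(tunnel.public_url)                       # use this URL (plus /v1) as openai.api_base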