abetlen / llama-cpp-python

Python bindings for llama.cpp
https://llama-cpp-python.readthedocs.io
MIT License

[Question] Drop in replacement for OpenAI #9

Closed MillionthOdin16 closed 1 year ago

MillionthOdin16 commented 1 year ago

I noticed that you mentioned your goal of creating a drop-in replacement for OpenAI. Awesome job! This is super helpful to have, especially with your demo using FastAPI.

I'm looking at langchain right now, and I see you have implemented most, if not all, of the OpenAI API, including streaming. llama.cpp got official langchain integration today, and I'm getting ready to get the integration working with streaming as literally a drop-in for OpenAI in langchain. Do you already have this done? Just trying to see what your goals are in the near future for this package :)

abetlen commented 1 year ago

Haha, thank you! I actually had no idea about @rjadr's PR or the official langchain integration, so to answer your question: no, I don't have anything else started in the way of langchain, but if you or anyone wants to take that on I have absolutely no objections.

I'll probably put together a separate issue to track short-term goals, but currently they are:

  • Simplify and test the implementation of stop sequences to ensure correctness in both simple and streaming modes.
  • Improve the packaging and release process (currently using setup.py, which I believe is a little outdated).
  • Add tests or some kind of build process now that presumably more people are going to be using this project.
  • Integrate the new kv_state API from llama.cpp and add chat_completion support (need to use that API or else latency is intolerable).

MillionthOdin16 commented 1 year ago

Yeah, a lot of us have been waiting for llama.cpp to get into langchain, so we're pumped! It's cool to have it tied directly into the library, but since your FastAPI server is so similar to OpenAI already, I think it would be super cool to pretty much just have a drop-in swap for the API endpoint. But yeah dude, nice job! All this stuff is moving pretty quickly, and a lot of projects are either kind of half implemented or they get their implementation and then disappear. Yours actually works really well!


abetlen commented 1 year ago

Thank you so much, I hope some other people get some good use out of this.

I see what you're asking now. Yes, I think in theory this should just work, because LangChain already supports changing the OpenAI endpoint so you can point to, e.g., Azure. I haven't tested this yet, but if there are any issues we can update the fastapi example accordingly.
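Untested, but the idea would be roughly the following (a minimal sketch, assuming a pre-1.0 openai package, a 2023-era langchain, and the fastapi example server running on localhost port 8000; all of those are assumptions):

# Sketch only: point the openai client (which LangChain uses under the hood)
# at the local llama-cpp-python FastAPI server instead of api.openai.com.
# The port (8000) and the /v1 prefix are assumptions based on the example server.
import os

os.environ["OPENAI_API_KEY"] = "sk-placeholder"             # not actually checked locally
os.environ["OPENAI_API_BASE"] = "http://localhost:8000/v1"  # read by the openai package at import time

from langchain.llms import OpenAI

llm = OpenAI(max_tokens=256)
print(llm("Q: What is the capital of France? A:"))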

MillionthOdin16 commented 1 year ago

Awesome! I'll look more into it tomorrow. But your endpoint is really close to being there, and it also just works 🔥🔥 Thanks!


abetlen commented 1 year ago

Would be cool to test interoperability against something like Helicone AI as well.

MillionthOdin16 commented 1 year ago

I'm down to help add compatibility with other formats as well. This is my first experience with FastAPI, and it's pretty impressive. I've been looking at the API some, and I'm wondering what approach you want to take for compatibility. Do you want to stay with just the embeddings and completions endpoints, or do you want to send failure responses for parts of the API that aren't implemented?

I have noticed some funkiness with the stop sequences that you mentioned. And I didn't realize the kv_state was implemented until I saw it in your code, haha.

add chat_completion support (need to use that api or else latency is intolerable)

I'm a bit confused by what you mean with this line. Are you differentiating between completion and chat_completion?

abetlen commented 1 year ago

I'm down to help add compatibility with other formats as well. This is my first experience with FastAPI, and it's pretty impressive. I've been looking at the API some, and I'm wondering what approach you want to take for compatibility. Do you want to stay with just the embeddings and completions endpoints, or do you want to send failure responses for parts of the API that aren't implemented?

Yes, I think following the official names for parameters, at least in the fastapi example, is ideal so we can allow for drop-in replacement. For now, an error response on not-implemented features is probably fine.
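Something along these lines, for example (a sketch only, not the actual server code; the specific routes below are just illustrative placeholders):

# Sketch: explicit "not implemented" responses for OpenAI-style routes the
# server doesn't support yet. Not the actual server code; the routes listed
# here are only illustrative.
from fastapi import FastAPI, HTTPException

app = FastAPI()

@app.post("/v1/edits")
@app.post("/v1/moderations")
def not_implemented():
    # 501 signals that the endpoint is recognized but unsupported here.
    raise HTTPException(status_code=501, detail="Not implemented by this server")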

I have noticed some funkiness with the stop sequences that you mentioned. And I didn't realize the kv_state was implemented until I saw it in your code, haha.

Yeah, it is, but it seems that the upstream implementation for restoring model state is still not complete.

add chat_completion support (need to use that api or else latency is intolerable)

I'm a bit confused by what you mean with this line. Are you differentiating between completion and chat_completion?

Yes, see the OpenAI API Reference.

I have an initial implementation for chat in the fastapi example but I haven't pushed anything yet because the performance is really slow for chats.

abetlen commented 1 year ago

@MillionthOdin16 Just a heads up, I just pushed the initial chat completion API and updated the fastapi server example. Chat completion will be pretty slow but should be working.
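On the Python side, usage should look roughly like this (a sketch based on OpenAI's messages format; the exact method name and signature in the pushed code may differ):

# Sketch of the new chat completion call, mirroring OpenAI's messages format.
# The method name and response shape here are assumptions, not guaranteed to
# match the code that was just pushed.
from llama_cpp import Llama

llm = Llama(model_path="./models/ggml-model.bin")

response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Name the planets in the solar system."},
    ],
    max_tokens=128,
)
print(response["choices"][0]["message"]["content"])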

MillionthOdin16 commented 1 year ago

@abetlen Awesome 🔥 So I was working a lot on recreating the functionality of main today, and noticed a couple differences that I wasn't expecting. I want to make sure I understand how you're implementing things. I'll explain my mental model and you can correct it if I'm wrong.

Init / Prompt

The initialization of the model made sense, except I found myself trying to initialize the llm with the prompt. I don't know if it's a habit coming from langchain, or from running the llama.cpp executable, but I'm used to loading the prompt at initialization to get it processing right off the bat.

prompt = """
The following is a conversation between a user and an AI assistant. The assistant is helpful, creative, clever, and very friendly.
User: Hello, who are you?
AI: I am an AI Assistant, here to help you by providing clear and concise responses. How can I help you today?
"""

from llama_cpp import Llama

# Initialize the Llama model with the specified model path and settings
llm = Llama(model_path=r"D:\models\gpt4all\gpt4all-lora-unfiltered-quantized-llama-nmap.bin",  # raw string for the Windows path
            f16_kv=True,
            use_mlock=False,  # Set to False if you get an error on an unsupported platform
            n_ctx=1024,       # Tokens to keep in 'memory' (Max: 2048)
            n_threads=4,      # Number of threads to use for inference
            # ---> prompt=prompt,  # Start the conversation with an initial prompt
            )

Completions

After initialization, I attempted the following, basically generating a completion for each message the user sent to the bot. I noticed that the model didn't seem to maintain context of the previous messages sent during the session, and I only got more predictable responses when I sent the entire prompt and message history with every completion 😢 haha. This took a significant amount of time, and I'm thinking there's a way we can maintain context once we fix it.

# Start an infinite loop to prompt the user for input and generate a response
while True:
    # Prompt the user for input
    user_input = input("\nUser: ")

    # Use the Llama model to generate a response
    stream = llm.create_completion(
        user_input + "\nAI: ",  # Add the user's input and prep for the AI's response
        max_tokens=256,          # Max number of tokens to generate for response
        stop=["User:"],         # Stop generating tokens when one of these is encountered
        stream=True,            # Return a generator to stream the response
    )

    # Print this in chat to indicate the start of the AI's response
    # This is not necessary, but it helps to make the output more readable
    print("AI: ", end='')

    # Extract and print the chatbot's response
    for output in stream:
        # Get the 'choices' list from the output dictionary
        choices = output.get('choices', [])
        # Extract the 'text' value from the first element in the choices list
        chatbot_response = choices[0].get('text', '') if choices else ''
        print(chatbot_response, end='')

I hope that makes sense?
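For reference, the "resend everything" version looks roughly like this (a sketch only; llm and prompt are as defined above, and the exact formatting of the history is an assumption):

# Sketch of the "resend everything" workaround: keep the transcript in
# Python and pass the whole thing to create_completion every turn. Slow,
# since the full prompt is re-evaluated each time, but context is kept.
history = prompt  # the initial prompt text defined above

while True:
    user_input = input("\nUser: ")
    history += "User: " + user_input + "\nAI: "

    out = llm.create_completion(
        history,
        max_tokens=256,
        stop=["User:"],
    )
    reply = out["choices"][0]["text"]
    print("AI:", reply)

    history += reply + "\n"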

My understanding of the kv_store was that it was something initially implemented to help improve performance with the prompt, so it didn't have to be tokenized over and over between model "sessions" (meaning the context is only lost once the process is stopped). So you compute it once, then just load in previously tokenized prompts when the llm is initialized, if they're available.

Let me know your thoughts.

❓ On a side note, are you opposed to opening up the Discussions section for the repo? It might be a better place for planning and discussing topics wider than the scope of specific issues.

Also, I just thought about the point about testing you brought up earlier. It's definitely a good plan, considering langchain people are likely using this and we don't want to push breaking changes :)

abetlen commented 1 year ago
  1. No, in many cases (e.g. tokenizing text, generating embeddings) you'll want to defer the prompt and pass it directly to those specific methods.
  2. That's correct, __call__ and create_completion reset the state of the model. If you want to get around this and are willing to work a little lower level, you should check out how the generate method works. Essentially it calls reset, then just calls eval and sample in a loop.

PS: I've opened up Discussions so yes, feel free to post there!

MillionthOdin16 commented 1 year ago

Okay, I'm looking into this more, trying to figure out how it works. I'm having trouble understanding why llama.cpp main can keep context between messages, but the context here gets cleared. My understanding from llama.cpp was the following:

  • The model initializes, gets the prompt, parses it, then awaits user input.
  • When user input is received, it parses it and adds it to the context, then processes the completion and returns it to the user. It continues this input loop until the process dies.
  • Throughout that process the context grows (until it's full and has to start dropping data).

If this isn't the case, do you know how llama.cpp main gets by without having to send the whole history with each message? Sorry haha, my mental model is kinda broken at this point.

abetlen commented 1 year ago

So I'm actually explicitly clearing the context by calling reset every time __call__ or create_completion is called. But you don't have to; I only do this because I wanted the API to match OpenAI's.

Check out this example; I think it'll give you a good idea of the core loop. It doesn't clear the history, it'll just keep generating after your initial prompt, and if you Ctrl-C during generation you can add more text and then make it keep going.

import llama_cpp

llama = llama_cpp.Llama(model_path="../models/ggml-model.bin")

while True:
    text = input()
    tokens = llama.tokenize(text.encode("utf-8"))
    llama.eval(tokens)
    try:
        while True:
            token = llama.sample(top_k=40, top_p=0.95, temp=0.8, repeat_penalty=1.1)
            if token == llama.token_eos():
                break
            print(llama.detokenize([token]).decode("utf-8"), end="", flush=True)
            llama.eval([token])
    except KeyboardInterrupt:
        pass
    print()
MillionthOdin16 commented 1 year ago

Okay, I didn't realize it was being cleared intentionally. It makes sense now. I think I'm understanding more of what you mean...

So for completions, which is what you had implemented originally, it's just a one-off: you send in the message and it will complete it based on however many tokens you want. I agree with clearing the context in that case.

Now for implementing the chat part of the API, wouldn't we pretty much have that done if we weren't resetting the context each message?

Is the issue on our side, with just deciding when we need to store or reset context, or is it on the llama.cpp side, where there's some functionality that we don't have that we need?

Thanks for the explanation. I initially got the impression that there was no ability to keep context between messages, and that was confusing.


abetlen commented 1 year ago

Is the issue on our side, with just deciding when we need to store or reset context, or is it on the llama.cpp side, where there's some functionality that we don't have that we need?

It's on the llama.cpp side; there's a soon-to-be-added llama_state API which will allow us to create a caching layer so we don't have to recompute the entire prompt before generating new tokens.
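Conceptually the caching layer would work something like this (purely illustrative; save_state and load_state are hypothetical stand-ins for the upcoming llama_state API, they don't exist yet):

# Purely illustrative: cache the model state after evaluating a prompt
# prefix so it doesn't have to be recomputed next time. save_state() and
# load_state() are hypothetical names standing in for the upcoming
# llama_state API.
state_cache = {}  # prompt prefix -> saved model state

def eval_with_cache(llm, prompt_text):
    if prompt_text in state_cache:
        llm.load_state(state_cache[prompt_text])     # hypothetical
    else:
        llm.reset()
        llm.eval(llm.tokenize(prompt_text.encode("utf-8")))
        state_cache[prompt_text] = llm.save_state()  # hypothetical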

abetlen commented 1 year ago

[screenshot]

Got the OpenAI Python library to use the FastAPI server. I'll have to check, but this might make it fairly easy to port to every language supported by their library.
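For the record, the gist of it is just this (a sketch, assuming the pre-1.0 openai package and the server on localhost port 8000):

# Sketch: the official openai package (pre-1.0 interface) pointed at the
# local FastAPI server. Port, path, and the ignored model name are
# assumptions based on the example server's defaults.
import openai

openai.api_key = "sk-placeholder"
openai.api_base = "http://localhost:8000/v1"

completion = openai.Completion.create(
    model="local-model",  # assumption: the local server doesn't validate this
    prompt="Q: Name the planets in the solar system. A:",
    max_tokens=64,
)
print(completion["choices"][0]["text"])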

MillionthOdin16 commented 1 year ago

So you're in favor of clearing the context every message, even in situations where we want to maintain context during conversations?

I get what you're saying about being able to save and load conversation histories into context more quickly using stores. But I can also see situations where some models will want to have a continuous context, and clearing and loading the state each message would become inefficient.

I'm thinking of situations where you have a language model carrying out tasks that isn't interacting with a person but is interacting with tools and APIs.

I can see clearing context when you want an independent completion, or when you have a different user that needs a separate context, but I think there are also some cases we might want to consider where it makes sense for the context not to auto-reset.

If I'm missing something completely let me know. Thanks for the explanations.

Got the OpenAI Python library to use the FastAPI server. I'll have to check, but this might make it fairly easy to port to every language supported by their library.

^^^ Awesome job with this!! 🔥

abetlen commented 1 year ago

I can see the use for that for sure. Basically you want something that does the eval+sample loop until some condition is met and then returns the generated text?

We can definitely do that for some conditions, like, say, max_tokens and/or the eos token, but stop sequences won't work unless we restore the state. Otherwise the model will have essentially generated the stop sequence tokens, and they'll be stored in the internal state even if we don't return them to the user.
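To be clear, the string side of stop sequences is easy enough; it's the model's internal state that's the problem. Roughly (truncation only, no state handling):

# Sketch of the user-facing half only: cut the generated text at the first
# stop sequence. The tokens that matched the stop sequence have still been
# eval'd, so they remain in the model's internal state unless it's restored.
def truncate_at_stop(text, stop_sequences):
    cut = len(text)
    for s in stop_sequences:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("Sure!\nUser: next question", ["User:"]))  # -> "Sure!\n"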

MillionthOdin16 commented 1 year ago

I'm still trying to figure out how the main llama.cpp gets around this. I don't know why I can't figure out what I'm missing. Maybe I'm thinking about interactive mode which is handled by the couple snippets below.

// replace end of text token with newline token when in interactive mode
if (id == llama_token_eos() && params.interactive && !params.instruct) {
    id = llama_token_newline.front();
    if (params.antiprompt.size() != 0) {
        // tokenize and inject first reverse prompt
        const auto first_antiprompt = ::llama_tokenize(ctx, params.antiprompt.front(), false);
        embd_inp.insert(embd_inp.end(), first_antiprompt.begin(), first_antiprompt.end());
    }
}
// end of text token
if (embd.back() == llama_token_eos()) {
    if (params.instruct) {
        is_interacting = true;
    } else {
        fprintf(stderr, " [end of text]\n");
        break;
    }
}

Are you familiar with interactive mode? This might be what I'm thinking of, where the context isn't cleared, because I can't find anywhere in the code where the context is reset during a chat (interactive mode). And it's also not having to reload anything as far as I can tell; there definitely isn't a delay :/ I'll keep thinking about this.

Definitely cool stuff with the API though! There are tons of projects that interface with OpenAI, and it would be cool to drop in my own models. 👍

❗ Also, this is just a side note, but we should make sure we add the space in our implementation:

// Add a space in front of the first character to match OG llama tokenizer behavior
params.prompt.insert(0, 1, ' ');

// tokenize the prompt
auto embd_inp = ::llama_tokenize(ctx, params.prompt, true);
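On the Python side the equivalent would be roughly the following (a sketch; whether the bindings already handle this internally is something to verify, not assume):

# Sketch: prepend a space before tokenizing to match the OG llama tokenizer
# behavior shown above. Verify whether the bindings already do this
# internally before adding it.
def tokenize_prompt(llm, prompt):
    return llm.tokenize((" " + prompt).encode("utf-8"))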
abetlen commented 1 year ago

So in the case of interactive mode the model still sees the antiprompt / reverse prompt because it's just part of the standard conversation. A stop sequence is explicitly not returned by the model (in LangChain and the OpenAI APIs).