blockentropy / ml-client

Machine Learning Clients for Open Source Infra

Adding outlines #1

Closed isamu-isozaki closed 6 months ago

isamu-isozaki commented 6 months ago

This is a draft PR. Currently, there are 3 main parts left to do to make this work:

isamu-isozaki commented 6 months ago

I think I'll pull from https://github.com/outlines-dev/outlines/pull/781, which will probably solve 1 and 3.

edk208 commented 6 months ago

thanks, looking good so far... it's nice that outlines already supports exl2

isamu-isozaki commented 6 months ago

@edk208 Some notes

  1. I think I finished the main logic
  2. The logic of first running a preprocess pass and then generating tokens is, unfortunately, not currently supported by outlines. I can make an outlines fork that supports it, but I think it would be a bit hacky. Does doing the preprocess first across all prompts offer better performance?
  3. The current script doesn't support proper streaming, but I can make it generate one token at a time and stream it using the PR mentioned above. That functionality is not in the main branch of outlines yet, though, so it is more experimental.

So, in summary, I think these are all the changes that can work with the main branch of outlines so far. Happy to get feedback!
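For reference, a minimal sketch of the outlines API these changes lean on. This uses the transformers backend purely for illustration (the PR itself targets the exl2 backend), and the streaming call only illustrates the token-at-a-time idea mentioned above.

# Minimal sketch of choice-constrained generation with outlines.
# The transformers backend is shown here only for illustration; the PR wires up exl2.
import outlines

model = outlines.models.transformers("microsoft/Phi-3-mini-128k-instruct")
generator = outlines.generate.choice(model, ["Bob", "Fred"])

# One-shot generation: returns one of the allowed choices.
answer = generator("Who is more impressive? Bob or Fred?")
print(answer)

# Token-at-a-time streaming, where the installed outlines version supports it.
for token in generator.stream("Who is more impressive? Bob or Fred?"):
    print(token, end="", flush=True)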

isamu-isozaki commented 6 months ago

I'll do the streaming idea tonight

edk208 commented 6 months ago

what do you mean by the "logic of first doing preprocess and then generating tokens"? do you mean the first model.forward with preprocess_only = True?

isamu-isozaki commented 6 months ago

@edk208 Sorry for the confusion, and yes. To my understanding, the process is:

  1. We get the prompts -> tokenize+preprocess in exllama2
  2. Generate 1 token for each of those prompts
  3. Stop any sequence that produces an end-of-sequence token. Put it all in a while loop until all prompts and prompt ids are exhausted.

I think step 1 is technically not possible in outlines, but steps 2 and 3 might be possible with the above PR. Let me try it tomorrow.
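To make the three steps concrete, here is a rough greedy-decoding sketch of that loop against exllamav2 directly, shown for a single prompt (the server would round-robin this inner loop across active prompts). This is illustrative only, not the actual server code; the placeholder path and greedy argmax are assumptions, and the real PR applies outlines' constrained sampling where the argmax sits.

# Rough sketch of the preprocess-then-one-token-at-a-time loop with exllamav2.
# Illustrative only: greedy sampling stands in for outlines' constrained sampling.
import torch
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer

config = ExLlamaV2Config()
config.model_dir = "/path/to/Phi-3-mini-128k-instruct-exl2"  # placeholder path
config.prepare()
model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)
tokenizer = ExLlamaV2Tokenizer(config)

ids = tokenizer.encode("Who is more impressive? Bob or Fred?")  # step 1: tokenize
model.forward(ids[:, :-1], cache, preprocess_only=True)         # step 1: prefill the KV cache

for _ in range(64):                                              # steps 2-3: generation loop
    logits = model.forward(ids[:, -1:], cache)                   # one token per pass
    next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)      # greedy pick (outlines would mask logits here)
    ids = torch.cat([ids, next_id.to(ids.device)], dim=-1)
    if next_id.item() == tokenizer.eos_token_id:                 # step 3: stop on end-of-sequence
        break

print(tokenizer.decode(ids[0]))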

edk208 commented 6 months ago

@isamu-isozaki yes that's correct. The preprocess runs the prompts through and sets up the KV cache, then you can round-robin through them and generate one token at a time. Interesting that outlines doesn't like step 1. I would imagine it would have to do that anyway. I can take a look too in the next few days.

isamu-isozaki commented 6 months ago

Hi! I think the main logic is done. For the test, I used this config.ini:

[settings]
host = 127.0.0.1
port = 12345
upload_url = https://url/api/upload
path_url = https://url/folder/

[phi3b]
string = phi3b
repo = ..../Phi-3-mini-128k-instruct-exl2

with the model from here, and I started the server with:

python llm_exl2_client_multi.py --port=5000 --use_outlines --gpu_split="5" --max_context=512 --repo_str=phi3b

Then on the client side, I did

from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage, AIMessage
import json
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=1.0,
                openai_api_base="http://localhost:5000/v1", 
                openai_api_key="Test",
                streaming=True, 
                max_tokens=1024)
messages = [
    SystemMessage(
        content="You are a helpful assistant."
    ),
    HumanMessage(
        content="Who is more impressive? Bob or Fred?"
    )
]
choices = ["Bob", "Fred"]

for chunk in llm.stream(messages, extra_body={"stop_at":"done", "outlines_type": "choices", "choices": choices}):
    print(chunk.content, end="", flush=True)

which got me "Bob". I can do more tests if you want, but I think it's working. One main design decision here is that, for adding new parameters to the OpenAI API, we use extra_body rather than function calling / tool calling, since I couldn't think of an easy way to translate it.
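The reason extra_body works is that the OpenAI client merges those keys into the top level of the request JSON, so the server can read them next to the standard fields. Below is a minimal sketch of picking them up; it is not the actual llm_exl2_client_multi.py handler, just an illustration of where the fields land.

# Minimal sketch (not the real handler) of reading the outlines parameters sent via extra_body.
# extra_body keys arrive as top-level fields in the chat completions payload.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/v1/chat/completions")
async def chat_completions(request: Request):
    body = await request.json()
    messages = body["messages"]                  # standard OpenAI field
    outlines_type = body.get("outlines_type")    # e.g. "choices"
    choices = body.get("choices", [])            # e.g. ["Bob", "Fred"]
    stop_at = body.get("stop_at")                # e.g. "done"
    # ... build the constrained generator from these and stream the response ...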