Lightning-AI / LitServe

Lightning-fast serving engine for any AI model of any size. Flexible. Easy. Enterprise-scale.
https://lightning.ai/docs/litserve
Apache License 2.0

Example in documentation on how to setup an OpenAI-spec API with LlamaIndex-RAG #286

Closed PierreMesure closed 2 months ago

PierreMesure commented 2 months ago

🚀 Feature

A new page of documentation explaining how to expose a LlamaIndex RAG using an OpenAI-compatible API.

Motivation

It took me a good 6 hours to put together these two tutorials, LlamaIndex RAG API and OpenAI spec, to expose my LlamaIndex app through an OpenAI-spec API. Maybe I'm a bit stupid, but I think this should be a pretty common use case, so I wanted to write a new page in LitServe's documentation. However, I couldn't find the docs source code here, so I'm writing an issue instead.

Code

server.py

from simple_llm import SimpleLLM
import litserve as ls

class LlamaIndexAPI(ls.LitAPI):
    def setup(self, device):
        # Build the LlamaIndex RAG pipeline once per worker.
        self.llm = SimpleLLM()

    def predict(self, messages):
        # The OpenAI spec passes the chat messages in; yield tokens to stream them back.
        for token in self.llm.stream(messages):
            yield token

if __name__ == "__main__":
    api = LlamaIndexAPI()
    server = ls.LitServer(api, spec=ls.OpenAISpec(), stream=True)
    server.run(port=8000)
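
With the OpenAI spec, LitServe decodes the incoming chat completion request and hands the message list to predict. Based on how SimpleLLM.stream consumes it below, predict receives something shaped like this (a sketch of the assumed input, not LitServe internals):

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Give me your favourite colour"},
]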

simple_llm.py

from llama_index.llms.openai import OpenAI
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.llms import ChatMessage

class SimpleLLM(object):
    def __init__(self):
        # Index every document found in the "data" directory.
        reader = SimpleDirectoryReader(input_dir="data")
        docs = reader.load_data()
        index = VectorStoreIndex.from_documents(docs, show_progress=True)

        llm = OpenAI(model="gpt-4o-mini", temperature=0)
        self.engine = index.as_chat_engine(streaming=True, similarity_top_k=2, llm=llm)

    def stream(self, messages_dict):
        # Convert OpenAI-style message dicts into LlamaIndex ChatMessage objects.
        messages = [
            ChatMessage(
                role=message["role"],
                content=message["content"],
            )
            for message in messages_dict
        ]

        # Use the last message as the query and the rest as chat history,
        # and return the token generator so the caller can stream the answer.
        return self.engine.stream_chat(
            messages[-1].content, chat_history=messages[:-1]
        ).response_gen
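
For reference, SimpleLLM can be sanity-checked on its own before wiring it into LitServe. This is a minimal sketch; it assumes OPENAI_API_KEY is set and that the data directory contains at least one document, since both the embeddings and the chat model go through OpenAI:

from simple_llm import SimpleLLM

llm = SimpleLLM()  # builds the index from ./data at startup
messages = [{"role": "user", "content": "Summarise the documents in one sentence."}]
for token in llm.stream(messages):
    print(token, end="", flush=True)
print()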

test.py

from openai import OpenAI

endpoint = "http://localhost:8000/v1"

client = OpenAI(base_url=endpoint, api_key="lit")

response = client.chat.completions.create(
    model="lit",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Give me your favourite colour"},
        {"role": "assistant", "content": "I quite like green."},
        {"role": "user", "content": "Why is it so?."},
    ],
    stream=True
)

for chunk in response:
    # The final chunk carries no content, so guard against None.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
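
To try it end to end, start the server with python server.py and run python test.py in another terminal. A non-streaming request against the same endpoint should also work (a sketch assuming the default OpenAISpec behaviour, which handles both modes):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="lit")
response = client.chat.completions.create(
    model="lit",
    messages=[{"role": "user", "content": "Give me your favourite colour"}],
    stream=False,
)
print(response.choices[0].message.content)
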
aniketmaurya commented 2 months ago

Hi @PierreMesure, thank you for creating the issue. Sorry, we don't have a way for external contributions to the docs at this time (hopefully soon).

This would make a great Studio template. We feature LitServe-based templates here. If you're interested in creating a template for your code, similar to LlamaIndex RAG API but with OpenAISpec, it would be a great candidate to be featured in the docs.

PierreMesure commented 2 months ago

Thank you, Aniket, that sounds like a good idea! I'm not sure how to create a template, but I guess the first step is to create an account. I'll try in the coming days. 🙂

EDIT: I tried, but I didn't manage to get through the onboarding process (invalid phone number?). I don't actually need a Studio for this service, I just wanted to add documentation for LitServe, which I'm using on my own machines. So I guess I'll wait until there's another way to contribute.

whisper-bye commented 1 month ago

@PierreMesure Could you please explain how you handle tool calls in the predict function? It seems that predict only returns a text stream.

PierreMesure commented 1 month ago

I haven't tried that. Hope you can publish your code when you've got it working 🙂.