ajndkr / lanarky

The web framework for building LLM microservices
https://lanarky.ajndkr.com/
MIT License

Question: How to pass in search_kwargs dynamically to the chain based on the query input? #100

Closed · khaledalarja closed this 11 months ago

khaledalarja commented 1 year ago

If we take the example that is in the docs:

def create_chain():
    db = FAISS.load_local(
        folder_path="vector_stores/",
        index_name="langchain-python",
        embeddings=OpenAIEmbeddings(),
    )

    return RetrievalQAWithSourcesChain.from_chain_type(
        llm=ChatOpenAI(
            temperature=0,
            streaming=True,
        ),
        chain_type="stuff",
        retriever=db.as_retriever(),
        return_source_documents=True,
        verbose=True,
    )

app = mount_gradio_app(FastAPI(title="RetrievalQAWithSourcesChainDemo"))
templates = Jinja2Templates(directory="templates")
chain = create_chain()

@app.get("/")
async def get(request: Request):
    return templates.TemplateResponse("index.html", {"request": request})

langchain_router = LangchainRouter(
    langchain_url="/chat", langchain_object=chain, streaming_mode=1
)

I want to be able to pass search_kwargs to the retriever in the chain so that it filters the search results.

The filtering should depend on the query input: for example, the request might carry an extra attribute such as a list of authorized_documents_codes, which would be passed to the retriever so it only searches over those documents.

How can I do that with Lanarky?

Here is the original code (without streaming) that I used for filtering, to give an idea of what I mean. The goal is to keep the retriever filtering but add streaming:

def send_message(question: Question):
    system_template = """Use the following pieces of context to answer the user's question.
    If you don't know the answer, just say that "I don't know", don't try to make up an answer.
    If the context is empty, just say that "I don't know", but always reply in the same language of the user.
    ----------------
    context:
    {summaries}"""

    messages = [
        SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template(question.question),
    ]
    prompt = ChatPromptTemplate.from_messages(messages)

    llm = ChatOpenAI(
        model_name=question.model_type,
        temperature=question.temperature,
        max_tokens=question.max_tokens,
    )

    search_kwargs = {"k": question.k}

    if question.authorized_codes:
        if len(question.authorized_codes) > 1:
            or_filter = [
                {"source": {"$eq": code}} for code in question.authorized_codes
            ]
            search_kwargs["filter"] = {"$or": or_filter}
        elif len(question.authorized_codes) == 1:
            search_kwargs["filter"] = {
                "source": {"$eq": question.authorized_codes[0]}
            }

    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=langchain_db_client.as_retriever(search_kwargs=search_kwargs),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt},
    )

    results = chain(question.question)

    return results

@app.post("/query")
def query(body: Question):
    logger.info(f"Received query: {body}")
    response = send_message(body)
    logger.info("Query processed successfully.")
    return response

The problem with Lanarky is that the chain has to be passed when the router is created, which makes it static and hard to modify per request.

ajndkr commented 1 year ago

Hi @khaledalarja, the endpoint creation in the examples is an abstraction that creates a /chat endpoint for basic use cases. For more complex use cases such as yours, I recommend checking out this example: https://github.com/menloparklab/chatbot-ui/blob/d2d4aa84ebb6351bfacee99429228d091221b60b/backend/app.py#L101-L114

The idea is to use the router instance to manually create a new endpoint. The advantage of using the LangchainRouter over FastAPI's APIRouter is that you get the LLM caching built-in.

You will need to create your own pydantic models for the request and response bodies. You can use the above example as a reference.
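
For reference, the pattern from the linked example looks roughly like the sketch below. This is only a sketch: Prompt and build_chain are placeholder names, and the import paths may differ between lanarky versions.

from typing import List, Optional

from fastapi import FastAPI
from lanarky.responses import StreamingResponse  # import path may differ by lanarky version
from lanarky.routing import LangchainRouter
from pydantic import BaseModel

app = FastAPI()
langchain_router = LangchainRouter()  # an APIRouter subclass, so new routes can be added to it

class Prompt(BaseModel):
    query: str
    authorized_codes: Optional[List[str]] = None
    k: int = 6

@langchain_router.post("/query")
async def query(prompt: Prompt):
    # Build the retriever filter from the request body so search_kwargs can
    # change per request, then create the chain on the fly.
    search_kwargs = {"k": prompt.k}
    if prompt.authorized_codes:
        search_kwargs["filter"] = {
            "$or": [{"source": {"$eq": code}} for code in prompt.authorized_codes]
        }
    chain = build_chain(search_kwargs)  # your own chain factory (placeholder)
    return StreamingResponse.from_chain(chain, prompt.query)

app.include_router(langchain_router)

The key point is that the chain is built inside the endpoint, so search_kwargs can depend on the request body instead of being fixed at router creation.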

khaledalarja commented 1 year ago

Hi @ajndkr, thanks so much for your reply!

I managed to get streaming working using the example you gave me, but I noticed something strange: when I use gpt-3.5-turbo the response streams sentence by sentence, but when I choose gpt-4 I lose the streaming entirely and have to wait for the response to complete!

Here is the modified code:

class Prompt(BaseModel):
    query: str
    authorized_codes: Optional[List[constr(regex=r"^\d{2}[A-Z]{3,5}\d{2,5}")]] = None
    temperature: float = 0.0
    max_tokens: int = 1000
    k: int = 6
    model_type: str = "gpt-4"

def create_chain(user_input: Prompt) -> RetrievalQAWithSourcesChain:
    system_template = """Use the following pieces of context to answer the user's question.
    If you don't know the answer, just say that "I don't know", don't try to make up an answer.
    If the context is empty, just say that "I don't know", but always reply in the same language of the user.
    ----------------
    context:
    {summaries}"""

    messages = [
        SystemMessagePromptTemplate.from_template(system_template),
        HumanMessagePromptTemplate.from_template(user_input.query),
    ]
    prompt = ChatPromptTemplate.from_messages(messages)

    llm = ChatOpenAI(
        model_name=user_input.model_type,
        temperature=user_input.temperature,
        max_tokens=user_input.max_tokens,
        streaming=True,
        verbose=True,
    )

    chain = RetrievalQAWithSourcesChain.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=langchain_db_client.as_retriever(),
        return_source_documents=True,
        chain_type_kwargs={"prompt": prompt}
    )
    return chain

@app.post("/query")
async def query(user_input: Prompt):
    chain = create_chain(user_input)
    return StreamingResponse.from_chain(chain, user_input.query)

I believe the callback only streams once it sees a newline in the response. With gpt-3.5 the response happens to contain a numbered list (so there are newlines), while with gpt-4 there are no newline characters, which is why it waits for the whole response to finish.

Any idea how to stream token by token, as normal, instead of waiting for newline characters?

ajndkr commented 1 year ago

Hmm, this is a weird one. Both gpt-3.5 and gpt-4 use the same callback, so the streaming behaviour is likely coming from the OpenAI API rather than this library. Can you try again with a different chain or a different system prompt?
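
One quick way to narrow this down (just a diagnostic sketch using LangChain's standard callback interface, not anything lanarky-specific): attach a callback that logs every token the LLM emits. If the server log shows tokens arriving one by one while the HTTP client only receives chunks at newlines, the buffering is happening on the transport/client side rather than in the chain or the callback.

from langchain.callbacks.base import AsyncCallbackHandler
from langchain.chat_models import ChatOpenAI

class TokenLogger(AsyncCallbackHandler):
    """Prints each streamed token so you can see whether they arrive one at a time."""

    async def on_llm_new_token(self, token: str, **kwargs) -> None:
        print(repr(token), flush=True)

llm = ChatOpenAI(
    model_name="gpt-4",
    streaming=True,
    callbacks=[TokenLogger()],  # added alongside whatever callbacks the response handler attaches
)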

ajndkr commented 11 months ago

closing issue due to user inactivity.

You can check out the new documentation for v0.8: https://lanarky.ajndkr.com/learn/adapters/langchain/

please reopen this issue if you'd like to discuss more.