wassname opened 9 months ago
Here's an example from my draft at https://github.com/wassname/stampy-chat
NOTE: I'm using GPT-4 in the second screenshot, so please compare the references, not the writing.
What's happening behind the scenes in the screenshot?
The user put in an initial query:
> Whats the differences between Inverse Reinforcement Learning, reward modelling, RLHF, and recursive reward modelling?
It's transformed into a better query and an example answer using these prompts:
> Please draft an academic search query with synonyms and alternative phrases that will find documents to answer the following question: {query}

and

> Please draft a concrete and concise example answer to the following question: {query}
Then the three parts are joined into a new query (user query, improved query, example answer), roughly as sketched below.
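A minimal sketch of that flow (`llm` here is a hypothetical "prompt in, text out" callable, not any particular client):

```python
def expand_query(query: str, llm) -> str:
    """Build the 3-part retrieval query: original + improved + example answer.

    `llm` is a hypothetical callable that takes a prompt string and returns
    the model's text; swap in whatever LLM client you already use.
    """
    improved = llm(
        "Please draft an academic search query with synonyms and alternative "
        "phrases that will find documents to answer the following question: "
        + query
    )
    example = llm(
        "Please draft a concrete and concise example answer to the following "
        "question: " + query
    )
    # Concatenate all three parts into the string that goes to the retriever
    return "\n\n".join([query, improved, example])
```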
There are nicer and better ways to do this, but hopefully it shows how improving retrieval can de-bottleneck Stampy. There's lots of low-hanging fruit here, especially compared to the excellent dataset and UI work you've already done.
Impressive work; it's efficient and potent. Here's a suggestion.
The search is the critical component! It's the bottleneck for answering all queries, given you already possess a robust corpus.
Currently, you're using a standard vectordb search on the query. However, this approach has significant limitations: a single dense query can miss exact keyword matches, and the way users phrase questions often differs from the way the answering passages are written.
EnsembleRetriever
Fortunately, LangChain offers modules for various retriever enhancements, and all you need to do is test them out. You can bundle multiple retrievers in an EnsembleRetriever.
In this scenario you already have the vectordb match, but you might also want to add a plain keyword search such as BM25Retriever: it's cheap and can drastically improve your retrieval. A MultiVector retriever is potentially useful too, since a document's content may differ from the questions that document could answer.
A concrete example:
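Here's a minimal sketch, assuming a list of corpus chunks `doc_list`; FAISS and OpenAIEmbeddings stand in for whatever store and embeddings you actually use (needs `rank_bm25` and `faiss-cpu` installed):

```python
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import BM25Retriever, EnsembleRetriever
from langchain.vectorstores import FAISS

doc_list = ["chunk one ...", "chunk two ..."]  # your corpus chunks

# Sparse keyword retriever: cheap, catches exact term matches
bm25_retriever = BM25Retriever.from_texts(doc_list)
bm25_retriever.k = 4

# Dense retriever over the same chunks
faiss_retriever = FAISS.from_texts(doc_list, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 4}
)

# Blend the two ranked result lists
ensemble = EnsembleRetriever(
    retrievers=[bm25_retriever, faiss_retriever], weights=[0.5, 0.5]
)
docs = ensemble.get_relevant_documents("What is reward modelling?")
```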
Advanced Pinecone features
You might also want to consider using Pinecone's other retrieval feature, hybrid (sparse + dense) search: https://www.pinecone.io/learn/hybrid-search-intro/
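A minimal sketch with LangChain's PineconeHybridSearchRetriever (index name and keys are placeholders; the index must be created with `metric="dotproduct"` for hybrid search to work):

```python
import pinecone
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import PineconeHybridSearchRetriever
from pinecone_text.sparse import BM25Encoder

pinecone.init(api_key="YOUR_KEY", environment="YOUR_ENV")
index = pinecone.Index("stampy-hybrid")  # hypothetical index, metric="dotproduct"

# Sparse encoder: use the default (fitted on MS MARCO) or .fit() on your corpus
bm25_encoder = BM25Encoder().default()

retriever = PineconeHybridSearchRetriever(
    embeddings=OpenAIEmbeddings(),
    sparse_encoder=bm25_encoder,
    index=index,
    alpha=0.5,  # 0 = pure keyword, 1 = pure dense
)
docs = retriever.get_relevant_documents("What is reward modelling?")
```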
Better embedding
You might also consider using the best embedding per the retrieval leaderboard at https://huggingface.co/spaces/mteb/leaderboard. The strongest ones there are the `e5` series of embeddings, because they specifically tie queries to answer passages rather than just text to text.
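The practical wrinkle with `e5` is its asymmetric prefixes; a minimal sketch with sentence-transformers (the passage text is just illustrative):

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-large-v2")

# e5 models are trained with explicit "query: " / "passage: " prefixes;
# this asymmetry is what ties questions to the passages that answer them.
query_emb = model.encode(
    "query: What is reward modelling?", normalize_embeddings=True
)
passage_embs = model.encode(
    ["passage: Reward modelling trains a model to predict human preferences ..."],
    normalize_embeddings=True,
)

# Dot product equals cosine similarity here, since embeddings are normalized
scores = query_emb @ passage_embs.T
```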
Result re-ranking
In continue.dev they use LLM re-ranking of results. I'm not sure about this one, but it's worth considering; a rough sketch is below.
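This is not continue.dev's actual implementation, just a naive pointwise sketch using OpenAI's chat API (any model would do):

```python
import openai

def llm_rerank(query: str, chunks: list[str], top_k: int = 5) -> list[str]:
    """Naive pointwise re-ranking: ask the LLM to score each chunk, then sort."""
    scored = []
    for chunk in chunks:
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            temperature=0,
            messages=[{
                "role": "user",
                "content": (
                    "Rate from 0 to 10 how well this passage answers the "
                    "question. Reply with a single number.\n"
                    f"Question: {query}\nPassage: {chunk}"
                ),
            }],
        )
        try:
            score = float(resp.choices[0].message.content.strip())
        except ValueError:
            score = 0.0  # unparseable reply: rank it last
        scored.append((score, chunk))
    scored.sort(key=lambda s: s[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]
```

It costs a call per chunk, so it only makes sense on the top handful of candidates from the cheaper retrievers above.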