QuivrHQ / quivr

Open-source RAG framework for building GenAI second brains 🧠. Build a productivity assistant (RAG) ⚡️🤖 and chat with your docs (PDF, CSV, ...) and apps using LangChain, GPT-3.5 / GPT-4 Turbo, Anthropic, VertexAI, Ollama, Groq, and other LLMs, privately, and share it with users. An efficient retrieval-augmented generation framework.
https://quivr.com
Other
34.1k stars · 3.34k forks

[Bug]: Unrelated results being passed to LLM on query #2301

Closed crscheid closed 2 hours ago

crscheid commented 4 months ago

What happened?

There seems to be a mismatch between the query I'm running and the document results that are being sent to the LLM.

I've created a new brain with about 200-300 documents which resulted in just under 3k vector embeddings. Running a simple search like:

"Can you tell me about Apple?"

Results in:

"I'm sorry, but based on the context provided from the files, there is no information available about Apple. If you have any specific questions or topics you'd like to know more about, please feel free to ask!"

In looking at the logs, I can see the document chunks that are sent to OpenAI for the LLM call, none of which have anything to do with Apple. I have checked my documents, and there are clearly references to Apple within the document set. So there appears to be a disconnect between my query and the document selection.

I'm not sure where I should start looking for the disconnect; any advice would be appreciated.

Relevant log output

No response


dosubot[bot] commented 4 months ago

Hey @crscheid! Great to see you back here. Hope you're doing well.

Based on the context you've provided, it seems like there might be an issue with the document retrieval process in Quivr. The retrieval process is supposed to select documents that are semantically related to your query, but from your description, it appears that this is not happening as expected.

The retrieval process in Quivr is handled by the get_retriever method in the quivr_rag.py file. This method uses a CustomSupabaseVectorStore instance to retrieve documents that are semantically related to your query. The documents are represented as vectors in a high-dimensional space, and the get_retriever method is supposed to return documents that are close to your query in this space.
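Conceptually, a retriever like this ranks stored chunks by their similarity to the query embedding and returns the top results. The following is a minimal standalone sketch of that idea (hypothetical example, not Quivr's actual `get_retriever` implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve_top_k(query_embedding, documents, k=2):
    """Rank (text, embedding) pairs by similarity to the query; return the k best texts."""
    scored = [(cosine_similarity(query_embedding, emb), text) for text, emb in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [text for _, text in scored[:k]]

# Toy 3-dimensional embeddings for illustration only.
docs = [
    ("Apple reported record revenue.", [0.9, 0.1, 0.0]),
    ("The weather was sunny today.",   [0.0, 0.2, 0.9]),
    ("Apple released a new iPhone.",   [0.8, 0.2, 0.1]),
]
top = retrieve_top_k([1.0, 0.0, 0.0], docs)
```

If this ranking step (or its SQL equivalent) is sorted in the wrong direction or skipped, the LLM receives arbitrary chunks instead of the most relevant ones.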

However, based on your description, it seems like the get_retriever method might not be returning the correct documents. This could be due to a number of reasons:

  1. Lack of Filtering or Ranking Based on Relevance: The code does not include any explicit mechanism for filtering or ranking the documents based on their relevance to the query before sending them to the LLM. This could result in all documents, regardless of their relevance, being sent to the LLM.

  2. Vector Similarity and Matching: There's no visible implementation comparing the query and document vectors to ensure that the documents being sent to the LLM are indeed relevant to the query. The absence of a similarity check, or of a threshold-based filter to determine which documents are sufficiently related to the query, could lead to unrelated documents being included in the results.

  3. Error Handling and Logging: While there is error logging in the case of exceptions, there's no indication that the system actively filters out or handles potential mismatches or errors in vector creation and retrieval that could lead to irrelevant documents being processed.
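The threshold-based filter mentioned in point 2 could look roughly like this (a hypothetical sketch; the function name and cutoff value are illustrative, not part of Quivr):

```python
def filter_by_threshold(scored_chunks, min_similarity=0.75):
    """Keep only (similarity, text) pairs at or above the cutoff,
    so low-relevance chunks never reach the LLM prompt."""
    return [(score, text) for score, text in scored_chunks if score >= min_similarity]

chunks = [
    (0.91, "Apple earnings report"),
    (0.42, "unrelated text"),
    (0.80, "Apple iPhone launch"),
]
relevant = filter_by_threshold(chunks)
```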

To debug this issue, you could start by checking the get_retriever method and the CustomSupabaseVectorStore class to see if there's an issue with the document retrieval process. You could also check the vector creation and comparison process to see if there's a mismatch between the vectors of your query and the vectors of the documents.
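One quick sanity check when debugging a vector mismatch: make sure the query embedding and the stored document embeddings share the same dimensionality, since embedding a query with a different model than the documents produces silently meaningless similarity scores. A hypothetical helper:

```python
def check_embedding_dims(query_embedding, stored_embeddings):
    """Return True if the query and all stored embeddings have one common dimension."""
    dims = {len(e) for e in stored_embeddings} | {len(query_embedding)}
    return len(dims) == 1

# e.g. 1536 dims for text-embedding-ada-002-style vectors, 768 for many others
ok = check_embedding_dims([0.1] * 1536, [[0.2] * 1536, [0.3] * 1536])
bad = check_embedding_dims([0.1] * 768, [[0.2] * 1536])
```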

If you're still having trouble, it might be helpful to provide more information about your setup and the exact steps you're taking when you encounter this issue. This could help us identify any potential issues with your specific setup or usage of Quivr.

I hope this helps, and please let me know if you have any other questions or need further clarification.

crscheid commented 4 months ago

Possible Cause?

I have noticed that quivr_rag.py creates the vector store against the vectors table, overriding the default argument of match_vectors, while supabase.py performs the similarity_search function against the match_vectors table.

Wondering if this could be the possible cause?

StanGirard commented 4 months ago

Hello my friend !

Thanks for the issue.

match_vectors is the name of the Postgres function used to do the similarity search; the code simply treats it as a table name.

However, Quivr tends to have issues finding relevant information for generic words such as "apple" when there is a large number of documents. I'm currently working on implementing https://supabase.com/blog/matryoshka-embeddings in order to improve the relevance of the documents returned.
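For context, the core trick in the linked Matryoshka-embeddings post is that, with an MRL-trained model, the leading dimensions of an embedding carry most of the signal, so vectors can be truncated and re-normalized for cheaper coarse search before re-ranking with the full vectors. A rough sketch of the truncation step (illustrative only, not Quivr's implementation):

```python
import math

def truncate_and_normalize(embedding, target_dim):
    """Keep the first target_dim components and rescale to unit length,
    so cosine similarity remains meaningful on the shortened vector."""
    head = embedding[:target_dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# A toy 4-dim "full" embedding truncated to 2 dims.
short_vec = truncate_and_normalize([3.0, 4.0, 0.05, -0.02], 2)
```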

crscheid commented 4 months ago

Hello there! Thanks for the response.

Yeah, I noticed that with 10 or so documents it was OK, but as soon as I added 200 or so, it fell over completely on every search term and sent unrelated results. This included very specific searches as well as generic ones. This is strange, because I've had success with raw LangChain vector search queries in the past.

I'll stay tuned on your enhancements and maybe explore ways on my end to customize this further.

Croccodoyle commented 3 months ago

This problem is still present in the latest build (v0.0.225). Retrieval does not work.

StanGirard commented 3 months ago

I'm currently working on a new algorithm for hybrid search that should fix this issue :)


mckbrchill commented 3 months ago

> I'm currently working on a new algorithm for hybrid search that should fix this issue :)

I've set up the latest repo version, and it seems the results from vector search are not relevant at all. After some digging, I think there is a bug in the ..._chunk.sql Supabase migrations file, in the match_vectors function.

Shouldn't the ORDER BY here be DESC?

CREATE OR REPLACE FUNCTION public.match_vectors(query_embedding vector, p_brain_id uuid, max_chunk_sum integer)
 RETURNS TABLE(id uuid, brain_id uuid, content text, metadata jsonb, embedding vector, similarity double precision)
 LANGUAGE plpgsql
AS $function$
BEGIN
    RETURN QUERY
    WITH ranked_vectors AS (
        SELECT
            v.id AS vector_id, -- Explicitly qualified
            bv.brain_id AS vector_brain_id, -- Explicitly qualified and aliased
            v.content AS vector_content, -- Explicitly qualified and aliased
            v.metadata AS vector_metadata, -- Explicitly qualified and aliased
            v.embedding AS vector_embedding, -- Explicitly qualified and aliased
            1 - (v.embedding <=> query_embedding) AS calculated_similarity, -- Calculated and aliased
            (v.metadata->>'chunk_size')::integer AS chunk_size -- Explicitly qualified
        FROM
            vectors v
        INNER JOIN
            brains_vectors bv ON v.id = bv.vector_id
        WHERE
            bv.brain_id = p_brain_id
        ORDER BY
            calculated_similarity -- Aliased similarity
    ), filtered_vectors AS (
        SELECT
            vector_id,
            vector_brain_id,
            vector_content,
            vector_metadata,
            vector_embedding,
            calculated_similarity,
            chunk_size,
            sum(chunk_size) OVER (ORDER BY calculated_similarity) AS running_total
        FROM ranked_vectors
    )
    SELECT
        vector_id AS id,
        vector_brain_id AS brain_id,
        vector_content AS content,
        vector_metadata AS metadata,
        vector_embedding AS embedding,
        calculated_similarity AS similarity
    FROM filtered_vectors
    WHERE running_total <= max_chunk_sum;
END;
$function$
;
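To see why the direction of that ORDER BY matters, here is a small Python model of the running-total cutoff in the function above (a simplification of the SQL, not the SQL itself): with an ascending sort, the `running_total <= max_chunk_sum` cap keeps the *least* similar chunks, while a descending sort keeps the *most* similar ones.

```python
def select_chunks(similarities, chunk_size, max_chunk_sum, descending):
    """Model of match_vectors: sort by similarity, then keep chunks while the
    running total of chunk sizes stays within max_chunk_sum."""
    ranked = sorted(similarities, reverse=descending)
    selected, running_total = [], 0
    for sim in ranked:
        running_total += chunk_size
        if running_total > max_chunk_sum:
            break
        selected.append(sim)
    return selected

sims = [0.1, 0.9, 0.5]
ascending_pick = select_chunks(sims, chunk_size=100, max_chunk_sum=200, descending=False)
descending_pick = select_chunks(sims, chunk_size=100, max_chunk_sum=200, descending=True)
```

With the budget capped at two chunks, the ascending order (as in the quoted SQL) selects the 0.1 and 0.5 chunks and drops the best match, which is consistent with the "unrelated results" symptom reported above.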
github-actions[bot] commented 5 days ago

Thanks for your contributions, we'll be closing this issue as it has gone stale. Feel free to reopen if you'd like to continue the discussion.