astronomer / ask-astro

An end-to-end LLM reference implementation providing a Q&A interface for Airflow and Astronomer
https://ask.astronomer.io/
Apache License 2.0

Research: Improve the ranking of the sources #133

Closed: vatsrahul1001 closed this issue 10 months ago

vatsrahul1001 commented 12 months ago

While testing Ask Astro today, we noticed some issues with the questions below:

  1. "What is Astro SDK?" The response was incorrect ("I'm sorry, but there's no such thing as Astro-SDK.") and the sources were irrelevant. Slack Thread

  2. "What is the latest version for Astronomer Providers?" The response was not correct: per Ask Astro the latest version is 1.14.0, but the latest release is 1.18.2. Slack Thread

  3. Related docs are incorrect for "How to install the CLI on Linux?" Slack Thread

  4. Ask Astro does not recognize version 2.7.3 and treats 2.2.2 as the latest. Slack Thread

Based on the above, try the following in this order:

  1. tokenization as lowercase #145
  2. Cohere reranking - PR
  3. hybrid search - PR

Test iteration for each experiment by @vatsrahul1001

--- Update 1/8: Merging two other similar/duplicate issues into this one and closing them ---

https://github.com/astronomer/ask-astro/issues/213 https://github.com/astronomer/ask-astro/issues/80

sunank200 commented 11 months ago

Here is the LangSmith trace:

  1. Yes, for "What is the Astro SDK?" the following is the LangSmith trace. It doesn't find the correct docs in the retriever, but it should have picked [1], [2], [3], [4], or [5], which are already ingested.

But when we ask "What is the Astro Python SDK?" it gives the correct answer with the right sources. Here is the LangSmith trace.

sunank200 commented 11 months ago

For "What is the latest Airflow version?" it gave the correct response, though. The LangChain trace is here.

mpgreg commented 11 months ago

Initial analysis is at https://docs.google.com/document/d/17OBh5b9fQM3kq_n1fxbIL49b2fM0-ipPPD-Ju0_2nIo/edit

TL;DR: The docSource property is vectorized during ingest and search, which skews search results containing 'astro' and 'astronomer' towards certain sources. By itself this "source skew" is not a problem, but without hybrid search the vector search skews somewhat randomly towards those sources. The formatting of Astro docs ingested from Markdown also confuses the LLM when answering.

Recommendations:

  1. Remove docSource (it is not used and is inconsistently named), or change the schema so docSource is not vectorized.
  2. Implement hybrid search.
  3. Change Astro doc ingest to extract from HTML sources.

mpgreg commented 11 months ago

I'm testing now with local docs with skip=True for docSource in schema:

                {
                    "name": "docSource",
                    "description": "Type of document ('learn', 'astro', 'airflow', 'stackoverflow', 'code_samples')",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            # booleans, not strings, so Weaviate actually skips vectorizing this property
                            "skip": True,
                            "vectorizePropertyName": False,
                        }
                    },
                },
sunank200 commented 11 months ago

We tried hybrid search with Cohere reranking, and it degraded performance. Hence it is not a priority for the Nov 28 release.

phanikumv commented 10 months ago

David to look into this

sunank200 commented 10 months ago

David to have first results by EOW

sunank200 commented 10 months ago

@davidgxue any updates on this?

davidgxue commented 10 months ago

My apologies, I forgot to update on GitHub. I synced with Steven on Friday and sent out a Google doc that contains approaches to experiment with based on initial analysis and observations (https://docs.google.com/document/d/1j-Hr8dchwBWDxejAf1dvcGIA_Y6lVQ-W9VP0k7Zw9zE/edit?usp=sharing). I am currently on PTO this entire week, but I will update with more details when I get back.

davidgxue commented 10 months ago

Update

Current Progress Update

Research Report

Next Steps and Approaches for Experimentation and Implementation

Retrieval Enhancements

  1. Hybrid / Sparse + Dense Vector Search Integration with Cohere Reranking

    • Implement a hybrid search combining BM25 and an embedding model with rank fusion scoring to narrow down results to the top 100-300 documents.
    • Rerank the shortlisted documents using Cohere to refine the selection to approximately 10 documents.
    • This approach aims to improve performance and reduce latency compared to the sequential use of BM25, embedding models, and rerankers. A rough sketch of this hybrid-plus-rerank pipeline follows this list.
  2. Language Model Prompt Rewording for Multi-Query Retrieval

    • Maintain the original user prompt as one of the queries to preserve the initial intent.
    • Optimize the rewording prompt for GPT-3.5 to ensure it is concise and summarizes the query without introducing extraneous content. A sketch of this multi-query approach also follows this list.
  3. Parent Document Retriever Implementation

    • Address the issue where relevant keywords or related terms appear in one section of a page, but the actual answers are located in a different section of the same document.
  4. Final Relevance Check with a Cost-Effective LLM

    • Utilize a less resource-intensive language model like GPT-3.5 (e.g., LLMChainFilter in LangChain) to assess the relevance of the retrieved documents before processing them with GPT-4 for final response generation. See the LLMChainFilter sketch after this list.
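
As a starting point for item 1, here is a minimal sketch of hybrid retrieval followed by Cohere reranking. It assumes a Weaviate class named Docs with a content property (both placeholder names, not Ask Astro's real schema), the weaviate-client v3 and cohere Python SDKs, and illustrative top-k values; exact response shapes can differ between SDK versions.

    # Hypothetical sketch: hybrid (BM25 + dense) retrieval in Weaviate, then Cohere reranking.
    # "Docs", "content", and the k values are placeholders, not Ask Astro's real schema.
    import os

    import cohere
    import weaviate

    weaviate_client = weaviate.Client(os.environ["WEAVIATE_URL"])
    co = cohere.Client(os.environ["COHERE_API_KEY"])

    def retrieve(query: str, first_pass_k: int = 100, final_k: int = 10) -> list[str]:
        # Hybrid search: alpha balances BM25 (alpha=0) against dense vectors (alpha=1);
        # Weaviate fuses both rankings into a single result list.
        result = (
            weaviate_client.query.get("Docs", ["content"])
            .with_hybrid(query=query, alpha=0.5)
            .with_limit(first_pass_k)
            .do()
        )
        docs = [d["content"] for d in result["data"]["Get"]["Docs"]]

        # Rerank the shortlist with Cohere and keep only the top few documents.
        reranked = co.rerank(
            query=query,
            documents=docs,
            top_n=final_k,
            model="rerank-english-v2.0",
        )
        return [docs[r.index] for r in reranked.results]

Running one fused hybrid query per request, instead of BM25 and a dense search sequentially, is what should keep latency down.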
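
For item 2, a sketch of multi-query retrieval that always keeps the original prompt as the first query. ChatOpenAI, the rewording prompt, and the number of rewordings are illustrative, and retrieve() is the hypothetical function from the previous sketch.

    # Hypothetical sketch: GPT-3.5 rewords the question, but the original prompt is
    # always kept as the first query so the initial intent is preserved.
    from langchain.chat_models import ChatOpenAI

    llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

    REWORD_PROMPT = (
        "Rewrite the following question as {n} short, self-contained search queries, "
        "one per line. Summarize; do not introduce new topics.\n\nQuestion: {question}"
    )

    def multi_query_retrieve(question: str, n_rewordings: int = 2) -> list[str]:
        reworded = llm.predict(REWORD_PROMPT.format(n=n_rewordings, question=question))
        queries = [question] + [q.strip() for q in reworded.splitlines() if q.strip()]

        seen, docs = set(), []
        for q in queries:
            for doc in retrieve(q):      # hybrid retrieval sketched above
                if doc not in seen:      # de-duplicate documents across queries
                    seen.add(doc)
                    docs.append(doc)
        return docs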
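
For item 4, a sketch of the relevance check using LangChain's LLMChainFilter; base_retriever is a placeholder for whichever retriever produces the candidate documents.

    # Hypothetical sketch: a cheap GPT-3.5 pass that drops irrelevant documents
    # before they are handed to GPT-4 for final response generation.
    from langchain.chat_models import ChatOpenAI
    from langchain.retrievers import ContextualCompressionRetriever
    from langchain.retrievers.document_compressors import LLMChainFilter

    def filter_relevant(base_retriever, question: str):
        cheap_llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
        relevance_filter = LLMChainFilter.from_llm(cheap_llm)
        # The compression retriever asks the filter whether each candidate document
        # is relevant to the question and keeps only the ones it accepts.
        retriever = ContextualCompressionRetriever(
            base_compressor=relevance_filter,
            base_retriever=base_retriever,
        )
        return retriever.get_relevant_documents(question)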

Response Generation Improvements

  1. Prompt Engineering for Source Citation
    • Refine the system prompt that triggers GPT-4 for final response generation to require explicit citation of the source document used in the answer, as recommended by Julian. This should prevent the generation of responses without proper backing from source documents. A sample prompt is sketched below.
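
As an illustration only (this is not Ask Astro's actual prompt), such a citation-enforcing system prompt could look like this:

    # Hypothetical system prompt for the final GPT-4 call; {documents} would be
    # filled with the retrieved, reranked documents.
    SYSTEM_PROMPT = """\
    You are Ask Astro, an assistant for Apache Airflow and Astronomer.
    Answer ONLY using the reference documents provided below.
    After every claim, cite the document it came from as [doc-N].
    If none of the documents answer the question, say you don't know instead of guessing.

    Reference documents:
    {documents}
    """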

Data Management and Optimization

Vector Database (Vector DB)
  1. Exploration of Alternative Embedding Models

    • Investigate the use of embedding models from Cohere as potential alternatives to OpenAI's text-embedding-ada-002 model. A small embedding sketch follows this list.
  2. Vector DB Schema Modification

    • Adjust the Vector DB schema to exclude the vectorization of certain non-essential attributes, such as docSource.
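
A minimal sketch of computing Cohere embeddings for such a comparison; the model name follows Cohere's embed API, and the example text is made up.

    # Hypothetical sketch: embedding documents with Cohere instead of OpenAI's
    # text-embedding-ada-002. Cohere v3 embed models require an input_type.
    import os

    import cohere

    co = cohere.Client(os.environ["COHERE_API_KEY"])

    doc_vectors = co.embed(
        texts=["The Astro Python SDK simplifies writing ELT DAGs."],
        model="embed-english-v3.0",
        input_type="search_document",   # use "search_query" for user questions
    ).embeddings
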
Data Ingestion Process
  1. Data Cleaning and Exclusion of Irrelevant Content

    • Implement data cleaning during ingestion to remove non-essential content like navigation bars, footers, headers, and other irrelevant sections that may introduce keyword spam and reduce retrieval accuracy. A cleaning-and-chunking sketch follows this list.
  2. Review and Refinement of Chunking Logic

    • Reassess the logic for document chunking to prevent the inclusion of headers or short, meaningless text segments.
  3. Summarization of Large Documents

    • Generate and insert summaries for excessively large documents that are split into numerous chunks, using a language model to aid in comprehension and retrieval.
  4. Topic Keyword Extraction and Metadata Storage

    • Perform topic keyword extraction on each document in the Vector DB, store the results as metadata, and enhance queries with user-prompt-derived keywords during Q&A sessions. This strategy requires significant effort and its effectiveness is yet to be determined.
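
A sketch covering items 1 and 2 together: strip boilerplate page elements before chunking, then drop chunks that are too short to be meaningful. The tag list and the 50-character cutoff are illustrative choices, not the current ingest logic.

    # Hypothetical sketch: clean HTML before chunking, then filter out tiny chunks.
    from bs4 import BeautifulSoup
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    def clean_and_chunk(html: str, min_chars: int = 50) -> list[str]:
        soup = BeautifulSoup(html, "html.parser")
        # Remove navigation bars, headers, footers, and scripts that cause keyword spam.
        for tag in soup.find_all(["nav", "header", "footer", "aside", "script", "style"]):
            tag.decompose()
        text = soup.get_text(separator="\n", strip=True)

        splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
        chunks = splitter.split_text(text)
        # Drop header-only or otherwise meaningless fragments.
        return [c for c in chunks if len(c.strip()) >= min_chars]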