Closed vatsrahul1001 closed 10 months ago
Here is the LangSmith trace for the question:
what is the Astro SDK?
The retriever doesn't find the correct docs. It should have picked one of [1], [2], [3], [4], [5], which are already ingested. But when we ask "What is Astro Python SDK?"
it gives correct answers with the right sources. Here is the LangSmith trace.
For the following question:
Hi all ,
Need your help and guidance
I am going to install airflow version 2.7.3 in new Ec2 instance with postgres
EC2 - t3.x large (4 vCPUs, 16 GB RAM )
What is the ubuntu and python version that will be compatible with this ?
Anyone who installed 2.7.3 can share your thought
The LangSmith trace is here.
For "What is the latest Airflow version?"
it gave the correct response, though. Langchain trace is here.
Initial analysis is at https://docs.google.com/document/d/17OBh5b9fQM3kq_n1fxbIL49b2fM0-ipPPD-Ju0_2nIo/edit
TLDR;
The docSource property is vectorized during ingest and search. This skews search results containing 'astro' and 'astronomer' towards certain sources.
By itself the "source skew" is not a problem, but without hybrid search the vector search skews somewhat randomly.
The formatting of Astro docs ingested from Markdown confuses the LLM when answering.
Recommendations:
Remove docSource. It is not used and is inconsistently named. Or change the schema for docSource to skip vectorization.
Implement hybrid search.
Change the Astro doc ingest to extract from HTML sources.
I'm testing now with local docs with skip=True for docSource in schema:
{
"name": "docSource",
"description": "Type of document ('learn', 'astro', 'airflow', 'stackoverflow', 'code_samples')",
"dataType": ["text"],
"moduleConfig": {
"text2vec-openai": {
"skip": true,
"vectorizePropertyName": false
}
}
},
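A minimal sketch of the schema change above, for anyone who wants to apply it programmatically. It assumes the class schema is available as a Python dict in the Weaviate style shown in the snippet; the helper name `skip_vectorization` and the sample class `Docs` are hypothetical and not taken from the Ask Astro code base.

```python
import copy


def skip_vectorization(class_schema, prop_name, module="text2vec-openai"):
    """Return a copy of class_schema where prop_name is excluded from vectorization."""
    updated = copy.deepcopy(class_schema)
    for prop in updated.get("properties", []):
        if prop["name"] == prop_name:
            module_cfg = prop.setdefault("moduleConfig", {}).setdefault(module, {})
            module_cfg["skip"] = True  # do not vectorize the property value
            module_cfg["vectorizePropertyName"] = False  # nor the property name
            return updated
    raise KeyError(f"property {prop_name!r} not found in schema")


# Hypothetical class schema, used only for illustration.
schema = {
    "class": "Docs",
    "properties": [
        {"name": "docSource", "dataType": ["text"]},
        {"name": "content", "dataType": ["text"]},
    ],
}

patched = skip_vectorization(schema, "docSource")
```

The helper returns a modified copy and leaves the input untouched, so the original and patched schemas can be compared before re-ingesting.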
We tried hybrid search with Cohere reranking, and it degraded performance. Hence it is not a priority for the 28th Nov release.
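For readers unfamiliar with the term, here is a rough sketch of what "hybrid" means: a keyword (sparse) score and a vector (dense) score are normalized and blended with a weight. This is an illustration under assumptions, not the fusion algorithm used by Weaviate or the setup that was tested; the function names, the `alpha` weighting, and the scores are invented, and the Cohere reranking step is not shown.

```python
def normalize(scores):
    """Min-max scale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(keyword_scores, vector_scores, alpha=0.5):
    """Blend scores; alpha=1.0 is pure vector search, alpha=0.0 is pure keyword search."""
    kw = normalize(keyword_scores)
    vec = normalize(vector_scores)
    combined = {
        doc_id: alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
        for doc_id in set(kw) | set(vec)
    }
    # Highest combined score first; ties broken by doc id for determinism.
    return sorted(combined, key=lambda d: (-combined[d], d))


# Invented scores for the query "what is the Astro SDK?".
keyword = {"astro-sdk-docs": 9.0, "astro-cli-docs": 2.0, "astronomy-blog": 1.0}
vector = {"astro-sdk-docs": 0.80, "astro-cli-docs": 0.82, "astronomy-blog": 0.85}

keyword_heavy = hybrid_rank(keyword, vector, alpha=0.3)
vector_only = hybrid_rank(keyword, vector, alpha=1.0)
```

With these made-up numbers the keyword-heavy blend puts `astro-sdk-docs` first, while pure vector search puts `astronomy-blog` first. Whether a blend helps on real data depends on the corpus and the weight, which is consistent with the mixed results reported here.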
David to look into this
David to have first results by EOW
@davidgxue any updates on this?
My apologies, I forgot to update on GitHub. I synced with Steven on Friday and sent out a Google doc that contains approaches to experiment with, based on the initial analysis and observations (https://docs.google.com/document/d/1j-Hr8dchwBWDxejAf1dvcGIA_Y6lVQ-W9VP0k7Zw9zE/edit?usp=sharing). I am currently on PTO this entire week, but I will update with more details when I get back.
Hybrid / Sparse + Dense Vector Search Integration with Cohere Reranking ✅
Language Model Prompt Rewording for Multi-Query Retrieval ✅
Parent Document Retriever Implementation
Final Relevance Check with a Cost-Effective LLM ✅
Exploration of Alternative Embedding Models: text-ada-002 model from OpenAI.
Vector DB Schema Modification: docSource.
Data Cleaning and Exclusion of Irrelevant Content
Review and Refinement of Chunking Logic
Summarization of Large Documents
Topic Keyword Extraction and Metadata Storage
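As one illustration of the list above, the multi-query retrieval idea can be sketched as follows: run several rewordings of the question through the retriever and merge the unique results. This is a simplified sketch under assumptions, not the implementation used in Ask Astro. The variants are hand-written here (in practice an LLM would generate them), and `multi_query_retrieve`, `fake_retrieve`, and the toy index are hypothetical.

```python
def multi_query_retrieve(queries, retrieve, limit=5):
    """Run each query variant through retrieve() and return unique doc ids in first-seen order."""
    seen = set()
    merged = []
    for query in queries:
        for doc_id in retrieve(query):
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged[:limit]


# Toy index standing in for a real retriever; contents are invented for illustration.
FAKE_INDEX = {
    "what is the astro sdk?": ["astro-cli-docs", "astronomy-blog"],
    "what is the astro python sdk?": ["astro-sdk-docs", "astro-cli-docs"],
}


def fake_retrieve(query):
    """Look up a query in the toy index, ignoring case."""
    return FAKE_INDEX.get(query.lower(), [])


variants = ["What is the Astro SDK?", "What is the Astro Python SDK?"]
results = multi_query_retrieve(variants, fake_retrieve)
```

In this toy setup the second wording is what brings `astro-sdk-docs` into the candidate set, mirroring the behaviour seen in the traces where "Astro Python SDK" retrieved the right sources and "Astro SDK" did not.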
While testing Ask Astro today, we noticed issues with the questions below:
What is Astro SDK? The response was incorrect ("I'm sorry, but there's no such thing as Astro-SDK.") and the sources were irrelevant. Slack Thread
What is the latest version of Astronomer Providers? The response was incorrect: according to Ask Astro the latest version is 1.14.0, but the latest release is 1.18.2. Slack Thread
Related docs are incorrect for "How to install the CLI on Linux?" Slack Thread
Ask Astro does not recognize version 2.7.3 and treats 2.2.2 as the latest. Slack Thread
Based on the above, try the following in order:
Test iteration for each experiment by @vatsrahul1001
--- Update 1/8: Merging two other similar/duplicate issues into this one and closing the other two ---
https://github.com/astronomer/ask-astro/issues/213 https://github.com/astronomer/ask-astro/issues/80