Open GargPriyanshu1112 opened 7 months ago
The code does not use canned packages like LangChain; I don't like them. It does have a fairly complex flow involving multiple interactions with LLMs and other text-processing software. If by embeddings you mean similarity tests across text strings, then no: the code uses custom similarity code built around wordfreq to extract the sentences most closely related to the (rewritten) query, which are then sent on to an LLM for further processing.
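To make the idea concrete, here is a minimal sketch of rarity-weighted keyword selection in the spirit described above. This is illustrative, not the repo's actual code: the real implementation draws word frequencies from the wordfreq package, while the tiny `COMMON` table and all function names here are stand-ins.

```python
import re

# Tiny stand-in frequency table (Zipf-style scale: ~1 = rare, ~7 = very
# common). The real code uses the wordfreq package for these values.
COMMON = {"the": 7.0, "a": 7.0, "of": 7.0, "is": 6.5, "search": 4.5,
          "llm": 3.0, "embeddings": 3.0, "query": 4.0}

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

def rarity(word):
    # Invert the frequency so that rare words contribute more to the score;
    # unknown words get a default mid-rare frequency of 2.0.
    return max(0.0, 8.0 - COMMON.get(word, 2.0))

def score_sentence(sentence, query_words):
    words = set(tokenize(sentence))
    return sum(rarity(w) for w in query_words if w in words)

def top_sentences(sentences, query, k=3):
    # Rank page sentences by rarity-weighted overlap with the query words.
    qwords = set(tokenize(query))
    return sorted(sentences, key=lambda s: score_sentence(s, qwords),
                  reverse=True)[:k]
```

A scheme like this tends to drop boilerplate sentences (ads, navigation text) that share no informative words with the query, which is the filtering role described in this thread.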
It really isn't that complex; I encourage you to read the code, and I'm happy to answer any questions you might have. You will see the limitations of LangChain and deepen your understanding of how one can use LLMs in a workflow.
cheers - Bruce
-- Bruce D'Ambrosio
Yeah, I meant similarity tests across text strings using embeddings. My understanding is that searching based on embeddings, as opposed to relying solely on word frequency, can help extract more closely related sentences. This is particularly relevant for sentences closely related to the query that may not contain the exact required words. What are your thoughts on this alternative approach?
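The embedding-based alternative being proposed here can be sketched as cosine-similarity ranking over sentence vectors. The `embed()` below is a deliberately toy bag-of-letters stand-in just to make the sketch runnable; a real system would substitute a sentence-embedding model.

```python
import math

def embed(text):
    # Toy embedding: a 26-dim letter-count vector. A real implementation
    # would call a sentence-embedding model here instead.
    vec = [0.0] * 26
    for ch in text.lower():
        if 'a' <= ch <= 'z':
            vec[ord(ch) - ord('a')] += 1.0
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_embedding(sentences, query, k=3):
    # Rank sentences by cosine similarity to the query vector; with a real
    # model, paraphrases score high even without exact word overlap.
    qv = embed(query)
    return sorted(sentences, key=lambda s: cosine(embed(s), qv),
                  reverse=True)[:k]
```

The trade-off relative to the wordfreq approach is cost: every candidate sentence must be embedded, whereas keyword scoring is essentially free.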
I'm not sure; it would be interesting to compare the approaches. The primary goal of the keyword-based sentence selection is to eliminate much of the junk, formatting, ads, etc. from the text of a page before sending it to gpt-3.5 for processing. The code first uses unstructured.io to extract text from a page, then downselects sentences using wordfreq. One softener of exact-match problems is that the code also has gpt-3.5 rewrite the query and extract keywords, then uses both the original and rewritten queries (both for the Google lookup and for sentence extraction).
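The overall pipeline shape described here can be sketched as follows. All function names are illustrative stand-ins, not the repo's API: `rewrite_query` stands in for the gpt-3.5 query rewrite, `select_sentences` for the wordfreq downselection, and `ask_llm` for the final gpt-3.5 call.

```python
def search_page(page_text, query, rewrite_query, select_sentences, ask_llm):
    # Rewrite the query once (in the real pipeline, a gpt-3.5 call).
    rewritten = rewrite_query(query)
    # Select sentences against BOTH query forms; using the rewritten query
    # alongside the original softens exact-match misses.
    candidates = set()
    for q in (query, rewritten):
        candidates.update(select_sentences(page_text, q))
    # Send only the surviving sentences to the LLM for final processing.
    return ask_llm(query, sorted(candidates))
```

Deduplicating through a set matters here: a sentence matched by both query forms should only be sent to the LLM once.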
I was just messing around, so there are probably lots of silly choices. Once it was working well enough I moved on. The core code is still active in my research assistant, Sam, but I haven't released Sam yet; it's a work in progress. Sam adds local Wikipedia as well as arXiv search, and I use Specter2 embeddings heavily in the arXiv code.
I wanted to know whether the search methodology employs a thought process (like LangChain's agents) or makes use of embeddings. Thanks. I am new to generative AI; sorry if this query sounds stupid.