bdambrosio / llmsearch

llm enhanced web search

Query Regarding Search Methodology #2

Open GargPriyanshu1112 opened 7 months ago

GargPriyanshu1112 commented 7 months ago

I wanted to know whether the search methodology employs a thought process (like LangChain's agents) or whether it makes use of embeddings. Thanks. I am new to generative AI, so sorry if this query sounds stupid.

bdambrosio commented 7 months ago

The code does not use canned packages like LangChain; I don't like them. It does have a fairly complex flow involving multiple interactions with LLMs and other text-processing software. Embeddings, if you mean for similarity tests across text strings, no: the code uses custom similarity code built around wordfreq to extract the sentences most closely related to the (rewritten) query, which are then sent on to an LLM for further processing.
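To make the wordfreq idea concrete, here is a minimal, self-contained sketch of frequency-weighted keyword matching. This is not the repo's actual code: the repo uses the wordfreq package for real word frequencies, while here a tiny hand-made Zipf-style table stands in so the example runs without dependencies. The idea is that rare, distinctive query words count for more than common ones when scoring candidate sentences.

```python
import re

# Hypothetical Zipf-style frequencies (higher = more common); a stand-in
# for wordfreq.zipf_frequency(word, "en").
COMMON = {"the": 7.0, "is": 6.8, "a": 7.0, "of": 6.9, "search": 4.5,
          "web": 4.3, "embedding": 2.5, "wordfreq": 1.0}

def rarity(word: str) -> float:
    # Invert frequency so rare, distinctive words dominate the score.
    return 8.0 - COMMON.get(word, 2.0)

def score(query: str, sentence: str) -> float:
    tokenize = lambda t: set(re.findall(r"[a-z]+", t.lower()))
    # Sum the rarity of every query word that also appears in the sentence.
    return sum(rarity(w) for w in tokenize(query) & tokenize(sentence))

def top_sentences(query: str, sentences: list, k: int = 2) -> list:
    # Keep the k sentences sharing the most distinctive words with the query.
    return sorted(sentences, key=lambda s: score(query, s), reverse=True)[:k]
```

A sentence containing "embedding" will outrank one that only shares "the" with the query, which is roughly the junk-filtering effect described above.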

It really isn't that complex. I encourage you to read the code; I'm happy to answer any questions you might have. You will come to see the limitations of LangChain and deepen your understanding of how one can use LLMs in a workflow.

cheers - Bruce


GargPriyanshu1112 commented 7 months ago

Yeah, I meant similarity tests across text strings using embeddings. My understanding is that searching based on embeddings, as opposed to relying solely on word frequency, can help extract more closely related sentences. This is particularly relevant when dealing with sentences closely related to the query that may not necessarily contain the exact required words. What are your thoughts on this alternative approach?
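The alternative being proposed can be sketched in a few lines: embed the query and each candidate sentence as vectors and rank by cosine similarity, so a sentence can match even when it shares no exact words with the query. The vectors below are hand-made toy embeddings purely for illustration; in practice a sentence-encoder model would produce them.

```python
import math

def cosine(a: list, b: list) -> float:
    # Standard cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_by_similarity(query_vec: list, sentence_vecs: list) -> list:
    # Return sentence indices ordered from most to least similar to the query.
    sims = [cosine(query_vec, v) for v in sentence_vecs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
```

The trade-off versus keyword scoring is recall for paraphrases at the cost of running an embedding model over every candidate sentence.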

bdambrosio commented 7 months ago

I'm not sure; it would be interesting to compare the approaches. The primary goal of the keyword-based sentence selection is to eliminate much of the junk/formatting/ads from the text of a page before sending it to gpt-3.5 for processing. The code first uses unstructured.io to extract text from a page, then downselects sentences using wordfreq. One softener of the exact-match problem is that the code also has gpt-3.5 rewrite the query and extract keywords, then uses both the original and rewritten queries (both for the Google lookup and for sentence extraction).
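The rewrite-then-select step above can be sketched as follows. This is a hedged illustration, not the repo's code: in the real flow the rewrite is done by gpt-3.5, so a trivial stub stands in here, and a sentence survives if it shares keywords with either the original or the rewritten query.

```python
import re

def rewrite_query(query: str) -> str:
    # Placeholder for the gpt-3.5 rewrite step, which would rephrase the
    # query and surface extra keywords; here we just append some terms.
    return query + " similarity keyword extraction"

def keywords(text: str) -> set:
    return set(re.findall(r"[a-z]+", text.lower()))

def select_sentences(query: str, sentences: list, min_hits: int = 1) -> list:
    # Union the keywords of the original and rewritten queries, then keep
    # any sentence that matches at least min_hits of them.
    terms = keywords(query) | keywords(rewrite_query(query))
    return [s for s in sentences if len(keywords(s) & terms) >= min_hits]
```

Using both query forms widens the net, which partially compensates for exact-match misses without paying for embeddings.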

I was just messing around, so there are probably lots of silly choices; once it was working well enough I moved on. The core code is still active in my research assistant, Sam, but I haven't released Sam yet; it's a work in progress. Sam adds local Wikipedia as well as arXiv search, and I use SPECTER2 embeddings heavily in the arXiv code.
