assafelovic / gpt-researcher

GPT based autonomous agent that does online comprehensive research on any given topic
https://gptr.dev
MIT License

Support langchain documents #630

Open hslee16 opened 5 days ago

hslee16 commented 5 days ago

Motivation

Rather than using documents within a local directory and relying on various loaders to generate langchain document instances, let's enable the researcher to use langchain documents directly. This is useful when an existing langchain backend has already created embeddings, documents, and various retrievers.
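
For illustration, a rough sketch of what this could look like; the documents parameter and the "langchain_documents" report source value are just placeholders for whatever interface this PR ends up adding, not confirmed API:

from langchain_core.documents import Document
from gpt_researcher import GPTResearcher

# Documents already produced by an existing langchain pipeline
docs = [
    Document(page_content="...", metadata={"source": "internal-kb"}),
]

# Hypothetical usage: hand the documents straight to the researcher
researcher = GPTResearcher(
    query="What do our internal documents say about topic X?",
    report_source="langchain_documents",  # assumed value for this new source
    documents=docs,                       # assumed parameter added by this PR
)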

ElishaKay commented 5 days ago

Morning @hslee16,

Not sure this is related to this PR, but I figured I'd ask anyways.

For the next big release, we're planning on leveraging the langgraph-api - which comes with a postgres service as a dependency.

The architecture will be:

You mentioned this PR could be helpful for embeddings & retrievers - which is the same direction I'm interested in - especially for supporting the document Uploader feature (the langgraph-api service needs access to documents uploaded via the service on port 8000).

Questions:

Perhaps a design review would be helpful if we'd like to implement Langchain Retrievers which leverage PGVector, as proposed by this user in the discord.

hslee16 commented 4 days ago

@ElishaKay

Thanks for your speedy response!

how do we save uploaded documents in postgres?

In Langchain, using PGVector, we would do it the same way as described in the PGVector documentation:

from langchain_community.vectorstores.pgvector import PGVector

# Persist the split/parsed documents and their embeddings to Postgres;
# docs, embeddings and collection come from the existing pipeline.
vectorstore = PGVector.from_documents(
    docs,
    embeddings,
    collection_name=collection,
)

This persists the split/parsed documents along with generated embeddings similar to the following:

[Screenshot: Postgres rows showing the persisted documents and their embeddings]

how do we retrieve?

This part is a little trickier. Neither the retriever nor the base document object in langchain appears to have a "fetch all documents" method. Rather, retrievers expose an invoke (or ainvoke) method where users specify a query that performs a similarity search on the vector store. Only the documents that are "similar" to the query are returned.
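
For example, continuing from the vectorstore created above, a typical retriever lookup looks roughly like this (the query string and the k value are just placeholders):

# Retrievers only expose similarity-based lookup over the vector store,
# not a "fetch all documents" call.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.invoke("background on the research topic")
# relevant_docs now contains only the documents most similar to the query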

My private project uses langchain with PGVector, so I am able to generate the same set of "documents" that gpt-researcher uses when ReportSource == local.

The main goal for me is to leverage the existing documents from PDFs (and other document types) that have already been processed by langchain. Since gpt-researcher already uses various langchain readers for local documents, I figured it would be simple enough to bolt on the langchain documents directly.

Please let me know how I can help. If you have existing design documents, I'm happy to take a look. Lastly, the discord link you posted above leads me to a blank channel. Perhaps you can give me access?

My discord user is: hslee16


Many thanks in advance!

assafelovic commented 4 days ago

Hey @hslee16, thanks for this PR! Can you please also include in the PR a tutorial for how to use this, here: https://github.com/assafelovic/gpt-researcher/blob/master/docs/docs/gpt-researcher/tailored-research.md

hslee16 commented 3 days ago

@assafelovic can do!

ElishaKay commented 1 day ago

Loving the concept & initiative @hslee16

To access that discord thread, first accept this invite:

https://discord.gg/spBgZmm3Xe

The discord thread proposed an interesting concept of letting the agent decide what type of report_source to use.