assafelovic / gpt-researcher

GPT based autonomous agent that does online comprehensive research on any given topic
https://gptr.dev
MIT License

Support langchain documents #630

Open hslee16 opened 5 days ago

hslee16 commented 5 days ago

Motivation

Rather than using documents within a local directory and relying on various loaders to generate langchain document instances, let's enable the researcher to use langchain documents directly. This is useful when an existing langchain backend has already created embeddings, documents, and various retrievers.
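
For illustration, a rough sketch of what this could look like; the documents parameter and the "langchain_documents" report source value are just placeholders for whatever interface this PR ends up adding, not confirmed API:

from langchain_core.documents import Document
from gpt_researcher import GPTResearcher

# Documents already produced by an existing langchain pipeline
docs = [
    Document(page_content="...", metadata={"source": "internal-kb"}),
]

# Hypothetical usage: hand the documents straight to the researcher
researcher = GPTResearcher(
    query="What do our internal documents say about topic X?",
    report_source="langchain_documents",  # assumed value for this new source
    documents=docs,                       # assumed parameter added by this PR
)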

ElishaKay commented 5 days ago

Morning @hslee16,

Not sure this is related to this PR, but I figured I'd ask anyways.

For the next big release, we're planning on leveraging the langgraph-api - which comes with a postgres service as a dependency.

The architecture will be:

You mentioned this PR could be helpful for embeddings & retrievers - which is the same direction I'm interested in - especially for supporting the document Uploader feature (the langgraph-api service needs access to documents uploaded via the service on port 8000).

Questions:

Perhaps a design review would be helpful if we'd like to implement Langchain Retrievers which leverage PGVector, as proposed by this user in the discord.

hslee16 commented 4 days ago

@ElishaKay

Thanks for your speedy response!

how do we save uploaded documents in postgres?

In Langchain, using PGVector, we would do it the same way as described in the PGVector documentation:

from langchain_community.vectorstores.pgvector import PGVector

# Persist the split/parsed documents and their embeddings to Postgres;
# docs, embeddings and collection come from the existing pipeline.
vectorstore = PGVector.from_documents(
    docs,
    embeddings,
    collection_name=collection,
)

This persists the split/parsed documents along with generated embeddings similar to the following:

[Screenshot: Postgres rows showing the persisted documents and their embeddings]

how do we retrieve?

This part is a little trickier. Neither the retriever nor the base document object in langchain appears to have a "fetch all documents" method. Rather, retrievers expose an invoke (or ainvoke) method where users specify a query that performs a similarity search on the vector store. Only the documents that are "similar" to the query are returned.
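
For example, continuing from the vectorstore created above, a typical retriever lookup looks roughly like this (the query string and the k value are just placeholders):

# Retrievers only expose similarity-based lookup over the vector store,
# not a "fetch all documents" call.
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
relevant_docs = retriever.invoke("background on the research topic")
# relevant_docs now contains only the documents most similar to the query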

My private project uses langchain with PGVector, so I am able to generate the same set of "documents" that gpt-researcher uses when ReportSource == local.

The main goal for me is to leverage the existing documents from PDFs (and other document types) that have already been processed by langchain. Since gpt-researcher already uses various langchain readers for local documents, I figured it would be simple enough to bolt on the langchain documents directly.

Please let me know how I can help. If you have existing design documents, I'm happy to take a look. Lastly, the discord link you posted above leads me to a blank channel. Perhaps you can give me access?

My discord user is: hslee16


Many thanks in advance!

assafelovic commented 4 days ago

Hey @hslee16, thanks for this PR! Can you please also include in the PR a tutorial for how to use this, here: https://github.com/assafelovic/gpt-researcher/blob/master/docs/docs/gpt-researcher/tailored-research.md

hslee16 commented 3 days ago

@assafelovic can do!

ElishaKay commented 1 day ago

Loving the concept & initiative @hslee16

To access that discord thread, first accept this invite:

https://discord.gg/spBgZmm3Xe

The discord thread proposed an interesting concept of letting the agent decide what type of report_source to use.