Document-Based Research Does Not Work Out of the Box

yigit353 commented 1 month ago

Describe the bug

The document-based query needs to work more intuitively in the Next.js-based app. If DOC_PATH is not set, it should default to ./my-docs, consistent with the default upload location used when nothing is provided.

To Reproduce

if there is no DOC_PATH in the environment, the document-based query does not work
setting DOC_PATH in the API Variables UI didn't have an effect

Expected behavior

servers might not want to set DOC_PATH, so there should be a consistent default behavior
users might want to use prompts rather than just queries
there should be a fallback mechanism for when nothing is retrieved

assafelovic commented 1 month ago

You’ve tried it with the React based app?

yigit353 commented 1 month ago

You’ve tried it with the React based app?

Yes, not the static one

assafelovic commented 1 month ago

@ElishaKay can you take a look? Thanks for raising this, @yigit353. If you try the pip package, it should work for now until we get this resolved. @ElishaKay will follow up with questions or an update tomorrow.

ElishaKay commented 1 month ago

Thanks for raising @yigit353

Some important edge cases here.

a) It sounds like some of these issues might be solved by:

setting DOC_PATH=./my-docs in the .env.example file
updating every place in the code that calls os.getenv("DOC_PATH", "") to use os.getenv("DOC_PATH", "./my-docs") instead
updating the nginx file to also support the setConfig API route (see item d below)

Would you like to create the PR for that, so we can carve your name deeper into the tree?
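A minimal sketch of the default proposed in (a), assuming a helper named `resolve_doc_path` (hypothetical, not a function in the repository). It also treats a blank DOC_PATH as unset, which a plain os.getenv("DOC_PATH", "./my-docs") would not do, since an empty value that is present in the environment is returned as-is.

```python
import os

DEFAULT_DOC_PATH = "./my-docs"  # default named in this issue; matches the default upload location


def resolve_doc_path(env=None):
    """Return DOC_PATH, falling back to ./my-docs when it is unset or blank.

    `resolve_doc_path` is a hypothetical helper used only for illustration.
    """
    env = os.environ if env is None else env
    # `or` covers DOC_PATH="" as well as a missing variable.
    return env.get("DOC_PATH", "").strip() or DEFAULT_DOC_PATH
```

Passing a plain dict as `env` makes the fallback easy to check without touching the real environment, for example `resolve_doc_path({"DOC_PATH": ""})`.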

b) You wrote in the discord: "I can see the file uploaded to the local Docker containers file system. However, when I run 'My Documents' report generation, I get the following: 🤷 Failed to load any documents!"

And you shared these logs:

Empty string is because:
INFO:     [19:38:31] 📚 Getting relevant content based on query: Do a simple analysis...
INFO:     [19:38:31] 🤷 No content found for 'Do a simple analysis'...

We need to investigate: what's the difference between

🤷 Failed to load any documents
🤷 No content found for 'Do a simple analysis'

My guess: "🤷 No content found for 'Do a simple analysis'" may occur because a vector search on the phrase 'Do a simple analysis' finds no matching results, even if the document has been embedded successfully.

I've had similar cases where I needed to be very specific about my question related to documents, in order for the answer to be meaningful.
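To illustrate the guess above, here is a toy sketch of similarity-based retrieval. It is not how gpt-researcher retrieves content: the bag-of-words `embed`, the `retrieve` function, the sample chunks, and the 0.2 threshold are all assumptions made for the example. Real embedding models compare meaning rather than exact words, but the effect is similar: a generic instruction shares little with the document text and can fall below the cutoff.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, threshold=0.2):
    """Return the chunks whose similarity to the query meets the (assumed) threshold."""
    q = embed(query)
    return [c for c in chunks if cosine(q, embed(c)) >= threshold]


# Made-up document chunks, for illustration only.
chunks = [
    "quarterly revenue grew in the cloud segment",
    "operating costs fell after the data center migration",
]
vague = retrieve("Do a simple analysis", chunks)            # shares no terms with any chunk
specific = retrieve("How did cloud revenue change", chunks)  # overlaps with the first chunk
```

Here the vague query matches nothing even though both chunks are "embedded", while the specific question retrieves the first chunk.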

c) Once we understand the above 2, we'll add some meaningful popup message for the user on the frontend.

d) Regarding what you mentioned, "setting DOC_PATH in the API Variables UI didn't have an effect": what error are you seeing in the Network tab?

Perhaps we need to update the nginx file to also support the setConfig API route.

P.S. @assafelovic, we made some changes to the docs & don't see them deployed?

e) If we're already meditating on this, @assafelovic, I saw in an email that Langgraph Cloud is out of beta, so we may want to add support for saving embedded documents within Postgres - that way the Multi_Agents flow triggered on Langgraph Cloud can fetch documents that are uploaded to localhost:8000

f) @yigit353 let's have a separate discussion about "users might want to use prompts rather than just queries" - I like the direction - happy to hear your vision around this

yigit353 commented 1 month ago

d) @ElishaKay I see no error in the console but in the backend I get:

 File "/usr/src/app/gpt_researcher/master/agent.py", line 133, in conduct_research
    document_data = await DocumentLoader(self.cfg.doc_path).load()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This happens because self.cfg is never set from any source, which traces back to websocket_manager.py:

async def run_agent(task, report_type, report_source, source_urls, tone: Tone, websocket, headers=None):
    """Run the agent."""
    # measure time
    start_time = datetime.datetime.now()
    # add customized JSON config file path here
    config_path = ""

Here, it looks for a customized config path, but the path is hardcoded to an empty string. As a result, it cannot save ApiVariables from the Modal, and self.cfg always becomes None. The retrieval of ApiVariables depends only on the environment variables; the Modal serves no purpose in this setting.
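A rough sketch of the precedence problem described above, assuming a hypothetical `load_config` function (this is not the project's Config class, and the `ui_values` dict shape is an assumption). With an empty `config_path` the file branch is skipped, so only environment variables contribute unless values from the Modal are merged in explicitly.

```python
import json
import os


def load_config(config_path="", env=None, ui_values=None):
    """Hypothetical loader: environment first, then an optional JSON file, then UI values."""
    env = os.environ if env is None else env
    cfg = {"doc_path": env.get("DOC_PATH", "")}
    if config_path:  # an empty string, as in the snippet above, skips this branch
        with open(config_path, encoding="utf-8") as fh:
            cfg.update(json.load(fh))
    if ui_values:  # values submitted from the API Variables Modal (assumed to be a dict)
        cfg.update({key.lower(): value for key, value in ui_values.items() if value})
    return cfg
```

For example, `load_config("", env={}, ui_values={"DOC_PATH": "./my-docs"})` yields a `doc_path` of `./my-docs`, whereas without `ui_values` it stays empty.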

Changing setConfig in the nginx config had no effect when I tried it on the server.

yigit353 commented 1 month ago

@ElishaKay I also opened this PR: https://github.com/assafelovic/gpt-researcher/pull/863. However, the Modal having no effect on ApiVariables possibly deserves its own issue. Also, if nothing is retrieved, a report should not be generated.
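One way the "no report when nothing is retrieved" behavior could look, as a sketch only. `NoDocumentsError` and `ensure_context` are hypothetical names, not part of the repository or of the PR above.

```python
class NoDocumentsError(RuntimeError):
    """Raised when retrieval returns nothing usable (hypothetical exception)."""


def ensure_context(documents, query):
    """Return the non-blank documents, or raise instead of producing an empty report."""
    usable = [doc for doc in documents if doc and doc.strip()]
    if not usable:
        raise NoDocumentsError(f"No content found for {query!r}; skipping report generation.")
    return usable
```

The caller could catch the exception and forward its message to the frontend, which would also cover the popup message mentioned in (c).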

assafelovic / gpt-researcher

Document-Based Research Does Not Work Out of the Box #858