assafelovic / gpt-researcher

GPT based autonomous agent that does online comprehensive research on any given topic
https://gptr.dev
MIT License
12.98k stars 1.61k forks

References from previous report included in current report? #623

Open danieldekay opened 6 days ago

danieldekay commented 6 days ago

I have just done two GPT-researcher reports in succession using the web interface, with the master branch (hash: 557207f424ccc92b84eca3ec9a15182f66e5ed43).

Setup:

  • AzureOpenAI
  • Bing search

Many of the references cited in the reference list of Report 2 were originally found for Report 1.

Seems to be a bug?

Mitchell-xiyunfeng commented 6 days ago

Two important questions please:

  1. Will GPT Researcher deployed on RepoCloud also be able to pick up updates once these improvements are completed?
  2. How does GPT Researcher deployed on RepoCloud use local documents for knowledge QA, once the DOC_PATH environment variable has been set to point to the documents folder?

adrianhensler commented 2 days ago

I have just done two GPT-researcher reports in succession using the web interface, with the master branch (hash: 557207f).

Setup:

  • AzureOpenAI
  • Bing search

Many of the references cited in the reference list of Report 2 were originally found for Report 1.

Seems to be a bug?

I had a similar issue; using Tavily with OpenAI. The references at the end seemed mixed with two earlier searches.

I wonder if I possibly had multiple sessions (Chrome tabs) open? Or is something potentially not being cleared between runs?

adrianhensler commented 2 days ago

I asked ChatGPT to review; here is the shared chat: https://chatgpt.com/share/09b69dec-842e-4601-afe6-fa1029621ceb

Findings from ChatGPT: "The conduct_research method initializes the research process and sets up the context for the research task. However, it does not explicitly reset or clear the visited_urls set or the source_urls list before starting new research. This could lead to references from previous runs being carried over.

To address this issue, we should ensure that visited_urls and source_urls are cleared at the start of each research task. This can be done by modifying the conduct_research method to reset these attributes at the beginning.

Here is a proposed modification to the conduct_research method to include resetting visited_urls and source_urls:"

async def conduct_research(self):
    """Runs the GPT Researcher to conduct research"""
    # Reset visited_urls and source_urls at the start of each research task
    self.visited_urls.clear()
    self.source_urls = []

adrianhensler commented 2 days ago

A quick test seems to confirm that this fixes the problem: in agent.py, add the lines that reset visited_urls and source_urls at the start of conduct_research.

I'll create a pull request.

async def conduct_research(self):
    """Runs the GPT Researcher to conduct research"""
    # Reset visited_urls and source_urls at the start of each research task
    self.visited_urls.clear()
    self.source_urls = []
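
For anyone who wants to see the carry-over in isolation, here is a minimal sketch. It assumes the same agent instance is reused for both reports, which has not been confirmed in this thread. ToyResearcher, run_two_reports, reset_between_runs and the example.com URLs are hypothetical stand-ins, not gpt-researcher code; only the names visited_urls and source_urls and the two reset lines come from the discussion above.

```python
import asyncio


class ToyResearcher:
    """Hypothetical stand-in for a research agent; NOT the real gpt-researcher class."""

    def __init__(self, reset_between_runs):
        self.reset_between_runs = reset_between_runs
        self.visited_urls = set()   # URLs already scraped
        self.source_urls = []       # URLs that end up in the reference list

    async def conduct_research(self, found_urls):
        if self.reset_between_runs:
            # The fix proposed in this thread: drop state left over from earlier runs.
            self.visited_urls.clear()
            self.source_urls = []
        for url in found_urls:
            if url not in self.visited_urls:
                self.visited_urls.add(url)
                self.source_urls.append(url)
        # In this toy model the reference list is simply everything in source_urls.
        return list(self.source_urls)


def run_two_reports(reset_between_runs):
    """Run two reports on one reused instance; return the second report's references."""
    researcher = ToyResearcher(reset_between_runs)

    async def both():
        await researcher.conduct_research(
            ["https://example.com/report1-a", "https://example.com/report1-b"]
        )
        return await researcher.conduct_research(["https://example.com/report2-a"])

    return asyncio.run(both())


leaky = run_two_reports(reset_between_runs=False)
clean = run_two_reports(reset_between_runs=True)
print("without reset:", leaky)
print("with reset:   ", clean)
```

Without the reset, the second report's list still contains both Report 1 URLs; with the reset, only the Report 2 URL remains. If the real cause turns out to be something else (for example, state shared between browser sessions), this sketch would not cover it.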