Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

EmptyDocsError: Not gathering evidence due to having no papers. #470

Open SURUIYUAN opened 3 weeks ago

SURUIYUAN commented 3 weeks ago

Why does it show "EmptyDocsError: Not gathering evidence due to having no papers."? Can it not detect my paper?

dosubot[bot] commented 3 weeks ago

The error "EmptyDocsError: Not gathering evidence due to having no papers." appears because the gather_evidence method in the GatherEvidence class checks whether state.docs.docs contains any documents. If it is empty, the method raises EmptyDocsError. In other words, no papers were detected or added to the state before gather_evidence was called [1].
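
The check described above can be pictured with a small, self-contained toy model. The classes below (ToyDocs, ToyState) are simplified stand-ins written for illustration and are not paper-qa's actual implementation; only the names EmptyDocsError, gather_evidence, and state.docs.docs come from this thread.

```python
from dataclasses import dataclass, field


class EmptyDocsError(RuntimeError):
    """Stand-in for the error discussed in this thread."""


@dataclass
class ToyDocs:
    # Maps a document key to a document record (here just a title string).
    docs: dict = field(default_factory=dict)


@dataclass
class ToyState:
    docs: ToyDocs = field(default_factory=ToyDocs)


def gather_evidence(question: str, state: ToyState) -> str:
    """Refuse to proceed when no documents have been added to the state."""
    if not state.docs.docs:
        raise EmptyDocsError("Not gathering evidence due to having no papers.")
    count = len(state.docs.docs)
    return f"Gathering evidence for {question!r} from {count} paper(s)."
```

In this model the error goes away as soon as at least one entry exists in state.docs.docs, which is why the fix is about getting papers into the state rather than about the evidence step itself.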

To resolve this, ensure that papers are correctly added to state.docs.docs using methods like aadd_texts in the Docs class. This can be done through the paper_search function or by directly adding documents to the Docs object [1][2].
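
Before debugging the library, it may be worth ruling out the simplest cause: the directory you point the tool at does not exist or contains no readable files. The sketch below is a stand-alone pre-flight check that does not call paper-qa at all. The function name find_candidate_papers and the suffix list are assumptions made for illustration, so adjust them to your setup.

```python
from pathlib import Path
from typing import List, Union

# Assumption: file types worth checking for; adjust to match your papers.
SUPPORTED_SUFFIXES = {".pdf", ".txt", ".html"}


def find_candidate_papers(paper_directory: Union[str, Path]) -> List[Path]:
    """Return non-empty files with a supported suffix, searched recursively."""
    root = Path(paper_directory).expanduser()
    if not root.is_dir():
        raise FileNotFoundError(f"Paper directory does not exist: {root}")
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in SUPPORTED_SUFFIXES
        and p.stat().st_size > 0  # zero-byte files cannot yield any text
    )
```

If this returns an empty list for your directory, the path or the file types are the likely problem. If it does find files, the next place to look is the search and indexing step.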

To continue talking to Dosu, mention @dosu.

SURUIYUAN commented 3 weeks ago

How do I fix this in my Python program? Please explain in detail.

dosubot[bot] commented 3 weeks ago

To ensure that papers are correctly added to state.docs.docs and avoid the "EmptyDocsError," you need to verify and possibly modify the paper_search method in the PaperSearch class. Here is a detailed explanation of the necessary steps:

  1. Ensure paper_search Method Updates state.docs.docs: The paper_search method should correctly update the state.docs.docs attribute by adding documents using the aadd_texts method. Here is the relevant part of the code:

    async def paper_search(
       self,
       query: str,
       min_year: int | None,
       max_year: int | None,
       state: EnvironmentState,
    ) -> str:
       # ... (other code)
    
       # combine all the resulting doc objects into one and update the state
       all_docs = []
       for r in results:
           this_doc = next(iter(r.docs.values()))
           all_docs.append(this_doc)
           await state.docs.aadd_texts(
               texts=r.texts,
               doc=this_doc,
               settings=self.settings,
               embedding_model=self.embedding_model,
           )
    
       status = state.status
       logger.info(status)
       # mark how far we've searched so that continuation will start at the right place
       self.previous_searches[search_key] += self.settings.agent.search_count
       if self.settings.agent.return_paper_metadata:
           retrieved_papers = "\n".join([f"{x.title} ({x.year})" for x in all_docs])
           return f"Retrieved Papers:\n{retrieved_papers}\n\n{status}"
       return status

  2. Verify results from index.query: Ensure that the results from the index.query call are not empty. If results is empty, no documents will be added to state.docs.docs, leading to the "EmptyDocsError".

  3. Check aadd_texts Method: The aadd_texts method in the Docs class should correctly add texts to the docs dictionary. Here is the method signature and its parameters:

    async def aadd_texts(
       self,
       texts: list[Text],
       doc: Doc,
       settings: MaybeSettings = None,
       embedding_model: EmbeddingModel | None = None,
    ) -> bool:
       ...  # body omitted; only the signature is shown

    Ensure that this method is functioning correctly and that the texts and doc parameters are properly passed.

  4. Ensure Docs Class is Correctly Implemented: The Docs class should manage the docs dictionary and other related attributes properly. Here is a snippet of the Docs class for reference:

    class Docs(BaseModel):
       """A collection of documents to be used for answering questions."""
    
       model_config = ConfigDict(extra="forbid")
    
       id: UUID = Field(default_factory=uuid4)
       docs: dict[DocKey, Doc | DocDetails] = Field(default_factory=dict)
       texts: list[Text] = Field(default_factory=list)
       docnames: set[str] = Field(default_factory=set)
       texts_index: VectorStore = Field(default_factory=NumpyVectorStore)
       name: str = Field(default="default", description="Name of this docs collection")
       index_path: Path | None = Field(
           default=PAPERQA_DIR, description="Path to save index", validate_default=True
       )
       deleted_dockeys: set[DocKey] = Field(default_factory=set)
    
       @field_validator("index_path")
       @classmethod
       def handle_default(cls, value: Path | None, info: ValidationInfo) -> Path | None:
           if value == PAPERQA_DIR:
               return PAPERQA_DIR / info.data["name"]
           return value
    
       def clear_docs(self) -> None:
           self.texts = []
           self.docs = {}
           self.docnames = set()
    
       def _get_unique_name(self, docname: str) -> str:
           """Create a unique name given proposed name."""
           suffix = ""
           while (docname + suffix) in self.docnames:
               suffix = "a" if suffix == "" else chr(ord(suffix) + 1)
           docname += suffix
           return docname
    
       async def aadd_file(
           self,
           file: BinaryIO,
           citation: str | None = None,
           docname: str | None = None,
           dockey: DocKey | None = None,
           title: str | None = None,
           doi: str | None = None,
           authors: list[str] | None = None,
           settings: MaybeSettings = None,
           llm_model: LLMModel | None = None,
           embedding_model: EmbeddingModel | None = None,
           **kwargs,
       ) -> str | None:
           """Add a document to the collection."""
           # just put in temp file and use existing method
           suffix = ".txt"
           if maybe_is_pdf(file):
               suffix = ".pdf"
           elif maybe_is_html(file):
               suffix = ".html"
    
           with tempfile.NamedTemporaryFile(suffix=suffix) as f:
               f.write(file.read())
               f.seek(0)
               return await self.aadd(
                   Path(f.name),
                   citation=citation,
                   docname=docname,
                   dockey=dockey,
                   title=title,
                   doi=doi,
                   authors=authors,
                   settings=settings,
                   llm_model=llm_model,
                   embedding_model=embedding_model,
                   **kwargs,
               )
    
       async def aadd(
           self,
           path: Path,
           citation: str | None = None,
           docname: str | None = None,
           dockey: DocKey | None = None,
           title: str | None = None,
           doi: str | None = None,
           authors: list[str] | None = None,
           settings: MaybeSettings = None,
           llm_model: LLMModel | None = None,
           embedding_model: EmbeddingModel | None = None,
           **kwargs,
       ) -> str | None:
           """Add a document to the collection."""
           all_settings = get_settings(settings)
           parse_config = all_settings.parsing
           if dockey is None:
               dockey = md5sum(path)
           if llm_model is None:
               llm_model = all_settings.get_llm()
           if citation is None:
               texts = read_doc(
                   path,
                   Doc(docname="", citation="", dockey=dockey),  # Fake doc
                   chunk_chars=parse_config.chunk_size,
                   overlap=parse_config.overlap,
               )
               if not texts:
                   raise ValueError(f"Could not read document {path}. Is it empty?")
               result = await llm_model.run_prompt(
                   prompt=parse_config.citation_prompt,
                   data={"text": texts[0].text},
                   skip_system=True,
               )
               citation = result.text
               if (
                   len(citation) < 3
                   or "Unknown" in citation
                   or "insufficient" in citation
               ):
                   citation = f"Unknown, {os.path.basename(path)}, {datetime.now().year}"
           if docname is None:
               match = re.search(r"([A-Z][a-z]+)", citation)
               if match is not None:
                   author = match.group(1)
               else:
                   raise ValueError(
                       f"Could not parse docname from citation {citation}. "
                       "Consider just passing key explicitly - e.g. docs.py "
                       "(path, citation, key='mykey')"
                   )
               year = ""
               match = re.search(r"(\d{4})", citation)
               if match is not None:
                   year = match.group(1)
               docname = f"{author}{year}"
           docname = self._get_unique_name(docname)
    
           doc = Doc(docname=docname, citation=citation, dockey=dockey)
           self.docs[dockey] = doc
           self.docnames.add(docname)
           self.texts.extend(texts)
           self._build_texts_index()
           return docname

By ensuring that the paper_search method correctly updates state.docs.docs and verifying the implementation of the Docs class and its methods, you can avoid the "EmptyDocsError" and ensure that papers are correctly detected and added to the state [1][2].
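
The docname handling in the Docs snippet can be tried in isolation. The sketch below copies the two regular expressions and the suffix loop from that snippet into stand-alone functions. The function names derive_docname and unique_name are hypothetical and not part of paper-qa, and the sample citation is made up.

```python
import re


def derive_docname(citation: str) -> str:
    """Build a docname from the first capitalized word and first 4-digit run."""
    author_match = re.search(r"([A-Z][a-z]+)", citation)
    if author_match is None:
        raise ValueError(f"Could not parse docname from citation {citation}.")
    year_match = re.search(r"(\d{4})", citation)
    year = year_match.group(1) if year_match is not None else ""
    return f"{author_match.group(1)}{year}"


def unique_name(docname: str, existing: set) -> str:
    """Append 'a', 'b', ... until the name is not already taken."""
    suffix = ""
    while (docname + suffix) in existing:
        suffix = "a" if suffix == "" else chr(ord(suffix) + 1)
    return docname + suffix


citation = "Smith, J. and Doe, A. A made-up paper title. Journal, 2021."
name = derive_docname(citation)
print(name)
print(unique_name(name, {name}))
```

A citation with no capitalized word raises ValueError in this sketch, mirroring the snippet; passing docname explicitly avoids that parsing step.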
