Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0
6.48k stars 619 forks source link

Directory structure for ZoteroDB? #66

Open g-simmons opened 1 year ago

g-simmons commented 1 year ago

I'm on Zotero v6.0.23, on Mac OS, and attachments get stored in the following format:

/Users/myusername/Zotero/storage/key/papertitle.pdf

If I initialize a ZoteroDB with storage=/Users/myusername/Zotero/storage/, it seems like ZoteroDB will look for and store attachments as

/Users/myusername/Zotero/storage/key.pdf

On my system this leads to some unnecessary redownloading that can be avoided by modifying the behavior of ZoteroDB.get_pdf().

Am I understanding this correctly? And do you know if Zotero uses the same directory structure across platforms?

MilesCranmer commented 1 year ago

I think this should be raised in pyzotero rather than here, as we are just calling their zotero downloading API

g-simmons commented 1 year ago

Thanks for the response. I'll try to make some clarifications for why I think a change in paper-qa might be appropriate. LMK what you think...

Download location is set within paper-qa

I'm aware that paper-qa is using pyzotero, but it looks like the specific download location for pdfs is set within paper-qa.contrib at:

https://github.com/whitead/paper-qa/blob/93c47ccc7ea496a1167ac2b89e7fa512dee7be7f/paperqa/contrib/zotero.py#L107

This results in files being downloaded to the storage folder, with a name based on their key. For example a paper titled "Attention is all you Need" with key "ABC123" gets downloaded by contrib.ZoteroDB.get_pdf() to:

<storage>/ABC123.pdf

Download location in paper-qa doesn't match default Zotero download location

At least on OSX (I only have evidence from my system), Zotero makes a separate subfolder of the storage folder, with subfolder name based on key, and puts pdfs in the subfolders, with names based on the paper titles. For example a paper titled "Attention is all you Need" with key "ABC123" gets downloaded by the Zotero browser extension to:

<storage>/ABC123/Attention-is-all-you-need.pdf

Assumptions:

Suggested change

The suggested change would amount to changing the path in contrib.ZoteroDB.get_pdf() to match the default download path that Zotero uses. The results would be that the already-downloaded pdf is used if it exists, and the pdf downloaded by ZoteroDB gets placed in this location if it doesn't yet exist, with a naming convention that matches Zotero's default naming convention.

This would look something like the below, perhaps with better logging and error handling:

class ZoteroDB(zotero.Zotero):
    ...
    def get_pdf(self, item: dict) -> Union[Path, None]:
        """Gets a filename for a given Zotero key for a PDF.

        If the PDF is not found locally, the PDF will be downloaded to a local file at the correct key.
        If no PDF exists for the file, None is returned.

        Parameters
        ----------
        item : dict
            An item from `pyzotero`. Should have a `key` field, and also have an entry
            `links->attachment->attachmentType == application/pdf`.
        """
        if type(item) != dict:
            raise TypeError("Pass the full item of the paper. The item must be a dict.")

        pdf_key = self._extract_pdf_key(item)

        self.logger.info(f"|  PDF key: {pdf_key}")

        if pdf_key is None:
            return None

        # look in the storage directory
        pdf_local_path = self.storage / pdf_key
        self.logger.info(f"|  Local PDF path: {pdf_local_path}")

        if pdf_local_path.exists():
            self.logger.info(f"|  Local PDF path exists: {pdf_local_path}")
            pdf_files = list(pdf_local_path.glob("*.pdf"))
            self.logger.info(f"|  {len(pdf_files )} Local PDF files: {pdf_files}")
            if len(pdf_files) == 1:
                self.logger.info(f"|   Found PDF for {pdf_key} at {pdf_local_path}.")
                pdf_path = pdf_local_path / list(pdf_files)[0]
                return pdf_path

            elif len(pdf_files) > 1:
                self.logger.info(f"|   More than one PDF found for {pdf_key}.")

        title = _get_item_title(item) # to be implemented

        pdf_path = self.storage / pdf_key / f"{title}.pdf"

        if not pdf_path.exists():
            pdf_path.parent.mkdir(parents=True, exist_ok=True)
            self.logger.info(
                f"|  Downloading PDF for: {_get_citation_key(item)}"
            )
            self.dump(pdf_key, pdf_path)

        return pdf_path
MilesCranmer commented 1 year ago

Thanks for explaining, I understand your question better now. If you have a working implementation that can use the default Zotero paths, I would definitely be in favor of adding that.

MilesCranmer commented 1 year ago

I would also be in favor of treating the local Zotero storage as a read-only directory: if a PDF is found, it is used, otherwise, a PDF is downloaded to the .paperqa directory (rather than to the Zotero storage). (I think this is safer because I’m not sure if it would cause problems if we write to internal Zotero storage rather than letting it sync itself.)

g-simmons commented 1 year ago

:)

Yeah I was also wondering about interference with Zotero functionality...

In that case, should probably:

What do you think?

MilesCranmer commented 1 year ago

Sounds good to me!