Open g-simmons opened 1 year ago
I think this should be raised in pyzotero rather than here, as we are just calling their zotero downloading API
Thanks for the response. I'll try to make some clarifications for why I think a change in paper-qa might be appropriate. LMK what you think...
I'm aware that paper-qa is using pyzotero, but it looks like the specific download location for pdfs is set within paper-qa.contrib at:
This results in files being downloaded to the storage folder, with a name based on their key. For example a paper titled "Attention is all you Need" with key "ABC123" gets downloaded by contrib.ZoteroDB.get_pdf() to:
<storage>/ABC123.pdf
At least on OSX (I only have evidence from my system), Zotero makes a separate subfolder of the storage folder, with subfolder name based on key, and puts pdfs in the subfolders, with names based on the paper titles. For example a paper titled "Attention is all you Need" with key "ABC123" gets downloaded by the Zotero browser extension to:
<storage>/ABC123/Attention-is-all-you-need.pdf
The suggested change would amount to changing the path in contrib.ZoteroDB.get_pdf() to match the default download path that Zotero uses. The results would be that the already-downloaded pdf is used if it exists, and the pdf downloaded by ZoteroDB gets placed in this location if it doesn't yet exist, with a naming convention that matches Zotero's default naming convention.
This would look something like the below, perhaps with better logging and error handling:
class ZoteroDB(zotero.Zotero):
...
def get_pdf(self, item: dict) -> Union[Path, None]:
"""Gets a filename for a given Zotero key for a PDF.
If the PDF is not found locally, the PDF will be downloaded to a local file at the correct key.
If no PDF exists for the file, None is returned.
Parameters
----------
item : dict
An item from `pyzotero`. Should have a `key` field, and also have an entry
`links->attachment->attachmentType == application/pdf`.
"""
if type(item) != dict:
raise TypeError("Pass the full item of the paper. The item must be a dict.")
pdf_key = self._extract_pdf_key(item)
self.logger.info(f"| PDF key: {pdf_key}")
if pdf_key is None:
return None
# look in the storage directory
pdf_local_path = self.storage / pdf_key
self.logger.info(f"| Local PDF path: {pdf_local_path}")
if pdf_local_path.exists():
self.logger.info(f"| Local PDF path exists: {pdf_local_path}")
pdf_files = list(pdf_local_path.glob("*.pdf"))
self.logger.info(f"| {len(pdf_files )} Local PDF files: {pdf_files}")
if len(pdf_files) == 1:
self.logger.info(f"| Found PDF for {pdf_key} at {pdf_local_path}.")
pdf_path = pdf_local_path / list(pdf_files)[0]
return pdf_path
elif len(pdf_files) > 1:
self.logger.info(f"| More than one PDF found for {pdf_key}.")
title = _get_item_title(item) # to be implemented
pdf_path = self.storage / pdf_key / f"{title}.pdf"
if not pdf_path.exists():
pdf_path.parent.mkdir(parents=True, exist_ok=True)
self.logger.info(
f"| Downloading PDF for: {_get_citation_key(item)}"
)
self.dump(pdf_key, pdf_path)
return pdf_path
Thanks for explaining, I understand your question better now. If you have a working implementation that can use the default Zotero paths, I would definitely be in favor of adding that.
I would also be in favor of treating the local Zotero storage as a read-only directory: if a PDF is found, it is used, otherwise, a PDF is downloaded to the .paperqa directory (rather than to the Zotero storage). (I think this is safer because I’m not sure if it would cause problems if we write to internal Zotero storage rather than letting it sync itself.)
:)
Yeah I was also wondering about interference with Zotero functionality...
In that case, should probably:
What do you think?
Sounds good to me!
I'm on Zotero v6.0.23, on Mac OS, and attachments get stored in the following format:
/Users/myusername/Zotero/storage/key/papertitle.pdf
If I initialize a ZoteroDB with
storage=/Users/myusername/Zotero/storage/
, it seems like ZoteroDB will look for and store attachments as/Users/myusername/Zotero/storage/key.pdf
On my system this leads to some unnecessary redownloading that can be avoided by modifying the behavior of
ZoteroDB.get_pdf()
.Am I understanding this correctly? And do you know if Zotero uses the same directory structure across platforms?