Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

Unable to parse pdf from zotfile(zotero plugin) #60

Open goldengrape opened 1 year ago

goldengrape commented 1 year ago

If ZotFile is used, the PDF file cannot be found. Of course, pyzotero is to blame for this.

MilesCranmer commented 1 year ago

To download PDFs, ZoteroDB just calls pyzotero.zotero.Zotero.dump: https://github.com/whitead/paper-qa/blob/93c47ccc7ea496a1167ac2b89e7fa512dee7be7f/paperqa/contrib/zotero.py#L112

This calls Zotero.file internally: https://github.com/urschrei/pyzotero/blob/c61125751eaa99b5ac8d3d1e8842219fbee6dbf6/pyzotero/zotero.py#L700-L723

Can you give more details? Where is zotfile being used? What is the specific zotero item?

You can set:

import logging
logging.basicConfig(level=logging.DEBUG)

to see more debugging information.

mnemeth66 commented 1 year ago

I haven't gotten around to fully implementing this in paperqa, but it seems the structure for files in Zotero is top-level items -> children. For files saved with Zotero directly, the top-level items include the PDF attachment. For files saved with ZotFile, the PDFs are stored as children rather than as top-level items.

@MilesCranmer currently you use Zotero.top(), which should catch all files stored in Zotero directly; to catch the other items, that call needs to be changed to Zotero.items(), and the PDF then downloaded based on the link type. See lifan0127's implementation here: https://gist.github.com/lifan0127/e34bb0cfbf7f03dc6852fd3e80b8fb19.
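
A rough sketch of the "based on the link type" step, in case it helps. It works on plain item dicts shaped like the ones the Zotero Web API returns (a top-level `key` plus a `data` dict with `itemType`, `contentType`, and `linkMode`). The helper name `split_pdf_attachments` and the choice of which link modes count as downloadable are my assumptions; this is not what ZoteroDB or lifan0127's gist does.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

# Assumption: only attachments stored in the Zotero library itself can be
# fetched through the API; linked files live on the user's disk instead.
DOWNLOADABLE_LINK_MODES = {"imported_file", "imported_url"}


def split_pdf_attachments(
    items: Iterable[Dict[str, Any]],
) -> Tuple[List[str], List[Tuple[str, Optional[str]]]]:
    """Separate PDF attachments into downloadable keys and linked-only entries.

    Returns (downloadable_keys, linked), where each linked entry is a
    (key, path) pair; path may be None if the item carries no path field.
    """
    downloadable: List[str] = []
    linked: List[Tuple[str, Optional[str]]] = []
    for item in items:
        data = item.get("data", {})
        # Skip regular items (journal articles, notes, ...) and non-PDF files.
        if data.get("itemType") != "attachment":
            continue
        if data.get("contentType") != "application/pdf":
            continue
        if data.get("linkMode") in DOWNLOADABLE_LINK_MODES:
            downloadable.append(item["key"])
        else:
            linked.append((item["key"], data.get("path")))
    return downloadable, linked
```

With pyzotero, the input could come from Zotero.items() or from Zotero.children() on each top-level item, and the downloadable keys could then be passed to Zotero.dump as ZoteroDB does today. The linked-only entries would need to be read from the local path instead, if that path is reachable at all.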

MilesCranmer commented 1 year ago

Ah, sorry, I didn't understand the first question. Thanks for giving me more details; now this makes sense. Also, I didn't see lifan's gist before writing the one in ZoteroDB. That would have saved me quite a few days of debugging! :sweat_smile:

Do you want to start a draft PR to fix this? Happy to help get things all working.

goldengrape commented 1 year ago

I gave up on the Zotero part, and I now find that it's actually perfectly fine to get the paper directly from the web and then ask questions with paperqa. Storing papers locally for a long time is not necessary.

andreifoldes commented 1 year ago

Hm... am I right in thinking that if a user exhausts their 300 MB of Zotero cloud storage, then the files beyond the limit won't get synced and consequently won't get zoteroDB.iterate'ed?