Closed maspotts closed 2 months ago
Hi @maspotts - not seeing this show-up in unit tests. Can you provide code to reproduce this?
OK so this is really weird! I cannot replicate this when I build a new index and query it in a minimal python script (which replicates ALL of the paperqa-related code from my massive script). But when I build the index in my (massive) python script and then load and query it in my demo script I get the same error. So it's the build process (the index built by my massive script is 'bad'). I drilled in and noticed that with the index built by my massive script I see this:
chabot_index.texts_index.similarity_search('foo')[i].metadata == {'citation': 'XXX', 'dockey': 'YYY', 'key': 'ZZZ'}
whereas with the index built by my demo script I see this:
local_index.texts_index.similarity_search('foo')[i].metadata == {'name': 'ZZZ', 'doc': {'docname': 'YYY', 'citation': 'XXX.', 'dockey': '1ec4d6b213936f22a9a0f6e32c24921c'}}
which has a completely different structure! But I'm running both scripts on the same host, using the same python interpreter, and with the same version of paperqa installed in the same (global) environment, so I'm a bit mind-boggled!
Here's my minimal script for reference:
import os
import pickle
os.environ['OPENAI_API_KEY'] = 'MYKEY'
import paperqa
index = paperqa.Docs(index_path = "tmp-cache-path")
index.add("doc1.pdf")
index.add("doc2pdf")
with open("demo-paperqa-bug-index.pkl", "wb") as file:
pickle.dump(index, file)
## Query the locally-built index (succeeds).
with open("demo-paperqa-bug-index.pkl", "rb") as file:
local_index = pickle.load(file)
print("LOCAL INDEX:")
output = local_index.query("what are some claims?")
print(output)
## Try again with the chatbot-built index (errors out).
with open("chatbots/bots/testy/index.pkl", "rb") as file:
chatbot_index = pickle.load(file)
print("CHATBOT INDEX:")
output = chatbot_index.query("what are some claims?")
print(output)
After a bit more digging: I see that when the index built by the massive script is loaded (from its pickle file) its texts_index
is initialized to a FAISS vector store (inside __setstate__
). But when the index built by my minimal script is loaded its texts_index
is None
, and I believe that's because there is no cached FAISS or (pickled) paperqa index created at the location of index_path
in the initial call to Docs(index_path = "foo")
in the minimal script (although the corresponding call in the massive script DOES create a cache folder at the specified index_path
!). I cannot figure out why the same call:
Docs(index_path = "foo")
succeeds in generating a cache folder in the massive script, but fails (quietly) in the minimal script! Both scripts have the exact same imports, and both are (verified) using paperqa version 3.1.1 and faiss version 1.7.3 .
Aha! I figured it out: doh! It's that index_path
argument to Docs()
that's bitten me again: I had an old cached index at the specified index_path
and that was getting picked up and being turned into a vector store (as texts_index
) on reloading the index. I removed it, and now the exception has disappeared! I assume that that old cached index had that different dict structure because it had been generated from an earlier version of paperqa.
I notice that now, despite still having index_path
specified in my call to Docs()
, the cache directory is NOT being created and populated, which I don't understand, but I actually like, but these cache files have caused me no end of grief! So I've updated my code to: Docs(index_path = None)
, which I arguably should have done a while ago (I think (I thought) at the time that if I didn't specify an explicit index_path
then the cache would be automatically created in ~/.paperqa/default
, which was when I first spent a day scratching my head because multiple chatbots were saving and loading cached FAISS indices from that shared folder without my noticing! But now it seems that if I specify index_path = None
then the cache doesn't get created (or indeed even if I specify a path, it still doesn't get created!), so I guess I'm ok again.
Is there an obvious reason why one CAN specify index_path
? And what is supposed to happen with it? My primary concern is: I don't want any cached index files to be created, because I don't to want to run the risk of different chatbots loading each other's cached index files (which has messed me up twice now!), but I don't know whether setting index_path = None
will actually prevent the index from being created: what's the best way to make sure that no cached index files are created?
Sorry you had so much trouble there! I agree it can be confusing, but I and others depend on loading seamlessly based on name.
The relevant line is here - https://github.com/whitead/paper-qa/blob/0362d08d0813c9a38cd535f240ca03b74863b6be/paperqa/docs.py#L316
So if you do Docs(index_path=None)
in your constructor, it should not save the index. Another option is to set the name of the Docs
object to include the version so it never can load an outdated one - like Docs(name=name + paperqa.__version__)
Seems this is resolved, going to close this out. @maspotts thanks for using paper-qa
, feel free to open a new issue with further questions
After update to latest version of paperqa, this line
response = index.query(input_text, k = k, max_sources = max_sources)
(whereindex = paperqa.Docs(...)
) is now generating aKeyError
exception: