Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0
6.41k stars 609 forks source link

KeyError: 'doc' exception from index.query() after latest update #152

Closed maspotts closed 2 months ago

maspotts commented 1 year ago

After update to latest version of paperqa, this line response = index.query(input_text, k = k, max_sources = max_sources) (where index = paperqa.Docs(...)) is now generating a KeyError exception:

Traceback (most recent call last):
  File "/Users/mike/src/chatbot/./chatbot", line 5005, in <module>
    answer = bot.query(args.query)
  File "/Users/mike/src/chatbot/./chatbot", line 2417, in query
    responses = async_handler(self.indices, sync_func, async_func, features, options,
  File "/Users/mike/src/chatbot/./chatbot", line 902, in async_handler
    outputs.append(sync_func(input, **kwargs))
  File "/Users/mike/src/chatbot/./chatbot", line 2407, in sync_func
    response = index.query(input_text, k = k, max_sources = max_sources)
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 488, in query
    return loop.run_until_complete(
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 525, in aquery
    answer = await self.aget_evidence(
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 399, in aget_evidence
    matches = [
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 402, in <listcomp>
    if m.metadata["doc"]["dockey"] not in self.deleted_dockeys
KeyError: 'doc'
whitead commented 1 year ago

Hi @maspotts - not seeing this show-up in unit tests. Can you provide code to reproduce this?

maspotts commented 1 year ago

OK so this is really weird! I cannot replicate this when I build a new index and query it in a minimal python script (which replicates ALL of the paperqa-related code from my massive script). But when I build the index in my (massive) python script and then load and query it in my demo script I get the same error. So it's the build process (the index built by my massive script is 'bad'). I drilled in and noticed that with the index built by my massive script I see this:

chabot_index.texts_index.similarity_search('foo')[i].metadata == {'citation': 'XXX', 'dockey': 'YYY', 'key': 'ZZZ'}

whereas with the index built by my demo script I see this:

local_index.texts_index.similarity_search('foo')[i].metadata == {'name': 'ZZZ', 'doc': {'docname': 'YYY', 'citation': 'XXX.', 'dockey': '1ec4d6b213936f22a9a0f6e32c24921c'}}

which has a completely different structure! But I'm running both scripts on the same host, using the same python interpreter, and with the same version of paperqa installed in the same (global) environment, so I'm a bit mind-boggled!

Here's my minimal script for reference:

import os
import pickle

os.environ['OPENAI_API_KEY'] = 'MYKEY'

import paperqa

index = paperqa.Docs(index_path = "tmp-cache-path")

index.add("doc1.pdf")
index.add("doc2pdf")

with open("demo-paperqa-bug-index.pkl", "wb") as file:
    pickle.dump(index, file)

## Query the locally-built index (succeeds).

with open("demo-paperqa-bug-index.pkl", "rb") as file:
    local_index = pickle.load(file)

print("LOCAL INDEX:")
output = local_index.query("what are some claims?")
print(output)

## Try again with the chatbot-built index (errors out).

with open("chatbots/bots/testy/index.pkl", "rb") as file:
    chatbot_index = pickle.load(file)

print("CHATBOT INDEX:")
output = chatbot_index.query("what are some claims?")
print(output)
maspotts commented 1 year ago

After a bit more digging: I see that when the index built by the massive script is loaded (from its pickle file) its texts_index is initialized to a FAISS vector store (inside __setstate__). But when the index built by my minimal script is loaded its texts_index is None, and I believe that's because there is no cached FAISS or (pickled) paperqa index created at the location of index_path in the initial call to Docs(index_path = "foo") in the minimal script (although the corresponding call in the massive script DOES create a cache folder at the specified index_path!). I cannot figure out why the same call:

Docs(index_path = "foo")

succeeds in generating a cache folder in the massive script, but fails (quietly) in the minimal script! Both scripts have the exact same imports, and both are (verified) using paperqa version 3.1.1 and faiss version 1.7.3 .

maspotts commented 1 year ago

Aha! I figured it out: doh! It's that index_path argument to Docs() that's bitten me again: I had an old cached index at the specified index_path and that was getting picked up and being turned into a vector store (as texts_index) on reloading the index. I removed it, and now the exception has disappeared! I assume that that old cached index had that different dict structure because it had been generated from an earlier version of paperqa.

I notice that now, despite still having index_path specified in my call to Docs(), the cache directory is NOT being created and populated, which I don't understand, but I actually like, but these cache files have caused me no end of grief! So I've updated my code to: Docs(index_path = None), which I arguably should have done a while ago (I think (I thought) at the time that if I didn't specify an explicit index_path then the cache would be automatically created in ~/.paperqa/default, which was when I first spent a day scratching my head because multiple chatbots were saving and loading cached FAISS indices from that shared folder without my noticing! But now it seems that if I specify index_path = None then the cache doesn't get created (or indeed even if I specify a path, it still doesn't get created!), so I guess I'm ok again.

Is there an obvious reason why one CAN specify index_path? And what is supposed to happen with it? My primary concern is: I don't want any cached index files to be created, because I don't to want to run the risk of different chatbots loading each other's cached index files (which has messed me up twice now!), but I don't know whether setting index_path = None will actually prevent the index from being created: what's the best way to make sure that no cached index files are created?

whitead commented 1 year ago

Sorry you had so much trouble there! I agree it can be confusing, but I and others depend on loading seamlessly based on name.

The relevant line is here - https://github.com/whitead/paper-qa/blob/0362d08d0813c9a38cd535f240ca03b74863b6be/paperqa/docs.py#L316

So if you do Docs(index_path=None) in your constructor, it should not save the index. Another option is to set the name of the Docs object to include the version so it never can load an outdated one - like Docs(name=name + paperqa.__version__)

jamesbraza commented 2 months ago

Seems this is resolved, going to close this out. @maspotts thanks for using paper-qa, feel free to open a new issue with further questions