Future-House / paper-qa

High accuracy RAG for answering questions from scientific documents with citations
Apache License 2.0

exception: 'OpenAIEmbeddings' object has no attribute 'embeddings' when supplying langchain LLM as embedding_client #304

Closed: maspotts closed this issue 1 month ago

maspotts commented 1 month ago

I'm trying to supply a langchain embedding model (an instance of OpenAIEmbeddings) via Docs(index_path = None, llm = 'langchain', embedding_client = embedding_llm), but when I try to index a document I get:

  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/llms.py", line 132, in embed_documents
    response = await client.embeddings.create(
  File "/Users/mike/src/chatbot/./chatbot", line 2387, in __getattr__
    return getattr(self.handle, name)
AttributeError: 'OpenAIEmbeddings' object has no attribute 'embeddings'

Am I misunderstanding the usage?
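For readers skimming the traceback: the failing call is client.embeddings.create(...), which is the shape of the OpenAI SDK client, whereas LangChain embedding objects expose embed_documents / aembed_documents instead. The sketch below reproduces the same kind of AttributeError with stand-in code. FakeLangchainEmbeddings, openai_style_embed, langchain_style_embed and demo are hypothetical names made up for illustration; none of this is paperqa's or LangChain's actual code.

```python
import asyncio


class FakeLangchainEmbeddings:
    """Stand-in for a LangChain-style embeddings object (hypothetical, not the real class)."""

    async def aembed_documents(self, texts):
        # LangChain embedding classes expose embed_documents / aembed_documents.
        # Here each "embedding" is just the text length, to keep the sketch offline.
        return [[float(len(t))] for t in texts]


async def openai_style_embed(client, texts):
    # Same call shape as the line in the traceback: client.embeddings.create(...).
    # A LangChain-style object has no `.embeddings` attribute, so this raises.
    return await client.embeddings.create(input=texts, model="dummy-model")


async def langchain_style_embed(client, texts):
    # Call shape that matches the LangChain embeddings interface instead.
    return await client.aembed_documents(texts)


def demo():
    client = FakeLangchainEmbeddings()
    try:
        asyncio.run(openai_style_embed(client, ["hello"]))
        error = None
    except AttributeError as exc:
        error = str(exc)
    vectors = asyncio.run(langchain_style_embed(client, ["hello", "hi"]))
    return error, vectors


if __name__ == "__main__":
    print(demo())
```

So the exception suggests the client is being handed to a code path written for the OpenAI SDK, rather than one that knows about LangChain embeddings.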

maspotts commented 1 month ago

Here's a minimal example:

import os
from paperqa import Docs
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings

os.environ['OPENAI_API_KEY'] = '<OBFUSCATED>'

generator_llm = ChatOpenAI(model_name = 'gpt-4o')
embedding_llm = OpenAIEmbeddings(model = 'text-embedding-3-large')
docs = Docs(index_path = None, llm = 'langchain', embedding_client = embedding_llm, client = generator_llm)
docs.add("dummy.pdf")

Result:

Traceback (most recent call last):
  File "/Users/mike/src/chatbot/demo-paperqa-langchain-bug.py", line 11, in <module>
    docs.add("dummy.pdf")
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 353, in add
    return loop.run_until_complete(
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
    return future.result()
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 426, in aadd
    if await self.aadd_texts(texts, doc):
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 455, in aadd_texts
    await self.texts_index.embedding_model.embed_documents(
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/llms.py", line 161, in embed_documents
    return await embed_documents(cast(AsyncOpenAI, client), texts, self.name)
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/llms.py", line 132, in embed_documents
    response = await client.embeddings.create(
AttributeError: 'OpenAIEmbeddings' object has no attribute 'embeddings'

maspotts commented 1 month ago

I did notice in the README:

from paperqa import Docs, LangchainEmbeddingModel
docs = Docs(embedding_model=LangchainEmbeddingModel(), embedding_client=OpenAIEmbeddings())

but there is no embedding_model argument to Docs.__init__(), and when I try that I get:

  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 119, in __init__
    super().__init__(**data)
  File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
    self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Docs
embedding_model
  Extra inputs are not permitted [type=extra_forbidden, input_value=LangchainEmbeddingModel(name='langchain'), input_type=LangchainEmbeddingModel]
    For further information visit https://errors.pydantic.dev/2.6/v/extra_forbidden

Is there a working example I can follow?

maspotts commented 1 month ago

OK so I tried the test_langchain_embeddings() test and it failed with the same exception. I verified that I'm using paperqa 4.9.0. I'm very confused! Presumably test_langchain_embeddings() must work when you run your tests, or the release wouldn't have succeeded? Can someone tell me what I'm doing wrong? This is blocking me at the moment :(

maspotts commented 1 month ago

OK I think I've figured it out (or at least why I'm getting this exception): the test calls:

    docs = Docs(
        texts_index=NumpyVectorStore(embedding_model=LangchainEmbeddingModel()),
        docs_index=NumpyVectorStore(embedding_model=LangchainEmbeddingModel()),
        embedding_client=OpenAIEmbeddings(),
    )

but I was not creating an explicit texts_index (or docs_index). When I add those it works. I guess maybe the README needs to be updated to reflect that? Currently it just shows:

docs = Docs(embedding_model=LangchainEmbeddingModel(), embedding_client=OpenAIEmbeddings())

which fails.

So, is there anything in particular I should keep in mind when creating a NumpyVectorStore() and passing it to Docs() like this? Will this example give me the equivalent of just calling Docs() without specifying an embedding client?

mskarlin commented 1 month ago

Thanks for bringing this up @maspotts -- you're totally right that the README needs some updates. It currently uses a deprecated Docs argument. I'm putting in a PR with README changes that cover this implementation in more detail.

There are two ways to specify the embedding model. The first is the embedding argument (e.g. Docs(embedding="text-embedding-3-large")), which supports OpenAI models, VoyageAI models, Sentence Transformers models and "hybrid" models, with some sensible defaults. If you'd like to customize more, like choosing a vector store or giving arguments to your embedding model, then the second way is to manually specify the indexes, just like you found in the tests.
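A rough sketch of the two routes side by side, pieced together only from the snippets quoted in this thread (paperqa 4.9.0). The function names build_docs_default and build_docs_with_langchain are hypothetical helpers, and importing NumpyVectorStore from the top-level paperqa package is an assumption. The imports are placed inside the functions so the file loads even without paperqa or langchain_openai installed; actually calling either function needs those packages and an OPENAI_API_KEY.

```python
def build_docs_default(embedding_name="text-embedding-3-large"):
    """Route 1: pass an embedding name and let Docs choose the defaults."""
    # Lazy import so this sketch can be loaded without paperqa installed.
    from paperqa import Docs

    return Docs(embedding=embedding_name)


def build_docs_with_langchain(embedding_client=None):
    """Route 2: build the indexes explicitly, as in the test quoted above."""
    # Assumption: NumpyVectorStore is importable from the top-level package.
    from langchain_openai import OpenAIEmbeddings
    from paperqa import Docs, LangchainEmbeddingModel, NumpyVectorStore

    if embedding_client is None:
        # Arguments to the embedding model go on the LangChain object itself.
        embedding_client = OpenAIEmbeddings(model="text-embedding-3-large")

    return Docs(
        texts_index=NumpyVectorStore(embedding_model=LangchainEmbeddingModel()),
        docs_index=NumpyVectorStore(embedding_model=LangchainEmbeddingModel()),
        embedding_client=embedding_client,
    )
```

With either helper, docs = build_docs_with_langchain() followed by docs.add("dummy.pdf") would correspond to the minimal example earlier in the thread; whether the two routes behave identically is not something this thread confirms.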

maspotts commented 1 month ago

Thanks: got it.