Closed: maspotts closed this issue 1 month ago
Here's a minimal example:
import os
from paperqa import Docs
from langchain_openai import ChatOpenAI
from langchain_openai import OpenAIEmbeddings
os.environ['OPENAI_API_KEY'] = '<OBFUSCATED>'
generator_llm = ChatOpenAI(model_name='gpt-4o')
embedding_llm = OpenAIEmbeddings(model='text-embedding-3-large')
docs = Docs(index_path=None, llm='langchain', embedding_client=embedding_llm, client=generator_llm)
docs.add("dummy.pdf")
Result:
Traceback (most recent call last):
File "/Users/mike/src/chatbot/demo-paperqa-langchain-bug.py", line 11, in <module>
docs.add("dummy.pdf")
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 353, in add
return loop.run_until_complete(
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/asyncio/base_events.py", line 646, in run_until_complete
return future.result()
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 426, in aadd
if await self.aadd_texts(texts, doc):
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 455, in aadd_texts
await self.texts_index.embedding_model.embed_documents(
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/llms.py", line 161, in embed_documents
return await embed_documents(cast(AsyncOpenAI, client), texts, self.name)
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/llms.py", line 132, in embed_documents
response = await client.embeddings.create(
AttributeError: 'OpenAIEmbeddings' object has no attribute 'embeddings'
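From the traceback, paperqa's default embedding path casts whatever is passed as embedding_client to AsyncOpenAI and then calls client.embeddings.create(...). A LangChain OpenAIEmbeddings object exposes embed_documents() instead and has no .embeddings attribute, hence the AttributeError. A minimal stdlib sketch of the interface mismatch (the Fake* classes below are hypothetical stand-ins for illustration, not the real clients):

```python
class FakeEmbeddingsEndpoint:
    """Stub for the `client.embeddings` sub-object on an OpenAI client."""
    def create(self, input, model):
        # The real endpoint returns embedding vectors; this is a stub.
        return [[0.0] * 3 for _ in input]

class FakeAsyncOpenAI:
    """Shape paperqa's default code path expects: client.embeddings.create(...)."""
    def __init__(self):
        self.embeddings = FakeEmbeddingsEndpoint()

class FakeLangchainEmbeddings:
    """Shape of a LangChain embeddings object: embed_documents(), no .embeddings."""
    def embed_documents(self, texts):
        return [[0.0] * 3 for _ in texts]

# The default paperqa path effectively does client.embeddings.create(...):
print(hasattr(FakeAsyncOpenAI(), "embeddings"))        # True: call succeeds
print(hasattr(FakeLangchainEmbeddings(), "embeddings"))  # False: AttributeError
```

So passing a LangChain embeddings object where an OpenAI client is expected fails by duck typing, not by any explicit type check.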
I did notice in the README:
from paperqa import Docs, LangchainEmbeddingModel
docs = Docs(embedding_model=LangchainEmbeddingModel(), embedding_client=OpenAIEmbeddings())
but there is no embedding_model argument to Docs.__init__(), and when I try that I get:
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/paperqa/docs.py", line 119, in __init__
super().__init__(**data)
File "/Users/mike/.pyenv/versions/3.10.7/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/site-packages/pydantic/main.py", line 171, in __init__
self.__pydantic_validator__.validate_python(data, self_instance=self)
pydantic_core._pydantic_core.ValidationError: 1 validation error for Docs
embedding_model
Extra inputs are not permitted [type=extra_forbidden, input_value=LangchainEmbeddingModel(name='langchain'), input_type=LangchainEmbeddingModel]
For further information visit https://errors.pydantic.dev/2.6/v/extra_forbidden
Is there a working example I can follow?
OK, so I tried the test_langchain_embeddings() test and it failed with the same exception. I verified that I'm using paperqa 4.9.0. I'm very confused! Presumably test_langchain_embeddings() must work when you run your tests, or the release wouldn't have succeeded? Can someone tell me what I'm doing wrong? This is blocking me at the moment :(
OK, I think I've figured it out (or at least why I'm getting this exception): the test calls:
docs = Docs(
texts_index=NumpyVectorStore(embedding_model=LangchainEmbeddingModel()),
docs_index=NumpyVectorStore(embedding_model=LangchainEmbeddingModel()),
embedding_client=OpenAIEmbeddings(),
)
but I was not creating an explicit texts_index (or docs_index). When I add those, it works. I guess the README needs to be updated to reflect that? Currently it just shows:
docs = Docs(embedding_model=LangchainEmbeddingModel(), embedding_client=OpenAIEmbeddings())
which fails.
So, is there anything in particular I should keep in mind when creating and passing NumpyVectorStore() to Docs() like this? Will this example give me the equivalent of just calling Docs() without specifying an embedding client?
Thanks for bringing this up @maspotts -- you're totally right that the README needs some updates. It currently uses a deprecated Docs argument. I'm putting in a PR with README changes that highlight this implementation in more detail.
There are two ways to specify the embedding model. The first is the embedding argument (i.e. Docs(embedding="text-embedding-3-large")), which supports OpenAI models, VoyageAI models, Sentence Transformers models, and "hybrid" models with some sensible defaults. If you'd like to customize more, like choosing a vector store or giving arguments to your embedding model, then you need to manually specify the indexes, just as you found in the tests.
Thanks: got it.
I'm trying to supply a LangChain embedding model (an instance of OpenAIEmbeddings) via:
Docs(index_path=None, llm='langchain', embedding_client=embedding_llm)
but when I try to index a document I get the traceback above. Am I misunderstanding the usage?