crate-workbench / langchain

⚡ Building applications with LLMs through composability ⚡
https://python.langchain.com
MIT License

Vector Store: "Collection not found" when using `pre_delete_collection=True` #5

Closed ckurze closed 10 months ago

ckurze commented 11 months ago

Problem

When using pre_delete_collection=True, there is only an error stating "Collection not found"; the actual collection is not deleted / emptied.

Details

Example: vector_search.ipynb

COLLECTION_NAME = "state_of_the_union_test"

embeddings = OpenAIEmbeddings()

db = CrateDBVectorSearch.from_documents(
    embedding=embeddings,
    documents=docs,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True,
)

amotl commented 11 months ago

Hi Christian,

thanks for reporting. I've added a self-contained example program at ^1, but I haven't been able to reproduce the "Collection not found" problem. I tried it against an already-running CrateDB instance, and once more against a recycled one without any existing tables.

Could you try again? Maybe the situation has improved in the meantime, and the flaw was resolved by another recently added fix?

On the other hand, maybe my example program is still incomplete, and you would be able to complete it, in order to reproduce the problem?

With kind regards, Andreas.

amotl commented 11 months ago

Indeed, I am also observing problems in the "Overwriting a vector store" section of vector_search.ipynb ^1.

____ notebook: nbregression(vector_search) ____
nbclient.exceptions.CellExecutionError: An error occurred while executing the following cell:
------------------
docs_with_score[0]
------------------

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[20], line 1
----> 1 docs_with_score[0]

IndexError: list index out of range

### Overwriting a vector store

If you have an existing collection, you can overwrite it by using `from_documents`,
and setting `pre_delete_collection = True`.
#%%
db = CrateDBVectorSearch.from_documents(
    documents=docs,
    embedding=embeddings,
    collection_name=COLLECTION_NAME,
    connection_string=CONNECTION_STRING,
    pre_delete_collection=True,
)
#%%
docs_with_score = db.similarity_search_with_score("foo")
#%%
docs_with_score[0]
#%% md

amotl commented 11 months ago

We may have been able to reproduce the flaw while bringing in corresponding software tests for the accompanying Jupyter Notebooks.

pytest -k "notebook and vector"

amotl commented 11 months ago

> When using pre_delete_collection=True, there is only an error stating "Collection not found".

Indeed, this is the only occurrence of logger.warning within pgvector. As such, it feels a bit like a stray log item, but c'est la vie.

$ ag "warning.*collection not found"

libs/langchain/langchain/vectorstores/pgembedding.py
219:                self.logger.warning("Collection not found")

libs/langchain/langchain/vectorstores/pgvector.py
189:                self.logger.warning("Collection not found")

> [...] the actual collection is not deleted / emptied.

Will have to be investigated. Can you check again?
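
One possible reading of the warning, offered as an assumption and not something confirmed in this thread: the pre-delete step logs "Collection not found" whenever no collection with that name exists yet, for example on the very first run. The sketch below illustrates that pattern with a plain in-memory dictionary. The names `collections` and `delete_collection` are hypothetical stand-ins and are not the code in pgvector.py.

```python
import logging

logger = logging.getLogger("vectorstore_sketch")

# Hypothetical in-memory stand-in for the database: collection name -> documents.
collections: dict[str, list[str]] = {}


def delete_collection(name: str) -> bool:
    """Delete a collection; warn and return False when it does not exist."""
    if name not in collections:
        # Same message as found by the search above. Nothing is deleted
        # because there is nothing to delete.
        logger.warning("Collection not found")
        return False
    del collections[name]
    return True


# First run: no collection exists yet, so only the warning is emitted.
first = delete_collection("state_of_the_union_test")

# Later run: the collection exists and is actually removed.
collections["state_of_the_union_test"] = ["doc-1", "doc-2"]
second = delete_collection("state_of_the_union_test")
```

Under this reading, the warning on its own would not indicate that deletion is broken; whether the real store behaves this way still needs checking against an existing collection.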

amotl commented 11 months ago

> [...] the actual collection is not deleted / emptied.

Will have to be investigated.

By using the standalone example program cratedb-langchain-pre-delete-collection.py, you can verify that the "Result count" output differs when the pre_delete_collection=True line is disabled.

You may need to invoke the program a few times with and without the line to see the difference. I guess this demonstrates it works well?

amotl commented 10 months ago

@andnig just reported GH-11, which may be related to this one?

amotl commented 10 months ago

Hi again. Unless there are any objections, let's consider this fixed?