Closed chimezie closed 10 months ago
Yeah, we absolutely should not be using basic Python string formatting to construct queries. I still haven't had a chance to code review the PGVector bits, so good catch.
@chimezie you working on this, or do you want me to have a go?
Can you give it a go? I tried the change below earlier:
diff --git a/pylib/embedding_helper.py b/pylib/embedding_helper.py
index 620842d..0df8c3c 100644
--- a/pylib/embedding_helper.py
+++ b/pylib/embedding_helper.py
@@ -72,14 +72,7 @@ INSERT INTO {table_name} (
title,
page_numbers,
tags
-) VALUES (
- '{embedding}',
- '{content}',
- '{permission}',
- '{title}',
- '{page_numbers}',
- '{tags}'
-);\
+) VALUES ($1, $2, $3, $4, $5, $6)
'''
SEARCH_DOC_TABLE = '''\
@@ -199,24 +192,14 @@ class PGvectorConnection:
raise ConnectionError('Connection to database is closed')
# Get the embedding of the content as a PGvector compatible list
- content_embedding = list(self._embedding_model.encode(content))
+ content_embedding = self._embedding_model.encode(content)
# Get the page numbers and tags as SQL arrays
SQL_page_numbers = "{{{}}}".format(", ".join(map(str, page_numbers)))
SQL_tags = "{{{}}}".format(", ".join(map(str, tags)))
+ await self.conn.execute(INSERT_DOC_TABLE.format(table_name=table_name), content_embedding.tolist(), content,
+ permission, title, SQL_page_numbers, SQL_tags)
- await self.conn.execute(
- INSERT_DOC_TABLE.format(
- table_name = table_name,
- embedding = content_embedding,
- content = content,
- permission = permission,
- title = title,
- page_numbers = SQL_page_numbers,
- tags = SQL_tags
- )
- )
-
async def search_doc_table(
self,
table_name: str,
But still get:
File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/ogbujipt/embedding_helper.py", line 200, in insert_doc_table
    await self.conn.execute(INSERT_DOC_TABLE.format(table_name=table_name), content_embedding.tolist(), content,
File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 319, in execute
    _, status, _ = await self._execute(
File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 1659, in _execute
    result, _ = await self.__execute(
File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 1684, in __execute
    return await self._do_execute(
File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 1731, in _do_execute
    result = await executor(stmt, None)
File "asyncpg/protocol/protocol.pyx", line 183, in bind_execute
File "asyncpg/protocol/prepared_stmt.pyx", line 197, in asyncpg.protocol.protocol.PreparedStatementState._encode_bind_msg
asyncpg.exceptions.DataError: invalid input for query argument $1: [0.0005201358580961823, -0.0375559926033... (expected str, got list)
And
asyncpg.exceptions.DataError: invalid input for query argument $1: array([ 5.20135858e-04, -3.75559926e-02,... (expected str, got ndarray)
The second error occurs when I pass in the numpy.ndarray output directly from encoding the text (using the 'all-MiniLM-L12-v2' embedding model), even though asyncpg's documentation suggests it should be able to handle lists as parameterized query arguments. I don't see a vector type in asyncpg's table of supported PostgreSQL types, but the original code is able to insert embeddings of text without characters needing escaping, so I'm not sure at this point, and I have done a decent amount of digging.
Installing pgvector-python and using its register_vector method seemed to do the trick. Once I registered the vector with the DB connection, I was able to call a parameterized execute method on the connection, passing in the arguments directly (including the nparray embedding of the text), removing the tag and page number escaping, etc.
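For concreteness, a minimal sketch of what that approach could look like, assuming a live asyncpg connection to a Postgres instance with the pgvector extension; the column layout mirrors the INSERT statement in the diff above, and the `insert_doc` function name is hypothetical:

```python
# Parameterized insert via pgvector-python's asyncpg integration.
# Assumes: a running Postgres with pgvector; this table/column layout.
INSERT_DOC_TABLE = '''\
INSERT INTO {table_name} (
    embedding, content, permission, title, page_numbers, tags
) VALUES ($1, $2, $3, $4, $5, $6);'''


async def insert_doc(conn, table_name, embedding, content,
                     permission, title, page_numbers, tags):
    # Imported here so the module loads even without pgvector-python installed
    from pgvector.asyncpg import register_vector
    # Teach asyncpg to encode/decode pgvector's `vector` type, so the
    # ndarray embedding can be passed directly as a query argument
    await register_vector(conn)
    # Values travel separately from the SQL text, so no escaping of
    # apostrophes, braces, etc. is needed
    await conn.execute(INSERT_DOC_TABLE.format(table_name=table_name),
                       embedding, content, permission, title,
                       page_numbers, tags)
```

Note that only the table name still goes through `format()`; all the values are bound server-side via the `$1`..`$6` placeholders.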
I had to implement it separately and can post a diff here or put this solution in a separate branch with a test case.
As a separate note/bug/feature request, PGvectorConnection should have an insert_doc_tables (plural) method that leverages the connection.executemany(..) to efficiently insert multiple embeddings at once, which I imagine will be the primary use case.
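Roughly, the batch variant could look like this; `insert_doc_tables` and `prepare_rows` are hypothetical names, not existing OgbujiPT API:

```python
# Batch insert sketch using asyncpg's executemany, which prepares the
# statement once and binds each row of arguments in turn.
INSERT_DOC_TABLE = '''\
INSERT INTO {table_name} (
    embedding, content, permission, title, page_numbers, tags
) VALUES ($1, $2, $3, $4, $5, $6);'''


def prepare_rows(docs):
    # Order the fields of each doc dict to match the $1..$6 placeholders
    return [(d['embedding'], d['content'], d['permission'],
             d['title'], d['page_numbers'], d['tags']) for d in docs]


async def insert_doc_tables(conn, table_name, docs):
    # One round of statement preparation for the whole batch
    await conn.executemany(
        INSERT_DOC_TABLE.format(table_name=table_name), prepare_rows(docs))
```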
Hey, sorry, just sitting down to look at this now. Sure, could you create a new branch off this ticket? Actually, I just did that. Could you do:
git fetch origin
git checkout 45-pgvector-insert
And commit those changes there? I'll review, and maybe add to the test cases. Thanks!
Also, it sounds like we have a new dependency? I think we might now want a requirements-pgvector.txt
Working on this, and I'm thinking of refactoring PGvectorConnection. ISTM it is mainly encapsulating three things: the PG connection, the embedding model & the table name. The latter is kinda submarined by being a required parameter for each call, and that was the first thing that I found itchy.

I think what I want is to rename PGvectorConnection to PGvectorHelper (reluctantly doing the HumpCase), have it be initialized with all three of these params, but, as a convenience for the case where you *do* want to share an embedding model & connection but have multiple table names, etc., have a new_helper() method which allows you to create the same setup, but with a different table name.
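The idea might look roughly like this; a hypothetical sketch of the shape, not the final API:

```python
# Proposed PGvectorHelper shape: encapsulate connection, embedding model
# and table name, with new_helper() cloning the setup for another table.
class PGvectorHelper:
    def __init__(self, conn, embedding_model, table_name):
        self.conn = conn
        self._embedding_model = embedding_model
        self.table_name = table_name

    def new_helper(self, table_name):
        # Share the connection & embedding model; swap only the table name
        return PGvectorHelper(self.conn, self._embedding_model, table_name)
```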
Just jotting down the idea for any feedback, esp. from @choccccy & @chimezie. If no objections I'll finish up in the morning.
That would be a good idea.
The changes look good to me. There was one issue with docDB.insert_docs() that I fixed, tested, and pushed as be4ce30. It seems to be working with that fix.
I added a job to .github/workflows/yaml that spins up a pgvector Docker container, which allows the testing suite to connect to a mock pgvector database. I added a simple mock test to test_pgvector.py that mirrors the actions in test_text_w_apostrophe.ipynb. It's failing currently, but other tests still pass as expected. I'm leaving test_PGv_embed_poem() alone for now; I'll take another look at it tomorrow.
Closing now that @choccccy and I are done and the tests are passing with new changes.
The PG Vector embedding helper is very helpful for inserting corpus embeddings into Postgres instances, and I have been using it for that purpose. However, it doesn't seem to be able to handle text such as:
In particular, it looks like the apostrophe is the issue, causing a PostgresSyntaxError to be raised:
Looking at how that method is implemented against asyncpg's usage documentation, which shows how SQL query parameters can be passed to the connection's execute method (a bonus step toward protection against SQL injection), it seems the insert_doc_table method should be modified to execute parameterized Postgres queries to handle situations like this, rather than naively formatting the SQL INSERT query with Python string formatting.
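A minimal illustration of the difference; the query text and column name here are illustrative, not the helper's actual schema:

```python
# Contrast naive string formatting with asyncpg-style parameterization.
content = "Rita Dove's poem"  # the apostrophe breaks naive formatting

# Naive formatting embeds the value in the SQL text, so the apostrophe
# terminates the string literal early -> PostgresSyntaxError
naive = "INSERT INTO doc (content) VALUES ('{content}');".format(
    content=content)

# Parameterized form: the value travels separately from the SQL text,
# so no escaping is needed and injection is prevented
parameterized = 'INSERT INTO doc (content) VALUES ($1);'
# await conn.execute(parameterized, content)  # asyncpg binds $1 server-side
```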