OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0
101 stars 8 forks source link

The insert_doc_table method of PGvectorConnection in ogbujipt.embedding_helper doesn't take advantage of native string escaping #45

Closed chimezie closed 10 months ago

chimezie commented 10 months ago

The PG Vector embedding helper is very helpful for inserting corpora embeddings into Postgres instances and I have been using it for this purpose. However, it doesn't seem to able to handle text such as:

Structure of visceral layer of Bowman's capsule is a glomerular capsule structure and a structure of epithelium.

In particular, it looks like the apostrophe is an issue causing the raising of the PostgresSyntaxError error:

File [..]ogbujipt/embedding_helper.py:208, in PGvectorConnection.insert_doc_table(self, table_name, content, permission, title, page_numbers, tags) 205 SQL_page_numbers = "{{{}}}".format(", ".join(map(str, page_numbers))) 206 SQL_tags = "{{{}}}".format(", ".join(map(str, tags))) --> 208 await self.conn.execute( 209 INSERT_DOC_TABLE.format( 210 table_name = table_name, 211 embedding = content_embedding, 212 content = content, 213 permission = permission, 214 title = title, 215 page_numbers = SQL_page_numbers, 216 tags = SQL_tags 217 ) 218 ) File [..]asyncpg/connection.py:317, in Connection.execute(self, query, timeout, *args) 314 self._check_open() 316 if not args: --> 317 return await self.protocol.query(query, timeout) 319 , status, _ = await self._execute( 320 query, 321 args, (...) 324 return_status=True, 325 ) 326 return status.decode() File [..]/asyncpg/protocol/protocol.pyx:338, in query() PostgresSyntaxError: syntax error at or near "s"

Looking at the way that method is implemented against asyncpg's usage documentation, which shows how SQL query parameters can be sent to the connection execute method (a bonus step towards protection against SQL injection), it seems the insert_doc_table method should be modified to execute parameterized postgres queries to handle situations like this rather than naive, python string formatting of the SQL INSERT query

uogbuji commented 10 months ago

Yeah, we absolutely should not be using basic Python string formatting to construct queries. I still haven't had a chance to code review the PGVector bits, so good catch.

uogbuji commented 10 months ago

@chimezie you working on this, or do you want me to have a go?

chimezie commented 10 months ago

Can you give it a go? I tried the change below earlier:

diff --git a/pylib/embedding_helper.py b/pylib/embedding_helper.py
index 620842d..0df8c3c 100644
--- a/pylib/embedding_helper.py
+++ b/pylib/embedding_helper.py
@@ -72,14 +72,7 @@ INSERT INTO {table_name} (
     title,
     page_numbers,
     tags
-) VALUES (
-    '{embedding}',
-    '{content}',
-    '{permission}',
-    '{title}',
-    '{page_numbers}',
-    '{tags}'
-);\
+) VALUES ($1, $2, $3, $4, $5, $6)
 '''

 SEARCH_DOC_TABLE = '''\
@@ -199,24 +192,14 @@ class PGvectorConnection:
             raise ConnectionError('Connection to database is closed')

         # Get the embedding of the content as a PGvector compatible list
-        content_embedding = list(self._embedding_model.encode(content))
+        content_embedding = self._embedding_model.encode(content)

         # Get the page numbers and tags as SQL arrays
         SQL_page_numbers = "{{{}}}".format(", ".join(map(str, page_numbers)))
         SQL_tags = "{{{}}}".format(", ".join(map(str, tags)))
+        await self.conn.execute(INSERT_DOC_TABLE.format(table_name=table_name), content_embedding.tolist(), content,
+                                permission, title, SQL_page_numbers, SQL_tags)

-        await self.conn.execute(
-            INSERT_DOC_TABLE.format(
-                table_name = table_name,
-                embedding = content_embedding,
-                content = content,
-                permission = permission,
-                title = title,
-                page_numbers = SQL_page_numbers,
-                tags = SQL_tags 
-                )
-            )
-        
     async def search_doc_table(
             self,
             table_name: str,

But still get:

File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/ogbujipt/embedding_helper.py", line 200, in insert_doc_table await self.conn.execute(INSERT_DOC_TABLE.format(table_name=table_name), contentembedding.tolist(), content, File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 319, in execute , status, _ = await self._execute( File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 1659, in execute result, = await self.execute( File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 1684, in execute return await self._do_execute( File "/home/chimezie/miniconda3/envs/OgbujiPT-py-3.10/lib/python3.10/site-packages/asyncpg/connection.py", line 1731, in _do_execute result = await executor(stmt, None) File "asyncpg/protocol/protocol.pyx", line 183, in bind_execute File "asyncpg/protocol/prepared_stmt.pyx", line 197, in asyncpg.protocol.protocol.PreparedStatementState._encode_bind_msg asyncpg.exceptions.DataError: invalid input for query argument $1: [0.0005201358580961823, -0.0375559926033... (expected str, got list)

And

asyncpg.exceptions.DataError: invalid input for query argument $1: array([ 5.20135858e-04, -3.75559926e-02,... (expected str, got ndarray)

When I directly pass in the instance of numpy.ndarray directly output from encoding the text (using the 'all-MiniLM-L12-v2' embedded model ), even though this suggests that asyncpg should be able to handle lists as parameterized query arguments. I don't see a vector in the list of PostgreSQL types in that table, but the original code is able to insert embeddings of text without characters needing escaping, so I'm not sure at this point, and I have done a decent amount of digging.

chimezie commented 10 months ago

Installing pgvector-python and using its register_vector method seemed to do the trick. Once I registered the vector with the DB connection, I was able to call a parameterized execute method on the connection, passing in the arguments directly (including the nparray embedding of the text), removing the tag and page number escaping, etc.

I had to implement it separately and can post a diff here or put this solution in a separate branch with a test case.

chimezie commented 10 months ago

As a separate note/bug/feature request, PGvectorConnection should have an insert_doc_tables (plural) method that leverages the connection.executemany(..) to efficiently insert multiple embeddings at once, which I imagine will be the primary use case.

uogbuji commented 10 months ago

I had to implement it separately and can post a diff here or put this solution in a separate branch with a test case.

Hey, sorry, just sitting down to look at this now. Sure, could you create a new branch off this ticket? I actually I just did that. Could you do:

git fetch origin
git checkout 45-pgvector-insert

And commit those changes there? I'll review, and maybe add to the test cases. Thanks! Also, sounds like we have a new dependency? I think we might want now to have a requirements-pgvector.txt

uogbuji commented 10 months ago

Working on this, and I'm thinking of refactoring PGvectorConnection. ISTM it is mainly encapsulating 3 things: PG connection, embedding_model & table name. The latter is kinda submarined by being a required parameter for each call, and that was the first thing that I found itchy.

I think what I want is to rename PGvectorConnection to PGvectorHelper (reluctantly doing the HumpCase) be initialized with all three of these params, but as a convenience for the case where you *do** want to share an embedding & connection but have multiple table names, etc., have a new_helper() method which allows you to create the same setup, but with a different table name.

Just jotting down the idea for any feedback, esp. from @choccccy & @chimezie. If no objections I'll finish up in the morning.

chimezie commented 10 months ago

That would be a good idea.

chimezie commented 10 months ago

The changes look good to me. There was one issue with docDB.insert_docs() that I fixed, tested, and pushed to be4ce30. Seems to be working with that fix

kaischuygon commented 10 months ago

i added a job to .github/workflows/yaml that spins up a pgvector docker container that allows the testing suite to connect to a mock pgvector database. I added a simple mock test to test_pgvector.py that mirrors the actions in test_text_w_apostrophe.ipynb. It's failing currently, but other tests still pass as expected. I'm leaving test_PGv_embed_poem() alone for now, will take another look at it tomorrow.

kaischuygon commented 10 months ago

Closing now that @choccccy and I are done and the tests are passing with new changes.