OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0
102 stars 8 forks source link

The PGVector database search used with query_tags is not working #55

Closed chimezie closed 11 months ago

chimezie commented 1 year ago

The query_tags keyword argument to the .search([..]) function isn't intuitive to use. It is an array of TEXT fields. However, if you try:

await pacerDB.search(".. some search string ..", query_tags='some_tag')`

You get:

UndefinedColumnError: column "some_tag" does not exist

Looking at the table structure and the way the SELECT query WHERE clause is put together, it looks like this keyword argument is directly used as a boolean-evaluating expression, which is not obvious from the documentation and seems unsafe to be directly executed in the SQL query. The documentation of that parameter suggests it should just be a list of tags and the function should construct the appropriate WHERE clause, using any of the filter expressions described here, and taking the opportunity to safely incorporate it into the conjunctive expression in the executed SQL query.

So, I would expect to be able to do something like

await pacerDB.search(".. some search string ..", query_tags=['tag1', 'tag2', 'tagN'])`

chimezie commented 12 months ago

FYI. I was able to implement disjunctive semantics for comparing against an array of text (i.e., the query_tags argument w/ conjunctive = False) via this syntax:

WHERE 'tag 1' = ANY (tags) OR 
      'tag 2' = ANY (tags) OR 
      'tag 3' = ANY (tags)

This example is for 3 tags ('tag 1', 'tag 2', and 'tag 3'), but it works for an arbitrary number of tags where each comparison is separated by OR

uogbuji commented 12 months ago

@chimezie just checking for context. Osi pushed a test case in #57 this morning which already handles disjunctive semantics (he says—I haven't had a chance to review yet). Is your comment pursuant to seeing his changes, and are you proposing something different? I just asked him & he was also unclear, so I thought I'd check.

chimezie commented 12 months ago

Ahh.. The comment was without having seen that implementation. I'll swap it in.

uogbuji commented 12 months ago

Ahh.. The comment was without having seen that implementation. I'll swap it in.

Cool! Yep, Osi says the PR is definitely ready for our review, so if you have a moment, pls go ahead. 😊

chimezie commented 12 months ago

Ahh.. The comment was without having seen that implementation. I'll swap it in.

Cool! Yep, Osi says the PR is definitely ready for our review, so if you have a moment, pls go ahead. 😊

Ok. I just pushed a test case for the disjunctive/OR semantics that is failing (provides more than one tag for disjunctive evaluation).

uogbuji commented 12 months ago

This gave me a chance to review the pgvector test setup, so that's good.

Disjunctive eval is working fine. The reason why your added test isn't working, @chimezie is that you didn't specify limit, and the default limit for the search method is 1. After the following, test passes:

-    sim_search = await vDB.search(query_string='Text', query_tags=['tag1', 'tag3'], conjunctive=False)
+    sim_search = await vDB.search(query_string='Text', query_tags=['tag1', 'tag3'], conjunctive=False, limit=1000)

So the question I have is why we're setting a default limit of 1. @choccccy do you happen to know? Any reason not to make the default unlimited?

chimezie commented 12 months ago

Disjunctive eval is working fine. The reason why your added test isn't working, @chimezie is that you didn't specify limit, and the default limit for the search method is 1. After the following, test passes:

Got it. Sounds good

choccccy commented 11 months ago

So the question I have is why we're setting a default limit of 1. @choccccy do you happen to know? Any reason not to make the default unlimited?

I set a one limit since unlimited might return massive lists of records, so one seemed to be the least arbitrary number to limit that possibility. I believe that the fetched list of records is now returned as a generator, so upping that limit (or completely unlimiting it) should be fine now.

choccccy commented 11 months ago

tag-filtered search should be correctly implemented as of the merge from Pgvector bigdata insert and search and released in version 0.7.0

uogbuji commented 11 months ago

So the question I have is why we're setting a default limit of 1. @choccccy do you happen to know? Any reason not to make the default unlimited?

I set a one limit since unlimited might return massive lists of records, so one seemed to be the least arbitrary number to limit that possibility. I believe that the fetched list of records is now returned as a generator, so upping that limit (or completely unlimiting it) should be fine now.

Yes. I think we dovetailed in the end.