OoriData / OgbujiPT

Client-side toolkit for using large language models, including where self-hosted
Apache License 2.0
101 stars 8 forks source link

Making the vector DB table metadata fields more consistent #79

Closed uogbuji closed 3 months ago

uogbuji commented 4 months ago

It's been annoying me for a while that each of the metadata fields for our vector DB are a bit different:

The ones in pgvector_data_doc use:

    tags TEXT[]                             -- tags associated with the chunk

And pgvector_message uses:

    metadata JSON                             -- additional metadata of the message

I think this second pattern is the one we should be using, so in effect it would just be an update to ogbujipt.embedding.pgvector_data_doc.

@choccccy & @olajiide pls create a branch & eventually PR off this ticket for this one. Obvs it will require changes to the test suite, as well.

Investigate Migrations

I think it may be time we start to publish migrations for our DB changes. What that means is that it's some sort of simple script an admin using our PGVector features can run after a DB schema has changed, in order to have semantically the same data, but in the updated schema.

I haven't done a ton of DB admin myself, bu a colleague of mine used to use the Alembic/SQLAlchemy migration facilities for this. SQLAlchemy is an object-relational mapping (ORM), but from what I understand, you don't actually need to use the ORM to use its migration features. I think it works at the raw SQL level.

Anyway, could you guys do some investigation/learning on this topic and see how we might want to approach migration scripts, e.g. from <=0.8.0 to 0.9.0 release of OgbujiPT?

If Alembic doesn't seem suitable, here are other standalone DB migration tools I found after some searching:

Great high level overview: "Get Your Database Under Version Control"

Handy article on low-level SQL migrations (don't think they ever published a pt 2)

choccccy commented 4 months ago

We are currently proceeding by just removing anything complicated related to tagss (search by tags) and assuming that if the user wants that sort of more advanced functionality, that they'll build it themselves by referencing the metadata

uogbuji commented 4 months ago

On reconsideration, maybe this change will be simple enough (and with a lean enough footprint) that we don't need anything more than a quick Python script to forward-only migrate? I still think we should start to educate ourselves & prepare for how we'll handle this in future.

choccccy commented 4 months ago

right now testing and the demo notebook needs fixing

uogbuji commented 4 months ago

Discussion with @choccccy perhaps sheds a light on why we had this inconsistency in the first place. The more rigid structure (in the data & doc classes) allowed the conjunctive or subjunctive tag selection as part of query. I think the best way to restore this will be a library of callbacks to specialize the query by tags, for example, to match a top-level key of page 1:

page_one_filt = match_any('page', 1, cast='int')
result = await DB.search(text='Hello world!', meta_filter=page_one_filt)

Behind the scenes match_any would be used by the DataDB class to update the SQL query in injection-safe fashion. I think it adds a bit of syntactic bitterness in exchange for far more generalization/flexibility. As @choccccy already suggested, this would also eliminate the need for the DocDB class.

@choccccy I know this is tricky, and I'm happy to implement, if it makes sense. We don't need it straight out of the gates, so you can just simplify as we'd already discussed, and I'll come back and add the meta_filter system via a new PR.

Also, since the branch was created without a direct link to this ticket, I'll mention that https://github.com/OoriData/OgbujiPT/tree/normalize-metadata is meant to close this.

uogbuji commented 4 months ago

Recommend pulling in main now that #78 has landed.

choccccy commented 4 months ago

testing sorted as of commit 05db4b5, on to sorting out the jupyter notebook

probably acceptable to call this done enough for 0.9.0 release, and i can finish fixing the notebook soon™

choccccy commented 4 months ago

ok! i think that's everything. starting a PR

uogbuji commented 3 months ago

We probably also need to annotate some sort of version metadata on the tables we create. Should be able to use table comments, though we might also need to accept that if a DBA has chosen table comments of their own, we can't automatically do a version check for them.

uogbuji commented 3 months ago

Interesting, but probably not quite our use-case (more for a rigorous audit mindset): Bemi

uogbuji commented 3 months ago

Migrations bits moved to #82

uogbuji commented 3 months ago

OK, now that we have an example of basic metadata criteria check, I'll close this. As more use cases come to light I can expand the repertoire.