Closed: wetherc closed this issue 10 months ago
100% down for postgres and Apollo. I'm unfamiliar with the Kubernetes and authentication stuff, but I trust your judgement. I also agree that we should not roll out our own auth solution.
As for the frontend piece, I can definitely push the boilerplate pieces. Is there anything you're strongly against or is there anything you have a high preference for? I've been eyeing shadcn because it's been picking up a lot of traction in the UI space, but I'm fine with a more traditional component library like MUI.
Otherwise, I think that's good to start out with. The Apollo client will serve as the primary state management solution until further notice. When we get to that point, I really like zustand or jotai but am totally cool with a "tried and true" method like Redux.
Kubernetes is a bear. For local dev, minikube is pretty easy though, and I just personally am not in love with docker compose-ing everything.
Totally deferring to you on the frontend stuff. For component library, I'd just suggest picking something decently well-established and with clear mid- to long-term support over something really fresh. But you're more plugged in to that space than I have been for a while so your call there
Agree that once we get to the point we need it, jotai > redux
Just want to give a bit more flavor text around what I'm planning for next steps. Broadly, I think order of operations is:
`docker-entrypoint.sh` script or add it somewhere in the manifest. It's overengineered and 100% overkill, but it's not a pattern that has a ton of additional overhead and it scales gracefully. (See, e.g., Why does Phabricator need so many databases?) Basically though, I'm looking at these databases as a starting point:
- `dstk_metadata`, primarily for patch info and eventually for other odds and ends;
- `dstk_registry`, for registered model metadata;
- `dstk_deployments`, as a placeholder for whatever we decide to do about deployments eventually;
- `dstk_user`, for user account info;
- `dstk_policy`, for user-defined access policies.

Then use `dstk_metadata` to figure out the current patchset for all table schemata and apply new patches if needed.

For the registry, the tables would be:

- `registry_storage_providers` for blob storage deets
- `registry_models` for the model metadata
- `registry_model_versions` for the model version metadata
- `registry_transactions` for an auditable log of all CRUD operations performed

And the GraphQL queries/mutations:

- `createModel`
- `createModelVersion`
- `listModels`
- `listModelVersions`
- `createStorageProvider`
- `editStorageProvider`
- `deleteStorageProvider`
- `archiveModel`
- `archiveModelVersion`
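For the docker-entrypoint piece, here's a rough sketch of how the database list could get stamped out. The helper name and the idea of generating SQL to feed an init script are my assumptions, not settled tooling:

```typescript
// Hypothetical helper: emit one CREATE DATABASE statement per application
// db, e.g. to be piped into psql from a docker-entrypoint init script.
const DSTK_DATABASES = [
  "dstk_metadata",
  "dstk_registry",
  "dstk_deployments",
  "dstk_user",
  "dstk_policy",
];

function createDatabasesSql(): string {
  return DSTK_DATABASES.map((db) => `CREATE DATABASE ${db};`).join("\n");
}

console.log(createDatabasesSql());
```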
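And a sketch of the `dstk_metadata` patch bookkeeping idea: compare the patches already recorded as applied against the patch files that exist, then apply the gap in order. The `Patch` shape and function name are illustrative assumptions, not the real schema:

```typescript
// Illustrative types only; the actual patch table schema is TBD.
interface Patch {
  id: number;
  sql: string;
}

// Given ids already recorded in dstk_metadata and all patches on disk,
// return the ones still to apply, oldest first.
function pendingPatches(appliedIds: number[], available: Patch[]): Patch[] {
  const done = new Set(appliedIds);
  return available
    .filter((p) => !done.has(p.id))
    .sort((a, b) => a.id - b.id);
}
```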
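For the queries/mutations, a very rough SDL sketch as a plain string that could be handed to Apollo later. Every field shape below is a placeholder assumption; only the operation names come from the list:

```typescript
// Rough GraphQL SDL for the registry operations. All field shapes here are
// placeholder assumptions, not settled schema.
const registryTypeDefs = /* GraphQL */ `
  type Model {
    id: ID!
    name: String!
    archived: Boolean!
  }

  type ModelVersion {
    id: ID!
    modelId: ID!
    version: Int!
    archived: Boolean!
  }

  type Query {
    listModels: [Model!]!
    listModelVersions(modelId: ID!): [ModelVersion!]!
  }

  type Mutation {
    createModel(name: String!): Model!
    createModelVersion(modelId: ID!): ModelVersion!
    createStorageProvider(name: String!, bucket: String!): ID!
    editStorageProvider(id: ID!, bucket: String): ID!
    deleteStorageProvider(id: ID!): Boolean!
    archiveModel(id: ID!): Model!
    archiveModelVersion(id: ID!): ModelVersion!
  }
`;
```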
Any thoughts on all that? Does what I'm thinking make sense? Any questions or junk I'm forgetting about? Also holler if there are any pieces of the above that you'd like to tackle.
also we'll get to play with federated Apollo graphs if you haven't had the chance to before: https://www.apollographql.com/docs/federation/
All sounds good to me! Couple of questions and comments:
Things I can start on right away:
`model_registry` db.

> Why the several different postgres databases?
So for our use case, from a purely practical perspective, I have zero expectation that it will make a difference. Maybe the one tangible benefit I could point to is better data isolation across applications, but that's kind of tenuous. In the general sense, partitioning the data off in this way sets us up a little better if/when there's a legitimate use case to start sharding things. Again though, I doubt it'll ever hit that scale, but it costs nothing and might save a small headache for future us. For more on adventures in sharding, Notion has some good blog posts: Herding elephants and The Great Re-Shard
[Edit to add: this approach doesn't circumvent the eventual theoretical need for sharding, but it does substantially increase the runway we have before that becomes a legitimate concern. Given that different applications can have heterogenous usage patterns from a database perspective, it gives us much more flexibility to scale the different databases independently of one another versus having to throw a boatload of extra hardware at everything all at once because a single application's database is resource intensive. Instead, we can move that one offending db over to its own host and start scaling that infrastructure independently of the remaining, more lightweight dbs]
[Edit the second: From a user-facing perspective, it is also easier to develop against. Think of cases you might have run into in work contexts where application databases or data warehouses contained hundreds or >1000 tables. Logically partitioning tables into separate application databases keeps the number of tables anyone has to consider at any one time fairly manageable, even as the project continues to grow in complexity]
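One concrete way the independent-scaling point could cash out in config: give each application db its own connection settings, so moving one db to its own host later is a config change rather than a refactor. Host values and names here are made up for illustration; in practice they'd come from env/config:

```typescript
// Hypothetical per-database host map (values are placeholders). Any single
// db can later point at its own host without touching the others.
const dbHosts: Record<string, string> = {
  dstk_metadata: "localhost",
  dstk_registry: "localhost",
  dstk_deployments: "localhost",
  dstk_user: "localhost",
  dstk_policy: "localhost",
};

function connectionString(db: string): string {
  const host = dbHosts[db] ?? "localhost";
  return `postgres://${host}:5432/${db}`;
}
```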
The ORM I mentioned a while ago was [...]
hmmmm okay neither's at a 1.0 release yet. Might stick with objection since, even lacking active maintenance, it's still pretty solid for the time being. But let's keep an eye on both of those and start noodling over a migration plan ✨eventually✨
I can start on drafting an ER diagram
Sounds good! I have some thoughts that I'll probably add once you get a draft up and running, but it'll mostly be nit picky stuff
@wetherc So I see that we have the sql patches and tables ready for `dstk_registry` (besides the transaction log) and `dstk_user`. Do you want me to get started on the `metadata` and `deployments` pieces?
Lol sorry I've been bored. I think the DB is good to go now for us to begin in on the GQL queries/mutations
Going to get started on creating the graphql schema for the `registry_model_versions` table
I'm going to close this out in favor of more recent subtasks
cc @StephenODea54
I'm thinking to get the registry started it'll be
I'm going to be lazy and maybe not include authn/authz quite yet? Obviously need to circle back to it. (maybe something like ory kratos? <https://github.com/ory/kratos> IDK if that's too heavyweight but I don't want to roll our own)
Your thoughts on the frontend bits and bobs? Feel free to start on the boilerplate setup for that stuff if you get bored.
I'll probably get stuff into Podman or Skaffold to spin up dev environments as we start stubbing out ⬆️. We probably don't want k8s as our default deployment target but it makes life easy enough with minikube for local development.