dstk-labs / dstk

Data science toolkit for easy model registries, deployment, and monitoring
GNU General Public License v3.0

[Model Registry] planning #2

Closed: wetherc closed 10 months ago

wetherc commented 1 year ago

cc @StephenODea54

I'm thinking to get the registry started it'll be

I'm going to be lazy and maybe not include authn/authz quite yet? Obviously need to circle back to it. (Maybe something like Ory Kratos? <https://github.com/ory/kratos> IDK if that's too heavyweight, but I don't want to roll our own.)

Your thoughts on the frontend bits and bobs? Feel free to start on the boilerplate setup for that stuff if you get bored.

I'll probably get stuff into Podman or Skaffold to spin up dev environments as we start stubbing out ⬆️. We probably don't want k8s as our default deployment target but it makes life easy enough with minikube for local development.

StephenODea54 commented 1 year ago

100% down for postgres and Apollo. I'm unfamiliar with the Kubernetes and authentication stuff, but I trust your judgement. I also agree that we should not roll our own auth solution.

As for the frontend piece, I can definitely push the boilerplate pieces. Is there anything you're strongly against or is there anything you have a high preference for? I've been eyeing shadcn because it's been picking up a lot of traction in the UI space, but I'm fine with a more traditional component library like MUI.

Otherwise, I think that's good to start out with. The Apollo client will serve as the primary state management solution until further notice. When we get to that point, I really like zustand or jotai but am totally cool with a "tried and true" method like Redux.

wetherc commented 1 year ago

Kubernetes is a bear. For local dev, minikube is pretty easy though and I just personally am not in love with docker compose-ing everything.

Totally deferring to you on the frontend stuff. For component library, I'd just suggest picking something decently well-established and with clear mid- to long-term support over something really fresh. But you're more plugged into that space than I have been for a while, so your call there.

Agree that once we get to the point where we need it, jotai > redux

wetherc commented 1 year ago

Just want to give a bit more flavor text around what I'm planning for next steps. Broadly, I think order of operations is:

Any thoughts on all that? Make sense what I'm thinking? Any questions or junk I'm forgetting about? Also holler if there are any pieces of the above that you'd like to tackle

wetherc commented 1 year ago

also we'll get to play with federated Apollo graphs if you haven't had the chance to before: https://www.apollographql.com/docs/federation/

StephenODea54 commented 1 year ago

All sounds good to me! Couple of questions and comments:

Things I can start on right away:

wetherc commented 1 year ago

Why the several different postgres databases?

So for our use case, from a purely practical perspective, I have zero expectation that it will make a difference. Maybe the one tangible benefit I could point to is better data isolation across applications, but that's kind of tenuous. In the general sense, partitioning the data off in this way sets us up a little better if/when there's a legitimate use case to start sharding things. Again though, I doubt it'll ever hit that scale, but it costs nothing and might save a small headache for future us. For more on adventures in sharding, Notion has some good blog posts: Herding elephants and The Great Re-Shard

[Edit to add: this approach doesn't circumvent the eventual theoretical need for sharding, but it does substantially increase the runway we have before that becomes a legitimate concern. Given that different applications can have heterogeneous usage patterns from a database perspective, it gives us much more flexibility to scale the different databases independently of one another versus having to throw a boatload of extra hardware at everything all at once because a single application's database is resource intensive. Instead, we can move that one offending db over to its own host and start scaling that infrastructure independently of the remaining, more lightweight dbs.]

[Edit the second: From a user-facing perspective, it is also easier to develop against. Think of cases you might have run into in work contexts where application databases or data warehouses contained hundreds or >1000 tables. Logically partitioning tables into separate application databases keeps the number of tables anyone has to consider at any one time fairly manageable, even as the project continues to grow in complexity.]
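
A rough sketch of the "move one offending db to its own host" idea, in case it helps make the point concrete. Everything here is an assumption for illustration: the `DbConfig` shape, the `connectionUrl` and `moveToHost` helpers, the host names, and the `dtsk_metadata` / `dtsk_deployments` database names are all hypothetical. Only `dtsk_registry` and `dtsk_user` are names that come up in this thread.

```typescript
// Hypothetical per-application database settings. Hosts, ports, and helper
// names are illustrative assumptions, not taken from the actual repo.
interface DbConfig {
  host: string;
  port: number;
  database: string;
}

type AppName = "registry" | "user" | "metadata" | "deployments";

// To start with, every logical database lives on the same Postgres host.
const SHARED_HOST = "localhost";

const databases: Record<AppName, DbConfig> = {
  registry: { host: SHARED_HOST, port: 5432, database: "dtsk_registry" },
  user: { host: SHARED_HOST, port: 5432, database: "dtsk_user" },
  metadata: { host: SHARED_HOST, port: 5432, database: "dtsk_metadata" },
  deployments: { host: SHARED_HOST, port: 5432, database: "dtsk_deployments" },
};

// Build a connection URL for a single application's database.
function connectionUrl(
  app: AppName,
  configs: Record<AppName, DbConfig> = databases,
): string {
  const { host, port, database } = configs[app];
  return `postgres://${host}:${port}/${database}`;
}

// Return a copy of the config with one application's database pointed at a
// different host. The other applications' entries are left unchanged.
function moveToHost(
  configs: Record<AppName, DbConfig>,
  app: AppName,
  host: string,
): Record<AppName, DbConfig> {
  const next = { ...configs };
  next[app] = { ...configs[app], host };
  return next;
}

// Example: the registry database becomes resource intensive, so only it moves.
const scaled = moveToHost(databases, "registry", "registry-db.internal");
console.log(connectionUrl("registry", scaled));
console.log(connectionUrl("user", scaled));
```

The point of the sketch is that relocating one database is a single config entry change. With one shared database, the same move would mean splitting tables apart first.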

The ORM I mentioned a while ago was [...]

hmmmm okay neither's at a 1.0 release yet. Might stick with objection since, even lacking active maintenance, it's still pretty solid for the time being. But let's keep an eye on both of those and start noodling over a migration plan ✨eventually✨

I can start on drafting an ER diagram

Sounds good! I have some thoughts that I'll probably add once you get a draft up and running, but it'll mostly be nitpicky stuff.

StephenODea54 commented 1 year ago

@wetherc So I see that we have the sql patches and tables ready for dtsk_registry (besides the transaction log) and dtsk_user. Do you want me to get started on the metadata and deployments pieces?

wetherc commented 1 year ago

Lol sorry I've been bored. I think the DB is good to go now for us to begin in on the GQL queries/mutations

StephenODea54 commented 10 months ago

Going to get started on creating the graphql schema for the registry_model_versions table

wetherc commented 10 months ago

I'm going to close this out in favor of more recent subtasks