estuary / flow

🌊 Continuously synchronize the systems where your data lives, to the systems where you _want_ it to live, with Estuary Flow. 🌊
https://estuary.dev

design RFC: multi-tenant data planes and catalog applies #104

Closed jgraettinger closed 3 years ago

jgraettinger commented 3 years ago

Problem

We want Flow data-planes to support multiple built catalogs, teams, and even customers simultaneously, using a shared pool of compute resources. One of the problems this raises is: how should we model distinct catalogs within a data-plane? How do we prevent collisions over named entities? How can they communicate (e.g. if two teams share data)? How do we model authorization over what resources a current user or role may access?

Constraints / Observations

The organization of catalog source files isn't meaningful. We allow a lot of flexibility in how sources may reference and import other sources, and/or be organized. Users are free to refactor sources at any time, or import sources authored by other teams. We don't want to lose that flexibility, and cannot attach meaning to the particular source file in which an entity is defined. After source loading / validation, catalog builds produce flat entity namespaces, and any solution must start from this point.

Collections are fully-qualified names. We require that collections are fully qualified wherever they're defined or referenced. This choice stems from the opinion that it's better to be explicit and unambiguous for all readers, at the expense of a bit more verbosity. It also makes refactors of catalog sources safe and easy. While we could add a "catalog name" or some-such to help identify and organize catalogs, this introduces a further effective namespace in front of the catalog's collections. We don't want to require further qualification of an already qualified name.

We desire a global collection namespace. This makes it trivially easy for collections of one team (or company!) to address the collections of another, much like any company can address the S3 objects of another (and like S3, a capability to address is not a capability to read).

Collections may move between teams over time. As team structure and membership changes, and those changes reflect into git repos and Flow projects, collections may move from one team / project to another. This should be supported seamlessly.

Names relate collections <=> journals <=> S3 files. A collection name implies the names of journals holding logical and physical partitions, and from there the paths of files which are persisted into cloud storage buckets. This means that a user can work backwards from any fragment file (or mounted external table in, e.g., Snowflake) to the collection which owns it -- a handy property. We are not targeting a capability to re-name or alias collections -- names are permanent. This in part falls out of the constraint that, e.g., S3 doesn't allow moves, but even if it could (and Gazette has some mechanisms to allow for aliases), it still doesn't seem like a good idea because it introduces ambiguity in where data came from. Our position should be that renames are solved via copying derivations.

Collections already have organizing path components. E.g. good collection names might be acmeCo/fulfillment/orders, acmeCo/fulfillment/shipments, and acmeCo/marketing/prospects. These components can qualify the company and teams (fulfillment vs. marketing, perhaps) that own data.

Directory-based authorization is well-trod and intuitive. While there are no actual directories involved here, authorization rules could be modeled in terms of longest-matched path component prefixes. My user or role is authorized to write collections under acmeCo/fulfillment, and read (but not write) acmeCo/marketing, and I have no access at all to acmeCo/financials with the exception of reads under acmeCo/financials/public.
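
A hypothetical sketch of how such longest-prefix rules could be evaluated (the rule set, capability names, and function are illustrative, not an actual Flow API):

```typescript
// Capabilities a rule may grant over a path prefix (illustrative names).
type Capability = "none" | "read" | "write";

interface Rule {
  prefix: string; // e.g. "acmeCo/financials/public"
  capability: Capability;
}

// The longest matching prefix wins, so a narrower grant (or revocation) under
// a broader prefix takes precedence -- as in the acmeCo/financials example above.
function authorize(rules: Rule[], path: string): Capability {
  let best: Rule | undefined;
  for (const rule of rules) {
    if (path.startsWith(rule.prefix)) {
      if (!best || rule.prefix.length > best.prefix.length) {
        best = rule;
      }
    }
  }
  return best ? best.capability : "none";
}

const rules: Rule[] = [
  { prefix: "acmeCo/fulfillment", capability: "write" },
  { prefix: "acmeCo/marketing", capability: "read" },
  { prefix: "acmeCo/financials", capability: "none" },
  { prefix: "acmeCo/financials/public", capability: "read" },
];

authorize(rules, "acmeCo/financials/public/reports"); // => "read"
authorize(rules, "acmeCo/financials/ledger");         // => "none"
```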

Flow entities are built into specifications that live in Etcd. Etcd is the system-of-record for the data-plane's understanding of current entity specs. A built catalog is applied to update a subset of specs in a flat namespace. Some entities, like collections & derivations, naturally work within this namespace. Others -- like built NPM packages, journal rule sets, and schema bundles -- don't, and need some kind of unambiguous identifying key.

jgraettinger commented 3 years ago

Initial Decisions

Applying catalogs

The Flow catalog build process produces comprehensive runtime specifications for Flow entities drawn from catalog sources. Today, the top-level entities are CollectionSpec, MaterializationSpec, DerivationSpec, and (soon) CaptureSpec [^1]. Each of these specs has a specific, fully-qualified name, making them globally unique and subject to path-based authorization.

Applying a built catalog is then fundamentally an operation of enumerating its specifications, authorizing the user's capability to update the specs in question, and then upserting those specifications into the data-plane's Etcd keyspace. For example, a CollectionSpec might be inserted or updated under Etcd key /flow/collections/AcmeCo/fulfillment/orders [^2].
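
A rough sketch of that enumerate → authorize → upsert shape, with hypothetical types and key prefixes (only the /flow/collections/ prefix appears above; the others are assumptions for illustration):

```typescript
// Hypothetical shapes; real specs are serialized runtime specifications.
interface BuiltSpec {
  kind: "collection" | "derivation" | "materialization" | "capture";
  name: string;     // fully-qualified, e.g. "AcmeCo/fulfillment/orders"
  spec: Uint8Array; // serialized runtime specification
}

interface Etcd {
  put(key: string, value: Uint8Array): Promise<void>;
}

// Maps a spec to its Etcd key, e.g. /flow/collections/AcmeCo/fulfillment/orders.
function etcdKey(spec: BuiltSpec): string {
  const prefixes = {
    collection: "/flow/collections/",
    derivation: "/flow/derivations/",
    materialization: "/flow/materializations/",
    capture: "/flow/captures/",
  };
  return prefixes[spec.kind] + spec.name;
}

async function applyCatalog(
  etcd: Etcd,
  specs: BuiltSpec[],
  mayWrite: (name: string) => boolean, // prefix-based authorization check
): Promise<void> {
  for (const spec of specs) {
    if (!mayWrite(spec.name)) {
      throw new Error(`not authorized to apply ${spec.name}`);
    }
    await etcd.put(etcdKey(spec), spec.spec);
  }
}
```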

[^1]: There are several others, such as ShuffleSpec and TransformSpec, but these are contained within top-level specs.
[^2]: This skips over some details, such as creating initial shards for new DerivationSpecs / MaterializationSpecs, but these are straightforward and well-understood projections.

Catalog-scoped resources & apply UUIDs

Several additional resource types (JournalRules, SchemaBundles, and TypeScript npm packages) are scoped to the built catalog as a whole -- there is no ready unique name for them. This is resolved by having the catalog apply operation generate a new UUID4 under which these resources are named and placed in the Etcd keyspace (e.g. as /flow/journalRules/12345678-1234-5678-1234-567812345678). This UUID4 is in turn referenced from named specs which were upserted in the same catalog apply.

As entities are applied and re-applied, UUID4s that are no longer referenced will be garbage-collected from Etcd.
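
A sketch of what that garbage collection could look like, assuming named specs carry a reference to the apply UUID under which their catalog-scoped resources were written (field and key names here are illustrative):

```typescript
interface NamedSpec {
  name: string;
  // UUID4 of the catalog apply whose journal rules / schema bundle /
  // npm package this spec references (hypothetical field name).
  catalogUuid: string;
}

// resourceKeys are keys like /flow/journalRules/<uuid>, /flow/schemaBundles/<uuid>, ...
function gcCatalogResources(
  specs: NamedSpec[],
  resourceKeys: string[],
  deleteKey: (key: string) => void,
): void {
  const referenced = new Set(specs.map((s) => s.catalogUuid));
  for (const key of resourceKeys) {
    const uuid = key.split("/").pop()!;
    if (!referenced.has(uuid)) {
      // No remaining spec references this apply's resources; remove them.
      deleteKey(key);
    }
  }
}
```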

Referencing Imports

A remaining question is how a user may reference a collection, but not actually apply it from a given built catalog. For example, the user may be deriving a collection that sources from a collection managed by another team, company, or Flow project. That collection is defined by a file somewhere -- a relative local file in a git repo, or a remote URL.

At present, Flow has a single import directive, which fetches files/URLs and includes the entities they define in the built catalog. The problem is that imports don't distinguish between entities which are "part of" the current project -- and should be upserted by a catalog apply -- vs entities which are imported only to resolve their definitions.

We will address this by adding a new reference directive. reference behaves identically to import in almost every respect -- sources are recursively fetched, validated, and assembled into a flat set of built entities. However, if an entity is only reachable through reference -- there is no alternative import path (which takes precedence) -- then applications of the built catalog will check the entity for existence and read authorization, and will not attempt to upsert it.
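
Concretely, the apply-time decision could look something like this sketch (the reachability bookkeeping and names are hypothetical):

```typescript
interface BuiltEntity {
  name: string;
  // How the entity became part of the build: through an import path,
  // or only through reference directives (hypothetical bookkeeping).
  reachableViaImport: boolean;
}

async function applyEntity(
  entity: BuiltEntity,
  exists: (name: string) => Promise<boolean>,
  mayRead: (name: string) => boolean,
  upsert: (entity: BuiltEntity) => Promise<void>,
): Promise<void> {
  if (entity.reachableViaImport) {
    // Part of this project: upsert it (subject to write authorization).
    await upsert(entity);
  } else {
    // Only referenced: verify it already exists and that we may read it,
    // but never attempt to create or modify it.
    if (!mayRead(entity.name) || !(await exists(entity.name))) {
      throw new Error(`referenced entity ${entity.name} is missing or not readable`);
    }
  }
}
```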

jgraettinger commented 3 years ago

Conclusions

Answering questions posed in the problem statement:

how should we model distinct catalogs within a data-plane?

We don't, at least not explicitly. We model fully-qualified entities only. As an implementation detail, these entities may in turn reference catalog-scoped resources such as SchemaBundles, but only through an opaque, unique, and garbage-collected identifier.

How do we prevent collisions over named entities?

We re-interpret named entities to live within a fully qualified, global resource namespace. Collisions are not possible because two entities with the same name are, by definition, the same entity.

How can they communicate (e.g. if two teams share data)?

Team B may source from collections of team A by

  1. Importing team A's collection definitions into their catalog through a reference directive and
  2. Simply naming the desired team A collection.

This mechanism applies across teams and even companies that co-exist in the same data plane.

How do we model authorization over what resources a current user or role may access?

We'll utilize path-component prefix authorization rules which articulate the paths an account or role is permitted to read, write, or apply. As an organizational practice in a multi-tenant environment, we'll manage the allocation of leading prefixes of the namespace, and enable recipients to in turn manage the sub-provisioning of their namespace.

Open Questions

Should import be renamed to include to further clarify its distinction from reference? This seems like a good idea. Collectively, include and reference directives would then be colloquially referred to as "imports".

How are entities, e.g. derivations, to be deleted? We still need to define what "deletion" means -- deleting a Gazette JournalSpec doesn't remove its data, and re-application of a naively deleted entity could have surprising outcomes! Likely some form of disabling + soft delete?

What role should labels play in organization and catalog applies? Gazette and Kubernetes both use labels and label selectors as foundational tools for organizing resources. We've discussed a desire to expose labels on Flow entities as well, and have noted that label selectors provide a means of specifying a set of resources to be "replaced" by a catalog apply. Importantly, that gives the user a way to express deletion (as a resource matched by a selector, which is not upserted by an applied catalog).

psFried commented 3 years ago

Overall, I'm happy with how this is shaping up. Here's some thoughts/reactions:

I agree wholeheartedly with the idea to use a directory-based authorization model once we introduce authorization as a concept. I'm a little unclear as to the specific proposal here regarding authorization. I'd like to interpret this as suggesting a good model for authorization rules, but not a plan to actually implement or model any of the authorization rules themselves just yet. Does that interpretation seem right to you?

As described here, include and reference are two different mechanisms for sharing collections within a single data plane. A separate capability, which I think will also be important, is to share collections between separate data planes. I don't think we need to design all that now, but I think it's worth a little thought in case it would impact our current modeling. It seems possible that consuming data from a separate data plane might use yet another import-like keyword (e.g. remoteReference).

Catalog-scoped resources & apply UUIDs

I like the idea of using a UUID to uniquely identify each specific apply of a catalog. I wonder if this could be taken a step further to allow for something that can be used as a log of applies. It would be useful to be able to answer questions like "when was collection X last applied?" or "when was the last time any collection was applied in this cluster?". Put another way, is there a general modeling of a sequence of apply operations that also affords looking up journal rules, schema bundles, and npm packages? Taking it a step further, I wonder if such a modeling might also provide a better framework for dealing with deletions.

Should import be renamed to include to further clarify its distinction from reference ? This seems like a good idea.

Yeah, I like this idea a lot.

dyaffe commented 3 years ago

I like this a lot. A couple of thoughts:

  1. When a team uses reference to pull from another's collection, should there be a way to have the original team understand that they now have dependencies within the organization around the shape of that collection? Knowing who else uses something is a routine need, and would help streamline the process of evolving collections.
  2. You say that collections can "move between teams over time", but can a collection be owned by multiple teams? I assume the answer is yes based on the implementation but just making sure.
psFried commented 3 years ago

When a team uses reference to pull from another's collection, should there be a way to have the original team understand that they now have dependencies within the organization around the shape of that collection?

This brings up a great point. While reference gives teams a way to depend on a collection without actually applying it, it creates the possibility for things to get out of sync. Say Team A manages CollectionA, and Team B wants to reference it. What happens when the spec of CollectionA that's currently in the cluster differs from the spec of CollectionA that Team B references? What about if Team A subsequently updates CollectionA to loosen the schema after Team B has applied their derivation that uses it?

I think these scenarios are solvable by having build/apply-time validations that account for the current state of the cluster. But these validations are trickier than what I was initially imagining. In order to account for collections managed by other teams, the validations must take into account collections from all teams whenever any team applies a collection. Otherwise it would be impossible to tell if some change to relax the schema would be incompatible with other derivations/materializations. Additionally, some complexity may be needed to maintain correctness in the face of concurrent apply commands. This all starts to make reference seem a little less attractive to me. At this point, it might be worth considering an alternative for allowing multiple teams to share a single data plane.

A possible alternative

The existing semantics could technically support this, albeit with some significant differences. Without a reference directive, users from one team would not be able to directly apply their collections to the shared cluster. You'd need one team that is responsible for the entire set of collections, and they'd have to import the yaml from all the other teams that are sharing the cluster, then apply them all at once. At first blush, this may seem kinda crappy, but this could be a legitimate case where the limitations of the current system are actually a feature, not a bug.

In order to allow multiple teams to independently apply their catalogs to a multi-tenant cluster, we could use a separate common service that sequences the apply operations from different teams by importing all the specs into a common root catalog and then applying that. An Estuary SaaS could provide such a service, and would be a natural place to implement authorization. But if someone wanted to, they could also just set up a shared git repo that imports the specs from other teams and apply everything from there.

jgraettinger commented 3 years ago

I'm a little unclear as to the specific proposal here regarding authorization. I'd like to interpret this as suggesting a good model for authorization rules, but not a plan to actually implement or model any of the authorization rules themselves just yet.

That's right. We're not building authorization right now, but do need a clear understanding of how the authZ model will work and compose into this design.

A separate capability, which I think will also be important, is to share collections between separate data planes.

I've given this thought, though I'm not 100% sure this is something we should do. In any case I don't see hang-ups, but I didn't want to confuse this design further.

I wonder if this could be taken a step further to allow for something that can be used as a log of applies.

The limb I'll go out on right now is that, whatever metadata we do want to track/expose/query, the mechanism used should be Etcd + Labels + Selectors. Etcd, which provides snapshotted total enumerations of entities. Labels for metadata annotations. Selectors for queries. Knowing that, I'm content to not design further than we need right now.

When a team uses reference to pull from another's collection, should there be a way to have the original team understand that they now have dependencies within the organization around the shape of that collection?

As above, since we can obtain snapshots of the complete data-plane from Etcd -- annotated with labels -- I believe we can build any pane of glass we want here.

You say that collections can "move between teams over time", but can a collection be owned by multiple teams?

Yea; the collection name itself doesn't change, but the authZ rules, users, and roles would change to express these relationships. E.g. two roles representing two teams both have admin rights to a path prefix. Though I'm not sure I'd recommend that!

What happens when the spec of CollectionA that's currently in the cluster differs from the spec of CollectionA that TeamB references?

It's a runtime error. We're always checking schemas at read time -- derivations and materializations always verify each document against the current source schema in the associated catalog schema bundle. This situation isn't any different from, e.g., a collection schema drifting too far over time vs already-written historical data, or a derivation using a source schema that's incorrect w.r.t. a less-restrictive source.

If your catalog sources are all accurate -- and the garden path is to use shared sources which assure this -- we can provide build-time guarantees. If they're not, we can still provide runtime guarantees and keep you from writing bad data.

jgraettinger commented 3 years ago

Put another way, we can't build a mechanism where team B can block team A from applying updates to collections that team A controls. They're team A's, they can do what they like, up to and including breaking B's stuff.

But we can provide tools to make it easy for A & B to cooperatively decide to not break things. For example, by running out of a common git repo and having CI builds test inter-op. But we can't presume there's any kind of relationship between teams A & B.

I'm also pretty sure it's not possible :) because it boils down to identifying the set theory relationships between the universe of possible documents accepted by two arbitrary JSON schemas. Not to mention breaking changes to semantics which don't bubble up into schema!

psFried commented 3 years ago

It's a runtime error. We're always checking schemas at read time

The specific concern here is that I don't think we can guarantee that this will be caught by schema validation, or that there will be a runtime error at all. You might simply get invalid data! I'll try to describe an example of what I mean here. Say we have:
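
A minimal sketch of the kind of setup in question (hypothetical collection, field, and function names):

```typescript
// Team A's collection schema at the time Team B builds their derivation,
// expressed here as the generated TypeScript document type.
interface OrderV1 {
  id: string;
  intProperty: number; // a required integer at B's build time
}

// Team B's derivation lambda, compiled against that type. The TypeScript
// compiler accepts this because intProperty is declared as always present.
function publishTotal(doc: OrderV1): { id: string; doubled: number } {
  return { id: doc.id, doubled: doc.intProperty * 2 };
}

// Later, Team A relaxes their schema so intProperty may be absent. If B's
// derivation were validating against A's *new* schema, a document like this
// would pass validation...
const relaxedDoc = { id: "order-1" } as unknown as OrderV1;

// ...and B's lambda would emit NaN rather than failing -- invalid data, not
// a runtime error. (Pinning B's source schema, as discussed below, turns this
// into a validation error instead.)
publishTotal(relaxedDoc); // => { id: "order-1", doubled: NaN }
```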

I 100% agree that we cannot provably guarantee that any relaxation of an upstream schema can't break a downstream derivation. But I do think we can and should prevent scenarios like the one I described. When you build and apply the entire catalog, the typescript compiler gives you a "pretty good" check that things will work as expected at runtime. If we allow derivations to read from collections that may be updated independently, then using reference basically constitutes an "opting out" of this type checking whenever the referent schema is updated. IMO, it's worth taking a look at alternatives that allow us to maintain the same level of type checking you get from builds that don't use reference.

jgraettinger commented 3 years ago

In your example, TypeScript would require you to specify what happens if intProperty is undefined. It won't let you presume it's there, unless it's marked as required, in which case it would be a validation error. And if team A updates schema and sets an explicit null, that would also be a validation error of B's transform, which continues to run with (and verify) the schema as it existed at the time that B's collection was applied.

jgraettinger commented 3 years ago

This is maybe a helpful way to think about it: every derivation, under the hood, has a source schema that it's always verifying. As sugar, we do allow you to omit specifying it, but it's there and it's always tied to the catalog as it existed when the derivation was updated. Team A controls the schema of their collection, but B controls the source schema of B's derivation that reads it.
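
A small sketch of that read-path behavior, with a hypothetical validator standing in for Flow's JSON-schema validation:

```typescript
// Hypothetical stand-ins: a compiled schema validator and B's typed lambda.
type Validator = (doc: unknown) => boolean;

function processSourceDocument<T>(
  doc: unknown,
  pinnedSourceSchema: Validator, // B's source schema, fixed when B's catalog was applied
  lambda: (doc: T) => unknown,
): unknown {
  // Each source document is verified against the schema as it existed when
  // B's derivation was applied -- not against A's current collection schema.
  if (!pinnedSourceSchema(doc)) {
    throw new Error("source document fails the derivation's pinned source schema");
  }
  return lambda(doc as T);
}
```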

psFried commented 3 years ago

it's there and it's always tied to the catalog as it existed when the derivation was updated

lol I was literally just typing this up as a suggestion, unaware that it was already getting persisted in that way. Yeah, I see how that would work to at least guarantee that you get a schema validation error as opposed to invalid data.

That said, I'm still finding it difficult to convince myself that we really need or want reference. Now that I've thought through the alternative of what it might look like to integrate multiple teams without it, it really doesn't seem like it's buying us a whole lot, given that you'd be giving up at least some measure of type checking in the tradeoff. A service that sequences all catalog updates seems like it could still allow multiple teams to collaborate effectively, while retaining the same level of type checking you get when you build/apply the entire catalog in one go.

Maybe another angle is to ask the question, "what should it look like when you want to make a potentially breaking change to a collection?" As it is, we'll have no automated way of knowing whether a schema change might break something before it's applied. You have to just apply the change and wait to see if anything stops working. If we copy the semantics of popular programming languages there, then dependencies would not be updated automatically, thus giving confidence that your pipeline will continue to run until you intentionally update it. Another thought is that perhaps we should think about modeling "versions" of collections in flow, in a way that allows derivations to treat v1 of a source collection separately from v2. Such an approach may allow for migrations between major versions without downtime or runtime errors.

jgraettinger commented 3 years ago

A service that sequences all catalog updates seems like it could still allow multiple teams to collaborate effectively, while retaining the same level of type checking you get when you build/apply the entire catalog in one go

I don't know what "a service that sequences all catalog updates" means. Sticking with your programming metaphor, it sounds isomorphic to requiring that all tenants of a data-plane use a mono repo with a mono release process. Which can be a good strategy! An org might choose to use a single repo / CI pipeline / credentialed role doing CD catalog applies. Then, as you say, they don't need reference.

That breaks down fast as soon as different teams owning different collections want autonomy over, say, release cadence for their respective data products. Even within a monorepo / shared CI / common CD workflow I still need reference to articulate what is actually being updated by a given apply invocation.

It also breaks down if there are org restrictions over who can access what. Then I can't use shared CD because that's a channel for inadvertent privilege escalation.

And at the far end, if data-planes are to support unaffiliated tenants then we can presume no coordination between parties.

Another thought is that perhaps we should think about modeling "versions" of collections in flow, in a way that allows derivations to treat v1 of a source collection separately from v2.

You do this by creating a new collection / derivation with a v2 suffix. We should support collection labels for organization, but what else is needed? Like Go modules, new major versions are fundamentally different collections which happen to be meaningfully related only to humans.


Zooming out again, we have strong mechanisms that allow for cooperative detection of breaking changes between teams. Mechanisms that already exceed anything practical we could build at this layer of the service. Have one joint, and keep it well oiled.

jgraettinger commented 3 years ago

Some decision records from continuing to think about this / work on implementation:

jgraettinger commented 3 years ago

Closing as the data-plane portions of this have been implemented, and we'll consider further control plane work as future and separate items.