Indexing Entities in Postgres: As a general rule, we still want to rely on the Build SQLite databases for detailed Entity Specs. To supply the UI with cross-cutting info, we'll need to maintain a few secondary indexes in Postgres.
Specifically:
Resource Resolution: We discussed a desire for UI & CLI file-structure compatibility. This allows Catalogs created in the UI to be usefully represented in a CLI workflow. We'd like to avoid making a user who transitions to a GitOps workflow sift through a 2000-line yaml file.
We need to resolve references on the client side (UI or CLI). The Catalog payloads will include related Resources as bundled ResourceDefs. This allows the UI to handle them separately and the CLI to manage them as individual files. This should avoid the user needing to sift through unnecessary duplication.
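To make that concrete, here's a minimal sketch of what a bundled payload might look like. All names here (CatalogPayload, ResourceDef, and their fields) are illustrative assumptions, not the actual control-plane API:

```rust
use serde::{Deserialize, Serialize};
use serde_json::Value;

// Hypothetical shape for a Catalog payload that bundles related Resources
// as separate ResourceDefs, so the UI can handle them in place and the CLI
// can write them out as individual files.
#[derive(Serialize, Deserialize)]
struct CatalogPayload {
    /// The catalog spec itself, without inlined resources.
    catalog: Value,
    /// Related resources, bundled alongside rather than inlined.
    resources: Vec<ResourceDef>,
}

#[derive(Serialize, Deserialize)]
struct ResourceDef {
    /// Suggested file path for CLI workflows, e.g. "source-postgres.config.yaml".
    path: String,
    /// e.g. "CATALOG", "JSON_SCHEMA", or "CONFIG".
    content_type: String,
    /// The resource body.
    content: Value,
}
```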
Brainstormed Questions that the Control Plane will need to be able to answer:
Build/Test/Activate:
Users/Permissions:
Events:
General Principles:
Creation/Editing Workflows:
Changesets:
Builds:
Diffs:
Schema Followup:
Status/Metrics:
We're going to pause on the everyday discussions for a minute. We've got 3 things we want to work on in the meantime:
We'll reconvene as these things progress.
From Slack: https://estuary-dev.slack.com/archives/C01G7CFNA8K/p1642544683030000
What does it mean to "delete" an Entity? How do I include that in a Build to be applied? I know I can edit a ShardSpec with delete: true, but what does it mean at the Flow-abstraction level?
Deletion is a separate "deactivation" rpc, the inverse of an activation.
If you deactivate a collection we'll remove its journals. If you deactivate a catalog task we'll remove its shards and their recovery logs, etc.
It's dangerous enough to be special, not represented as part of a Catalog or a Build.
To recap, I'm now thinking of deletions as an explicit DELETE /entity/:id endpoint, which from my end is very simple. We can add as many safeguards and checks as we like, but it isn't part of the build/activate workflow.
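A minimal sketch of that endpoint, assuming an axum-style router (the handler body and status code are illustrative, not the actual implementation):

```rust
use axum::{extract::Path, http::StatusCode, routing::delete, Router};

// Hypothetical handler: deletion is its own endpoint, outside the
// build/activate workflow, so safeguards can be layered on freely.
async fn deactivate_entity(Path(id): Path<i64>) -> StatusCode {
    // Safeguards/checks would run here before issuing the deactivation RPC
    // that removes journals (collections) or shards + recovery logs (tasks).
    let _ = id;
    StatusCode::ACCEPTED
}

fn routes() -> Router {
    Router::new().route("/entity/:id", delete(deactivate_entity))
}
```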
A `state: json` column that holds freeform state information. Extended details can be found on #341.
GET /entity/:id
=> Response includes the Build ID currently "deployed"

GET /entity/:id/build/:build_id
=> Response includes the full specification of the Entity as defined by the specific Build

The import keyword implies a Strong Reference. Anything included by a Strong Reference will be included in the Build's list of Deployable Entities.

Thank you @saterus, great write-up of the conversation!
Minor comment on GET /entity/:id/build/:build_id: I'd pictured this as GET /build/:build_id/entity/:id -- that there would be various APIs which share the common context of a specific build ID, and extract information from it.
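Sketching both proposed shapes side by side, again assuming an axum-style router (handlers are stubs and the response bodies are placeholders):

```rust
use axum::{extract::Path, routing::get, Router};

// Two proposed shapes:
//   GET /entity/:id                  -> current state, incl. the deployed Build ID
//   GET /build/:build_id/entity/:id  -> the Entity's spec as defined by that Build
async fn entity_current(Path(id): Path<i64>) -> String {
    format!("entity {id}: currently deployed build ...")
}

async fn entity_in_build(Path((build_id, id)): Path<(i64, i64)>) -> String {
    format!("entity {id} as of build {build_id}")
}

fn routes() -> Router {
    Router::new()
        .route("/entity/:id", get(entity_current))
        .route("/build/:build_id/entity/:id", get(entity_in_build))
}
```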
`catalog_name`, which is present in the main Catalog Namespace. Ex. "accounts/alex", or "acmeCo/", or "acmeCo/anvilMaker".

Exported Excalidraw File
Johnny, Dave, Phil, and I got together to chat about what separate streams of work we could start on for the control plane. It immediately went further afield, and we started talking through the ramifications of grouping entities in the UI (akin to files with the CLI).
- `flowctl api discover` currently doesn't group "as files"
- `flowctl discover`, the higher level user-facing command, does this grouping today
- `control` should take the output of `api discover` and group it "like files"
- `POST /sops/document` endpoint which returns an encrypted sops payload
- `PATCH /sops/document` endpoint which applies a JSON Patch to the contents of the document
- `POST /builds`
- `.sops`
- cli: `flowctl` will call back into the Control Plane to check access during the build process
- `HEAD /entity/:id` endpoint to check access
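For the `PATCH /sops/document` behavior, here's a minimal sketch of applying an RFC 6902 JSON Patch to a decrypted document body, assuming we'd reach for the json-patch crate (an assumption, not a decision):

```rust
use serde_json::{json, Value};

// Assumed approach: apply an RFC 6902 JSON Patch to the decrypted document
// before handing it back to sops for re-encryption.
fn apply_document_patch(
    doc: &mut Value,
    patch: &json_patch::Patch,
) -> Result<(), json_patch::PatchError> {
    json_patch::patch(doc, patch)
}

fn main() {
    let mut doc = json!({"connectionURI": "postgres://..."});
    let patch: json_patch::Patch = serde_json::from_value(json!([
        {"op": "replace", "path": "/connectionURI",
         "value": "postgres://flow:flow@localhost:5432/control_development"}
    ]))
    .unwrap();
    apply_document_patch(&mut doc, &patch).unwrap();
}
```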
As a bit of a followup to our previous discussion, I wanted to document the current behavior of `flowctl discover` so we're all on the same page. We've been talking about the mechanisms to group things into files, but unless you've run `flowctl discover` lately, it may seem a bit abstract.
This is going to be relevant as we're wanting to have the Control Plane's discovery endpoint return more than just the connector's raw output (which is what I'm returning today). We may or may not want to target identical output to what `flowctl discover` does today, but it's at least a good starting point for the discussion.
Let's run through an example using `source-postgres` on the Control Plane database.
I ran `flowctl discover --image=ghcr.io/estuary/source-postgres:fb353df --prefix planetExpress` with a recent version of flowctl. This created a config file template inside a new `planetExpress` directory with the name `source-postgres.flow.yaml`. I edited this file to point to my local control-plane database (a bit recursive, but it's what I have handy).

Then I re-ran `flowctl discover --image=ghcr.io/estuary/source-postgres:fb353df --prefix planetExpress`. This performs the discovery of the tables and outputs the following files:
$ tree planetExpress/
planetExpress
├── _sqlx_migrations.schema.yaml
├── connector_images.schema.yaml
├── connectors.schema.yaml
├── source-postgres.config.yaml
└── source-postgres.flow.yaml
It has a single top-level file for the Capture itself, `source-postgres.flow.yaml`:
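# source-postgres.flow.yaml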
collections:
planetExpress/_sqlx_migrations:
schema: _sqlx_migrations.schema.yaml
key: [/version]
planetExpress/connector_images:
schema: connector_images.schema.yaml
key: [/id]
planetExpress/connectors:
schema: connectors.schema.yaml
key: [/id]
captures:
planetExpress/source-postgres:
endpoint:
connector:
image: ghcr.io/estuary/source-postgres:fb353df
config: source-postgres.config.yaml
bindings:
- resource:
namespace: public
stream: _sqlx_migrations
syncMode: incremental
target: planetExpress/_sqlx_migrations
- resource:
namespace: public
stream: connector_images
syncMode: incremental
target: planetExpress/connector_images
- resource:
namespace: public
stream: connectors
syncMode: incremental
target: planetExpress/connectors
Exploring from the top, we have a Collection entry for each table it discovered. It uses the (prefix + table name) to generate a name for the Collection. It infers the Collection key from the table's primary key.
It generates an accompanying schema file for each Collection/discovered-table:
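# _sqlx_migrations.schema.yaml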
properties:
checksum:
contentEncoding: base64
type: string
description:
type: string
execution_time:
type: integer
installed_on:
format: date-time
type: string
success:
type: boolean
version:
type: integer
required:
- version
type: object
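# connector_images.schema.yaml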
properties:
connector_id:
type: integer
created_at:
format: date-time
type: string
digest:
type: string
id:
type: integer
name:
type: string
tag:
type: string
updated_at:
format: date-time
type: string
required:
- id
type: object
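# connectors.schema.yaml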
properties:
created_at:
format: date-time
type: string
description:
type: string
id:
type: integer
maintainer:
type: string
name:
type: string
type:
type: string
updated_at:
format: date-time
type: string
required:
- id
type: object
Next, we have the config file referenced by the Capture definition. This config is not yet sops encrypted (referencing it after encryption works the same way, though):
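# source-postgres.config.yaml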
connectionURI: postgres://flow:flow@localhost:5432/control_development
# Connection parameters, as a libpq-compatible connection string
# [string] (required)
max_lifespan_seconds: 0
# When nonzero, imposes a maximum runtime after which to unconditionally shut down
# [number]
poll_timeout_seconds: 10
# When tail=false, controls how long to sit idle before shutting down
# [number]
publication_name: flow_publication
# The name of the PostgreSQL publication to replicate from
# [string]
slot_name: flow_slot
# The name of the PostgreSQL replication slot to replicate from
# [string]
watermarks_table: public.flow_watermarks
# The name of the table used for watermark writes during backfills
# [string]
One point I'd note from all of this, in contrast to our previous discussion, is that there just isn't that much "grouping" going on. Mostly these entities are placed in their own files and referenced where necessary.
We still want the Control Plane to call `flowctl api discover`, which currently only returns the raw "bindings" output from the connector. The wrapper `flowctl discover` is what takes these bindings, along with the config and Capture metadata, to craft these files.
I think we want to craft response payloads that could be used by the UI or the CLI: e.g. reimplementing the `flowctl discover` command through the discovery endpoint and using the response to create files.

In either case, it seems like we can model the response payload off the current output of `flowctl discover`. We'll want to change it some due to the format (we can't use comments on fields), but I don't foresee any major changes. I think this transformation from "bindings" to specs/schemas is probably best for the Control Plane to handle, to avoid needing to write it on both the CLI and UI sides. Our goal would be to return top-level "file-spec/schemas" that could easily be turned into a Catalog spec for use with the upcoming build endpoint.
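As a rough illustration, the response could look something like the sketch below. The name DiscoverResponse and its fields are assumptions modeled on the files `flowctl discover` writes today, not the actual endpoint:

```rust
use std::collections::BTreeMap;

use serde::Serialize;
use serde_json::Value;

// Hypothetical discovery-endpoint response, keyed by suggested file names so
// the CLI can write files directly and the UI can consume the same payload.
#[derive(Serialize)]
struct DiscoverResponse {
    /// Catalog spec(s), e.g. the contents of "source-postgres.flow.yaml".
    specs: BTreeMap<String, Value>,
    /// JSON schemas, e.g. "connectors.schema.yaml".
    schemas: BTreeMap<String, Value>,
    /// The (not yet encrypted) endpoint config, e.g. "source-postgres.config.yaml".
    config: Value,
}
```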
Control Plane / UX Meeting:
We talked a lot about the desired workflow for registration and login. We want to avoid doing extra work for the local login, as we know we want to use OpenID Connect in production. Ideally, the only difference will be in this login workflow, and all subsequent requests will act identically.
We ended up working up some sequence diagrams to describe these flows.
As you can see, the local login flow is just an abbreviated IDP login. There is no password for local logins; it is completely insecure. The user simply provides an account name they wish to log in as. This gets passed as the `auth_token`, and we use it to find_or_create an Account and Credential for them.
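A minimal sketch of that find_or_create, assuming sqlx and an accounts table keyed by a unique name (the table and column names are assumptions):

```rust
use sqlx::PgPool;

// Insecure local login: the account name arrives as the `auth_token`, and we
// find_or_create an Account row for it. Schema details here are assumptions.
async fn find_or_create_account(pool: &PgPool, name: &str) -> sqlx::Result<i64> {
    // Upsert keyed on the unique account name, returning its id either way.
    let (id,): (i64,) = sqlx::query_as(
        "INSERT INTO accounts (name) VALUES ($1)
         ON CONFLICT (name) DO UPDATE SET name = EXCLUDED.name
         RETURNING id",
    )
    .bind(name)
    .fetch_one(pool)
    .await?;
    Ok(id)
}
```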
In case it's useful for others, I've included the source for generating the diagrams.
This is a big system we're working on, and there are a lot of specific bits we still need to work out. We've been having regular design discussions to hash out the details, but there's quite a breadth of topics. Some of them will get their own dedicated Issue (like #341) if they warrant a lot of extended discussion.
I'm going to post the notes from our discussions here as a way to keep everyone else in the loop.