Indexing Entities in Postgres: As a general rule, we still want to rely on the Build SQLite databases for detailed Entity Specs. To supply the UI with cross-cutting info, we'll need to maintain a few secondary indexes in Postgres.
Specifically:
Resource Resolution: We discussed a desire for UI & CLI file-structure compatibility. This allows Catalogs created in the UI to be usefully represented in a CLI workflow. We'd like to avoid making a user who transitions to a GitOps workflow sift through a 2000-line yaml file.
We need to resolve references on the client side (UI or CLI). The Catalog payloads will include related Resources as bundled ResourceDefs. This allows the UI to handle them separately and the CLI to manage them as individual files. This should avoid the user needing to sift through unnecessary duplication.
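To make that concrete, here's a minimal sketch of what a bundled payload might look like. All names here (CatalogPayload, ResourceDef, and their fields) are illustrative assumptions, not the actual control-plane API:

```rust
use serde::{Deserialize, Serialize};
use serde_json::Value;

// Hypothetical shape for a Catalog payload that bundles related Resources
// as separate ResourceDefs, so the UI can handle them in place and the CLI
// can write them out as individual files.
#[derive(Serialize, Deserialize)]
struct CatalogPayload {
    /// The catalog spec itself, without inlined resources.
    catalog: Value,
    /// Related resources, bundled alongside rather than inlined.
    resources: Vec<ResourceDef>,
}

#[derive(Serialize, Deserialize)]
struct ResourceDef {
    /// Suggested file path for CLI workflows, e.g. "source-postgres.config.yaml".
    path: String,
    /// e.g. "CATALOG", "JSON_SCHEMA", or "CONFIG".
    content_type: String,
    /// The resource body.
    content: Value,
}
```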
Brainstormed Questions that the Control Plane will need to be able to answer:
Build/Test/Activate:
Users/Permissions:
Events:
General Principles:
Creation/Editing Workflows:
Changesets:
Builds:
Diffs:
Schema Followup:
Status/Metrics:
We're going to pause on the everyday discussions for a minute. We've got 3 things we want to work on in the meantime:
We'll reconvene as these things progress.
From Slack: https://estuary-dev.slack.com/archives/C01G7CFNA8K/p1642544683030000
What does it mean to "delete" an Entity? How do I include that in a Build to be applied? I know I can edit a ShardSpec with delete: true, but what does it mean at the Flow-abstraction level?
Deletion is a separate "deactivation" rpc, the inverse of an activation.
If you deactivate a collection we'll remove its journals. If you deactivate a catalog task we'll remove its shards and their recovery logs, etc.
It's dangerous enough to be special, not represented as part of a Catalog or a Build.
To recap, I'm now thinking of deletions as an explicit DELETE /entity/:id endpoint, which from my end is very simple. We can add as many safeguards and checks as we like, but it isn't part of the build/activate workflow.
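A minimal sketch of that endpoint, assuming an axum-style router (the handler body and status code are illustrative, not the actual implementation):

```rust
use axum::{extract::Path, http::StatusCode, routing::delete, Router};

// Hypothetical handler: deletion is its own endpoint, outside the
// build/activate workflow, so safeguards can be layered on freely.
async fn deactivate_entity(Path(id): Path<i64>) -> StatusCode {
    // Safeguards/checks would run here before issuing the deactivation RPC
    // that removes journals (collections) or shards + recovery logs (tasks).
    let _ = id;
    StatusCode::ACCEPTED
}

fn routes() -> Router {
    Router::new().route("/entity/:id", delete(deactivate_entity))
}
```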
A `state: json` column that holds freeform state information. Extended details can be found on #341.
GET /entity/:id
=> Response includes the Build ID currently "deployed"

GET /entity/:id/build/:build_id
=> Response includes the full specification of the Entity as defined by the specific Build

The import keyword implies a Strong Reference. Anything included by a Strong Reference will be included in the Build's list of Deployable Entities.

Thank you @saterus, great write-up of the conversation!
Minor comment on GET /entity/:id/build/:build_id: I'd pictured this as GET /build/:build_id/entity/:id -- that there would be various APIs which share the common context of a specific build ID, and extract information from it.
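Sketching both proposed shapes side by side, again assuming an axum-style router (handlers are stubs and the response bodies are placeholders):

```rust
use axum::{extract::Path, routing::get, Router};

// Two proposed shapes:
//   GET /entity/:id                  -> current state, incl. the deployed Build ID
//   GET /build/:build_id/entity/:id  -> the Entity's spec as defined by that Build
async fn entity_current(Path(id): Path<i64>) -> String {
    format!("entity {id}: currently deployed build ...")
}

async fn entity_in_build(Path((build_id, id)): Path<(i64, i64)>) -> String {
    format!("entity {id} as of build {build_id}")
}

fn routes() -> Router {
    Router::new()
        .route("/entity/:id", get(entity_current))
        .route("/build/:build_id/entity/:id", get(entity_in_build))
}
```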
`catalog_name`, which is present in the main Catalog Namespace. Ex. "accounts/alex", or "acmeCo/", or "acmeCo/anvilMaker".

Exported Excalidraw File
Johnny, Dave, Phil, and I got together to chat about what separate streams of work we could start on for the control plane. It immediately went further afield, and we started talking through the ramifications of grouping entities in the UI (akin to files with the CLI).
- `flowctl api discover` currently doesn't group "as files"
- `flowctl discover`, the higher level user-facing command, does this grouping today
- `control` should take the output of `api discover` and group it "like files"
- `POST /sops/document` endpoint which returns an encrypted sops payload
- `PATCH /sops/document` endpoint which applies a JSON Patch to the contents of the document
- `POST /builds`
- `.sops`
- cli: `flowctl` will call back into the Control Plane to check access during the build process
- `HEAD /entity/:id` endpoint to check access
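For the `PATCH /sops/document` behavior, here's a minimal sketch of applying an RFC 6902 JSON Patch to a decrypted document body, assuming we'd reach for the json-patch crate (an assumption, not a decision):

```rust
use serde_json::{json, Value};

// Assumed approach: apply an RFC 6902 JSON Patch to the decrypted document
// before handing it back to sops for re-encryption.
fn apply_document_patch(
    doc: &mut Value,
    patch: &json_patch::Patch,
) -> Result<(), json_patch::PatchError> {
    json_patch::patch(doc, patch)
}

fn main() {
    let mut doc = json!({"connectionURI": "postgres://..."});
    let patch: json_patch::Patch = serde_json::from_value(json!([
        {"op": "replace", "path": "/connectionURI",
         "value": "postgres://flow:flow@localhost:5432/control_development"}
    ]))
    .unwrap();
    apply_document_patch(&mut doc, &patch).unwrap();
}
```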
As a bit of a followup to our previous discussion, I wanted to document the current behavior of `flowctl discover` so we're all on the same page. We've been talking about the mechanisms to group things into files, but unless you've run `flowctl discover` lately, it may seem a bit abstract.
This is going to be relevant as we're wanting to have the Control Plane's discovery endpoint return more than just the connector's raw output (which is what I'm returning today). We may or may not want to target identical output to what `flowctl discover` does today, but it's at least a good starting point for the discussion.
Let's run through an example using `source-postgres` on the Control Plane database.
I ran `flowctl discover --image=ghcr.io/estuary/source-postgres:fb353df --prefix planetExpress` with a recent version of flowctl. This created a config file template inside a new `planetExpress` directory with the name `source-postgres.flow.yaml`. I edited this file to point to my local control-plane database (a bit recursive, but it's what I have handy).

Then I re-ran `flowctl discover --image=ghcr.io/estuary/source-postgres:fb353df --prefix planetExpress`. This performs the discovery of the tables and outputs the following files:
$ tree planetExpress/
planetExpress
├── _sqlx_migrations.schema.yaml
├── connector_images.schema.yaml
├── connectors.schema.yaml
├── source-postgres.config.yaml
└── source-postgres.flow.yaml
It has a single top-level file for the Capture itself, `source-postgres.flow.yaml`:
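# source-postgres.flow.yaml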
collections:
planetExpress/_sqlx_migrations:
schema: _sqlx_migrations.schema.yaml
key: [/version]
planetExpress/connector_images:
schema: connector_images.schema.yaml
key: [/id]
planetExpress/connectors:
schema: connectors.schema.yaml
key: [/id]
captures:
planetExpress/source-postgres:
endpoint:
connector:
image: ghcr.io/estuary/source-postgres:fb353df
config: source-postgres.config.yaml
bindings:
- resource:
namespace: public
stream: _sqlx_migrations
syncMode: incremental
target: planetExpress/_sqlx_migrations
- resource:
namespace: public
stream: connector_images
syncMode: incremental
target: planetExpress/connector_images
- resource:
namespace: public
stream: connectors
syncMode: incremental
target: planetExpress/connectors
Exploring from the top, we have a Collection entry for each table it discovered. It uses the (prefix + table name) to generate a name for the Collection. It infers the Collection key from the table's primary key.
It generates an accompanying schema file for each Collection/discovered-table:
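# _sqlx_migrations.schema.yaml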
properties:
checksum:
contentEncoding: base64
type: string
description:
type: string
execution_time:
type: integer
installed_on:
format: date-time
type: string
success:
type: boolean
version:
type: integer
required:
- version
type: object
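# connector_images.schema.yaml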
properties:
connector_id:
type: integer
created_at:
format: date-time
type: string
digest:
type: string
id:
type: integer
name:
type: string
tag:
type: string
updated_at:
format: date-time
type: string
required:
- id
type: object
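# connectors.schema.yaml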
properties:
created_at:
format: date-time
type: string
description:
type: string
id:
type: integer
maintainer:
type: string
name:
type: string
type:
type: string
updated_at:
format: date-time
type: string
required:
- id
type: object
Next, we have the config file referenced by the Capture definition. This config is not yet sops encrypted (referencing it after encryption works the same way, though):
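# source-postgres.config.yaml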
connectionURI: postgres://flow:flow@localhost:5432/control_development
# Connection parameters, as a libpq-compatible connection string
# [string] (required)
max_lifespan_seconds: 0
# When nonzero, imposes a maximum runtime after which to unconditionally shut down
# [number]
poll_timeout_seconds: 10
# When tail=false, controls how long to sit idle before shutting down
# [number]
publication_name: flow_publication
# The name of the PostgreSQL publication to replicate from
# [string]
slot_name: flow_slot
# The name of the PostgreSQL replication slot to replicate from
# [string]
watermarks_table: public.flow_watermarks
# The name of the table used for watermark writes during backfills
# [string]
One point I'd note from all of this, in contrast to our previous discussion, is that there just isn't that much "grouping" going on. Mostly these entities are placed in their own files and referenced where necessary.
We still want the Control Plane to call `flowctl api discover`, which currently only returns the raw "bindings" output from the connector. The wrapper `flowctl discover` is what takes these bindings, along with the config and Capture metadata, to craft these files.
I think we want to craft response payloads that could be used by the UI or the CLI: e.g. reimplementing the `flowctl discover` command through the discovery endpoint and using the response to create files.

In either case, it seems like we can model the response payload off the current output of `flowctl discover`. We'll want to change it some due to the format (we can't use comments on fields), but I don't foresee any major changes. I think this transformation from "bindings" to specs/schemas is probably best for the Control Plane to handle, to avoid needing to write it on both the CLI and UI sides. Our goal would be to return top-level "file-spec/schemas" that could easily be turned into a Catalog spec for use with the upcoming build endpoint.
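As a rough illustration, the response could look something like the sketch below. The name DiscoverResponse and its fields are assumptions modeled on the files `flowctl discover` writes today, not the actual endpoint:

```rust
use std::collections::BTreeMap;

use serde::Serialize;
use serde_json::Value;

// Hypothetical discovery-endpoint response, keyed by suggested file names so
// the CLI can write files directly and the UI can consume the same payload.
#[derive(Serialize)]
struct DiscoverResponse {
    /// Catalog spec(s), e.g. the contents of "source-postgres.flow.yaml".
    specs: BTreeMap<String, Value>,
    /// JSON schemas, e.g. "connectors.schema.yaml".
    schemas: BTreeMap<String, Value>,
    /// The (not yet encrypted) endpoint config, e.g. "source-postgres.config.yaml".
    config: Value,
}
```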
Control Plane / UX Meeting:
We talked a lot about the desired workflow for registration and login. We want to avoid doing extra work for the local login, as we know we want to use OpenID Connect in production. Ideally, the only difference will be in this login workflow, and all subsequent requests will act identically.
We ended up working up some sequence diagrams to describe these flows.
As you can see, the local login flow is just an abbreviated IDP login. There is no password for local logins; it is completely insecure. The user simply provides an account name they wish to log in as. This gets passed as the `auth_token`, and we use it to find_or_create an Account and Credential for them.
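A minimal sketch of that find_or_create, assuming sqlx and an accounts table keyed by a unique name (the table and column names are assumptions):

```rust
use sqlx::PgPool;

// Insecure local login: the account name arrives as the `auth_token`, and we
// find_or_create an Account row for it. Schema details here are assumptions.
async fn find_or_create_account(pool: &PgPool, name: &str) -> sqlx::Result<i64> {
    // Upsert keyed on the unique account name, returning its id either way.
    let (id,): (i64,) = sqlx::query_as(
        "INSERT INTO accounts (name) VALUES ($1)
         ON CONFLICT (name) DO UPDATE SET name = EXCLUDED.name
         RETURNING id",
    )
    .bind(name)
    .fetch_one(pool)
    .await?;
    Ok(id)
}
```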
In case it's useful for others, I've included the source for generating the diagrams.
This is a big system we're working on, and there are a lot of specific bits we still need to work out. We've been having regular design discussions to hash out the details, but there's quite a breadth of topics. Some of them will get their own dedicated Issue (like #341) if they warrant a lot of extended discussion.
I'm going to post the notes from our discussions here as a way to keep everyone else in the loop.