getnelson / nelson

Automated, multi-region container deployment
https://getnelson.io
Apache License 2.0

RFC: Nelson control plane #240

Open adelbertc opened 5 years ago

adelbertc commented 5 years ago

Summary

Nelson was designed to be extensible: all the high-level operations are abstracted away, and the system as a whole is programmed against interfaces. That adding Kubernetes support required only writing interpreters for the scheduler and health checker is a testament to this fact.

However, interpreters alone were not enough - we soon realized that, because of the number and variety of different systems, there was no way for Nelson to be prescriptive about deployments. This led to the implementation of blueprints, which have been used with great success and have solved numerous issues with regard to organizational choices without sacrificing Nelson's rigorous approach to deployment infrastructure.

We are now at another crossroads. While deployment blueprints give us a lot of flexibility along certain axes, they are not sufficient for full flexibility in deploying Nelson. To date the following issues have come up:

This RFC proposes to re-architect Nelson as a control plane: instead of both "deciding what to do" and "actually doing it," Nelson becomes purely about "deciding what to do" and emits these decisions as events, leaving the "actually doing it" to a data plane. This data plane would subscribe to events from Nelson, act on them accordingly, and - most importantly - be controlled by an organization. Different organizations with different policies would just have different data planes.

Design

Relevant initial Gitter discussion

Nelson is already built around an internal queuing model:

runBackgroundJob("auditor", cfg.auditor.process(cfg.storage)(cfg.pools.defaultExecutor))
runBackgroundJob("pipeline_processor", Stream.eval(Pipeline.task(cfg)(Pipeline.sinks.runAction(cfg))))
runBackgroundJob("workflow_logger", cfg.workflowLogger.process)
runBackgroundJob("routing_cron", routing.cron.consulRefresh(cfg) to Http4sConsul.consulSink)
runBackgroundJob("cleanup_pipeline", cleanup.CleanupCron.pipeline(cfg)(cfg.pools.defaultExecutor))
runBackgroundJob("sweeper", cleanup.Sweeper.process(cfg))
runBackgroundJob("deployment_monitor", DeploymentMonitor.loop(cfg))

The idea, then, is to take the subset of these background jobs that constitutes the "data plane" and, instead of having both a producer and a consumer inside Nelson, keep only the producer and relegate consumption to the downstream data plane.

The current thinking is these will stay in the control plane:

These will be relegated to the data plane:

For each of the data plane components, instead of the event being consumed by an implementation that actually acts on it, the event will be emitted to any subscribers listening on a network port. It is then on the subscriber to act on this information.
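To make the producer/subscriber split concrete, here is a minimal sketch of the two consumption modes. All names (`DataPlaneEvent`, `EventSink`, `SchedulerSink`, `NetworkSink`) are illustrative, not Nelson's actual API:

```scala
sealed trait DataPlaneEvent
final case class Launch(stackName: String) extends DataPlaneEvent
final case class Delete(stackName: String) extends DataPlaneEvent

trait EventSink {
  def publish(e: DataPlaneEvent): Unit
}

// Today: events are consumed in-process by an interpreter that acts on them.
final class SchedulerSink(runOnScheduler: String => Unit) extends EventSink {
  def publish(e: DataPlaneEvent): Unit = e match {
    case Launch(s) => runOnScheduler(s"launch $s")
    case Delete(s) => runOnScheduler(s"delete $s")
  }
}

// Proposed: the control plane only serializes and emits; whatever subscriber
// is listening on the network port decides how to act on the event.
final class NetworkSink(send: Array[Byte] => Unit) extends EventSink {
  def publish(e: DataPlaneEvent): Unit =
    send(e.toString.getBytes("UTF-8")) // wire format TBD; see the Protobuf step below
}
```

The point of the `EventSink` boundary is that the control plane's producer code is identical in both cases; only the sink implementation moves out of process.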

Implementation

Because deploying to a scheduler is the largest burden at the moment, the pipeline processor will be our first target. However, because launching, deleting, and health checking are all scheduler-specific functionality, we cannot migrate the pipeline processor alone; the cleanup pipeline, sweeper, and deployment monitor must move with it. The routing cron can likely be left as a separate step. Thus the migration order is:

  1. pipeline processor + cleanup pipeline + sweeper + deployment monitor
  2. routing cron

As for how to emit events, current proposals are:

Implementation steps

  1. Split the pipeline processor and cleanup pipeline into their distinct control plane/data plane parts - e.g. "what to deploy" vs. "how to deploy" and "mark as garbage" vs. "sweep garbage"
  2. Come up with the Protobuf data models for the events
  3. Write a reference implementation of a data plane that mimics the status quo
  4. Sink the pipeline processor, cleanup pipeline, sweeper, and deployment monitor into a network port
  5. Migrate the routing cron
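As a starting point for step 2, a hypothetical first cut of the event envelope might look like the following. Every field name, message name, and number here is a placeholder for discussion, not a committed schema:

```protobuf
syntax = "proto3";

package nelson.dataplane.v1;

message Event {
  uint64 sequence   = 1; // queue position, used for ordered ACKs
  string datacenter = 2; // logical DC the subscribing agent registered for
  oneof payload {
    Launch  launch  = 3; // pipeline processor: "what to deploy"
    Destroy destroy = 4; // cleanup pipeline / sweeper: "sweep garbage"
  }
}

message Launch  { string stack_name = 1; string blueprint = 2; }
message Destroy { string stack_name = 1; }

message Ack {
  uint64 sequence    = 1; // highest contiguous sequence completed
  string diagnostics = 2; // free-form info, e.g. surfaced on the Nelson CLI
}
```

A `oneof` payload keeps the wire protocol to a single stream of `Event` messages, so adding new data plane operations later is a backwards-compatible schema change.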

Other notes

Testing: Re-architecting Nelson into a control plane may also bring benefits to testing and demonstrations. Right now it has been hard to showcase Nelson because the control plane and data plane are tied together and thus require things like a functional Kubernetes cluster at startup. If instead Nelson just emitted events, we could, say, have a dummy scheduler interpret those events for demo purposes, or even have something that interprets those events as D3.js actions, where "deploy X" renders a node, service dependencies become edges, traffic shifting moves edges, and garbage collection deletes nodes.
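A dummy data plane along those lines could be as small as a pure function from events to display actions. This is only a sketch of the demo idea; the event names are invented for illustration:

```scala
sealed trait DataPlaneEvent
final case class Deploy(unit: String)                   extends DataPlaneEvent
final case class ShiftTraffic(from: String, to: String) extends DataPlaneEvent
final case class GarbageCollect(unit: String)           extends DataPlaneEvent

object DummyDataPlane {
  // Interpret control-plane events as rendering actions instead of
  // real scheduler calls - e.g. instructions for a D3.js front end.
  def interpret(e: DataPlaneEvent): String = e match {
    case Deploy(u)          => s"render node $u"     // deploy => add a node
    case ShiftTraffic(a, b) => s"move edge $a -> $b" // shift  => move an edge
    case GarbageCollect(u)  => s"delete node $u"     // GC     => remove a node
  }

  def main(args: Array[String]): Unit =
    List(Deploy("howdy@1.0"), ShiftTraffic("howdy@1.0", "howdy@1.1"))
      .map(interpret)
      .foreach(println)
}
```

Because the interpreter never touches a scheduler, a demo needs nothing but the control plane and this one process.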

adelbertc commented 5 years ago

Meeting notes 2019-08-13

Participants: @adelbertc @drewgonzales360 @lu4nm3 @miranhpark @timperrett

Terminology

CP: Nelson Control Plane

DC: Logical datacenter - even if two schedulers are located in the same physical DC, each scheduler is treated as a separate DC (perhaps 'Domain' would be a better phrase (#53), but this is the terminology Nelson currently uses.)

Agents: The "data plane" component(s) that will pull messages from the Nelson Control Plane. It is assumed each individual scheduler/datacenter will have one agent in charge of it.

Notes

                 ------
                 | CP |
                 ------
Scheduler A     /      \   Scheduler B
---------------/--    --\---------------
|             /  |    |  \             |
|    ----------  |    | -----------    |
|    |Agent A |  |    | | Agent B |    |
|    ----------  |    | -----------    |
------------------    ------------------

High-level flow

  1. Agents connect to the CP and register themselves as the agent for a particular DC. This agent is assumed to be the singular agent in charge of that DC. If this is the first time Nelson has seen this DC it will provision an internal queue for events for that DC.
    • If another agent later tries to register for the same DC the CP will optimistically assume a new Agent has taken over.
    • This also means enabling a DC no longer requires a configuration change and bounce, it just needs an Agent to register itself.
  2. When a Deployment request comes in, Nelson will figure out what to deploy and insert that event to the corresponding queue.
  3. Concurrently, the routing cron and GC processes will periodically figure out what they need to do from the DB, but instead of actually reaching out to Consul or the scheduler, will insert these as events into the queue.
  4. Periodically the agent is expected to reach out to the CP to ask for work.
    • This is a conscious choice to make work distribution pull-based as opposed to push-based: a pull-based approach only requires scheduler security policies to allow dialing out, which is more commonly permitted than also allowing dialing in.
    • The CP will batch and serialize events in the order they appear in the queue and respond. The agent is expected to work on and ACK these events in the correct order.
  5. As the agent completes work it will send an ACK to the CP. This ACK message will also contain free-form diagnostic information associated with the completed work that can be used for debugging (e.g. displayed on the Nelson CLI).
  6. Until an event is ACK'd by the Agent the CP will not remove the event from the queue.
    • This has the effect that if for whatever reason the Agent does not ACK events quickly enough (e.g. the Agent goes down for a while, or processing is slow), the queue will back up. The current model bounds the size of the queue and drops messages once the maximum limit is hit. A similar approach will be taken here, dropping the oldest messages first.
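The queue semantics in steps 4-6 - pull batches in order, remove nothing until it is ACK'd, drop oldest at the bound - can be sketched as follows. This is purely illustrative and not Nelson code; `DcQueue` and its methods are invented names:

```scala
// One bounded, in-order event queue per registered DC.
final class DcQueue[A](maxSize: Int) {
  private var seq = 0L
  private val pending = scala.collection.mutable.TreeMap.empty[Long, A]

  // Control plane inserts events; the oldest are dropped at the bound.
  def offer(a: A): Unit = synchronized {
    if (pending.size >= maxSize) pending -= pending.firstKey
    pending += (seq -> a)
    seq += 1
  }

  // Agent pulls a batch in queue order; nothing is removed yet.
  def pull(max: Int): List[(Long, A)] =
    synchronized(pending.take(max).toList)

  // Events are only removed once the agent ACKs them; an un-ACK'd
  // batch simply remains pending and backs the queue up.
  def ack(upTo: Long): Unit = synchronized {
    pending --= pending.keys.takeWhile(_ <= upTo).toList
  }
}
```

Keying the map by a monotonically increasing sequence number gives the agent a single cursor to ACK, which matches the "work on and ACK these events in the correct order" requirement above.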

Background details

Open questions

  1. Does the CP expect an explicit heartbeat every N seconds from an Agent? Or does an Agent just need to register itself once in the beginning and the CP will expect the Agent is "active" until another Agent registers itself (and thus becomes the new active Agent)?
  2. We expect Agents to provide a snapshot summary of the runtime systems they manage - do we expect these summaries to be provided only when the Agent asks for work from the CP, only when it ACKs work, or both?
  3. Because Agents now push diagnostics to the CP, as opposed to the current model, which is closer to pull behavior, there will be some staleness - is this OK? This is related to the answer to the second question above.

goedelsoup commented 5 years ago

A few questions that arise immediately for me:

  1. The definition of an agent implies a cluster/LAN singleton. Can we better enforce this via a semaphore built on Consul or DynamoDB? Perhaps this should also be pluggable. I'd prefer the constraint that only one can run over advising to only ever run one.
  2. What prevents the queue from being durable? Could we offer configurations backed by both an in-memory fs2 implementation and an SQS-backed one?
  3. In regard to the open questions - is there a straightforward path to just using a known consensus protocol like Raft within the JVM? I think this could significantly narrow the conversation on what's acceptable and how to implement it.
  4. I think the security model on the CP channel might need some more investigation. If we are sending hydrated blueprints, we are not just potentially holding secrets in memory but also sending them over the wire. This is a compliance nit, for sure.