Open adelbertc opened 5 years ago
Participants: @adelbertc @drewgonzales360 @lu4nm3 @miranhpark @timperrett
CP: Nelson Control Plane.
DC: Logical datacenter. Even if two schedulers are located in the same physical datacenter, each scheduler is interpreted as a separate DC (perhaps "Domain" would be a better term [#53], but this is the terminology Nelson currently uses).
Agents: The "data plane" component(s) that pull messages from the Nelson Control Plane. It is assumed each individual scheduler/datacenter will have one agent in charge of it.
```
                ------
                | CP |
                ------
    Scheduler A /    \ Scheduler B
---------------/--  --\---------------
|             /  |  |  \             |
|   -----------  |  |   -----------  |
|   | Agent A |  |  |   | Agent B |  |
|   -----------  |  |   -----------  |
------------------  ------------------
```
A few questions that arise immediately for me:
Summary
Nelson was designed to be extensible: all the high-level operations are abstracted away and the system as a whole is programmed against interfaces. The fact that adding Kubernetes support required only writing interpreters for the scheduler and health checker is a testament to this.
However, interpreters alone were not enough: we soon realized that, given the plethora of different systems and their flexibility, there was no way for Nelson to be prescriptive about deployments. This led to the implementation of blueprints, which have been used with great success and have solved numerous issues around organizational choices without sacrificing Nelson's rigorous approach to deployment infrastructure.
We are now at another crossroads. While deployment blueprints give us a lot of flexibility along certain axes, they are not sufficient for full flexibility in deploying Nelson. To date the following issues have come up:
This RFC proposes to re-architect Nelson as a control plane: instead of both "deciding what to do" and "actually doing it", Nelson becomes purely about "deciding what to do" and emits these decisions as events, leaving the "actually doing it" to a data plane. This data plane would subscribe to events from Nelson, act on them accordingly, and, most importantly, be controlled by the organization itself. Different organizations with different policies would simply have different data planes.
Design
Relevant initial Gitter discussion
Nelson is already built around an internal queuing model:
The idea then is to take the subset of these background jobs that constitute the "data plane" and instead of having both a producer and consumer inside Nelson, have only the producer and relegate the consumption to the downstream data plane.
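As a rough sketch of that split (in Python for brevity, with invented names; Nelson's actual internals are Scala and differ in detail): today a background job owns both ends of its queue, while in the proposed model Nelson keeps only the producer side and external subscribers do the work.

```python
import queue
from dataclasses import dataclass
from typing import Callable, List

# Illustrative event type; field names are hypothetical, not Nelson's actual model.
@dataclass
class DeploymentEvent:
    unit: str
    action: str  # e.g. "launch", "delete", "health-check"

# Today: Nelson owns both ends of the queue, in one process.
class InProcessPipeline:
    def __init__(self) -> None:
        self._q: "queue.Queue[DeploymentEvent]" = queue.Queue()

    def produce(self, event: DeploymentEvent) -> None:
        self._q.put(event)

    def consume(self) -> DeploymentEvent:
        # The consumer (e.g. the scheduler interpreter) runs inside Nelson itself.
        return self._q.get()

# Proposed: Nelson keeps only the producer side; data-plane
# subscribers register from outside and perform the actual work.
class ControlPlanePipeline:
    def __init__(self) -> None:
        self._subscribers: List[Callable[[DeploymentEvent], None]] = []

    def subscribe(self, handler: Callable[[DeploymentEvent], None]) -> None:
        self._subscribers.append(handler)

    def produce(self, event: DeploymentEvent) -> None:
        for handler in self._subscribers:
            handler(event)
```

The key point is that `ControlPlanePipeline` has no `consume` method at all: whether an event triggers a Nomad call, a Kubernetes call, or nothing is entirely the subscriber's business.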
The current thinking is these will stay in the control plane:
These will be relegated to the data plane:
For each of the data plane components, instead of being consumed by an in-process implementation that actually acts on the event, events will be emitted to any subscribers listening on a network port. It is then up to the subscriber to act on this information.
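One plausible shape for this, sketched in Python with an assumed newline-delimited-JSON wire format (the RFC does not fix a transport or encoding, so this is only one possibility):

```python
import json
import socket

def emit(conn: socket.socket, event: dict) -> None:
    # Assumed wire format: one JSON object per line.
    conn.sendall((json.dumps(event) + "\n").encode("utf-8"))

def receive(conn: socket.socket) -> dict:
    # Read until a full line has arrived, then decode it.
    buf = b""
    while not buf.endswith(b"\n"):
        chunk = conn.recv(4096)
        if not chunk:
            break
        buf += chunk
    return json.loads(buf)

# An in-process socket pair stands in for a real network port here.
control_plane, subscriber = socket.socketpair()
emit(control_plane, {"action": "launch", "unit": "howdy-http", "datacenter": "dc-a"})
event = receive(subscriber)
```

The event payload (`action`, `unit`, `datacenter`) is invented for illustration; the real schema would fall out of the migration work described below.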
Implementation
Because deploying to a scheduler is the largest burden at the moment, the pipeline processor will be our first target. However, because launching, deleting, and health checking are all scheduler-specific functionality, we cannot migrate the pipeline processor alone; the cleanup pipeline, sweeper, and deployment monitor must move with it. The routing cron can likely be left as a separate step. Thus the migration order is:
As for how to emit events, current proposals are:
Implementation steps
Other notes
Testing: Re-architecting Nelson into a control plane may also bring benefits to testing and demonstrations. Right now it has been hard to showcase Nelson because the control plane and data plane are tied together, requiring things like a functional Kubernetes cluster just to start up. If instead Nelson just emitted events, we could, say, have a dummy scheduler interpret those events for demo purposes, or even have something that interprets those events as D3.js actions, where "deploy X" becomes rendering a node, service dependencies become edges, traffic shifting becomes moving edges, and garbage collection becomes deleting nodes.
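A toy subscriber along those lines might look like the following (event shapes are invented for illustration; a real demo would feed the resulting graph to D3.js):

```python
# A dummy "scheduler" that interprets control-plane events as edits to a
# graph: deploys add nodes, dependencies add edges, and garbage collection
# deletes nodes along with any edges that touch them.
class GraphDemo:
    def __init__(self) -> None:
        self.nodes: set = set()
        self.edges: set = set()

    def handle(self, event: dict) -> None:
        action = event["action"]
        if action == "deploy":
            self.nodes.add(event["unit"])
        elif action == "depends_on":
            self.edges.add((event["unit"], event["target"]))
        elif action == "garbage_collect":
            self.nodes.discard(event["unit"])
            self.edges = {e for e in self.edges if event["unit"] not in e}

demo = GraphDemo()
demo.handle({"action": "deploy", "unit": "howdy-http"})
demo.handle({"action": "deploy", "unit": "howdy-db"})
demo.handle({"action": "depends_on", "unit": "howdy-http", "target": "howdy-db"})
demo.handle({"action": "garbage_collect", "unit": "howdy-db"})
```

Nothing here needs a Kubernetes cluster: the demo consumes exactly the same event stream a real data plane would.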