getnelson / nelson

Automated, multi-region container deployment
https://getnelson.io
Apache License 2.0

RFC: Nelson control plane #240

Open adelbertc opened 5 years ago

adelbertc commented 5 years ago

Summary

Nelson was designed to be extensible: all the high-level operations are abstracted away, and the system as a whole is programmed against interfaces. That adding Kubernetes support required only writing interpreters for the scheduler and health checker is a testament to this fact.

However, interpreters alone were not enough - we soon realized that, because of the number and variety of different systems, there was no way for Nelson to be prescriptive about deployments. This led to the implementation of blueprints, which have been used with great success and have solved numerous issues with regard to organizational choices without sacrificing Nelson's rigorous approach to deployment infrastructure.

We are now at another crossroads. While deployment blueprints give us a lot of flexibility along certain axes, they are not sufficient for full flexibility in deploying Nelson. To date the following issues have come up:

This RFC proposes to re-architect Nelson as a control plane: instead of both "deciding what to do" and "actually doing it," Nelson becomes purely about "deciding what to do" and emits these decisions as events, leaving the "actually doing it" to a data plane. This data plane would subscribe to events from Nelson, act on them accordingly, and - most importantly - be controlled by an organization. Different organizations with different policies would just have different data planes.

Design

Relevant initial Gitter discussion

Nelson is already built around an internal queuing model:

runBackgroundJob("auditor", cfg.auditor.process(cfg.storage)(cfg.pools.defaultExecutor))
runBackgroundJob("pipeline_processor", Stream.eval(Pipeline.task(cfg)(Pipeline.sinks.runAction(cfg))))
runBackgroundJob("workflow_logger", cfg.workflowLogger.process)
runBackgroundJob("routing_cron", routing.cron.consulRefresh(cfg) to Http4sConsul.consulSink)
runBackgroundJob("cleanup_pipeline", cleanup.CleanupCron.pipeline(cfg)(cfg.pools.defaultExecutor))
runBackgroundJob("sweeper", cleanup.Sweeper.process(cfg))
runBackgroundJob("deployment_monitor", DeploymentMonitor.loop(cfg))

The idea, then, is to take the subset of these background jobs that constitutes the "data plane" and, instead of having both a producer and a consumer inside Nelson, keep only the producer and relegate consumption to the downstream data plane.

The current thinking is these will stay in the control plane:

These will be relegated to the data plane:

For each of the data plane components, instead of the event being consumed by an implementation that actually acts on it, the event will be emitted to any subscribers listening on a network port. It is then on the subscriber to act on this information.
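To make the producer/subscriber split concrete, here is a minimal sketch of the two consumption modes. All names (`DataPlaneEvent`, `EventSink`, `SchedulerSink`, `NetworkSink`) are illustrative, not Nelson's actual API:

```scala
sealed trait DataPlaneEvent
final case class Launch(stackName: String) extends DataPlaneEvent
final case class Delete(stackName: String) extends DataPlaneEvent

trait EventSink {
  def publish(e: DataPlaneEvent): Unit
}

// Today: events are consumed in-process by an interpreter that acts on them.
final class SchedulerSink(runOnScheduler: String => Unit) extends EventSink {
  def publish(e: DataPlaneEvent): Unit = e match {
    case Launch(s) => runOnScheduler(s"launch $s")
    case Delete(s) => runOnScheduler(s"delete $s")
  }
}

// Proposed: the control plane only serializes and emits; whatever subscriber
// is listening on the network port decides how to act on the event.
final class NetworkSink(send: Array[Byte] => Unit) extends EventSink {
  def publish(e: DataPlaneEvent): Unit =
    send(e.toString.getBytes("UTF-8")) // wire format TBD; see the Protobuf step below
}
```

The point of the `EventSink` boundary is that the control plane's producer code is identical in both cases; only the sink implementation moves out of process.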

Implementation

Because deploying to a scheduler is the largest burden at the moment, the pipeline processor will be our first target. However, because launching, deleting, and health checking are all scheduler-specific functionality, we cannot migrate the pipeline processor alone; the cleanup pipeline, sweeper, and deployment monitor must move with it. The routing cron can likely be left as a separate step. Thus the migration order is:

  1. pipeline processor + cleanup pipeline + sweeper + deployment monitor
  2. routing cron

As for how to emit events, current proposals are:

Implementation steps

  1. Split the pipeline processor and cleanup pipeline into their distinct control plane/data plane parts - e.g. "what to deploy" vs. "how to deploy" and "mark as garbage" vs. "sweep garbage"
  2. Come up with the Protobuf data models for the events
  3. Write a reference implementation of a data plane that mimics the status quo
  4. Sink the pipeline processor, cleanup pipeline, sweeper, and deployment monitor into a network port
  5. Migrate the routing cron
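As a starting point for step 2, a hypothetical first cut of the event envelope might look like the following. Every field name, message name, and number here is a placeholder for discussion, not a committed schema:

```protobuf
syntax = "proto3";

package nelson.dataplane.v1;

message Event {
  uint64 sequence   = 1; // queue position, used for ordered ACKs
  string datacenter = 2; // logical DC the subscribing agent registered for
  oneof payload {
    Launch  launch  = 3; // pipeline processor: "what to deploy"
    Destroy destroy = 4; // cleanup pipeline / sweeper: "sweep garbage"
  }
}

message Launch  { string stack_name = 1; string blueprint = 2; }
message Destroy { string stack_name = 1; }

message Ack {
  uint64 sequence    = 1; // highest contiguous sequence completed
  string diagnostics = 2; // free-form info, e.g. surfaced on the Nelson CLI
}
```

A `oneof` payload keeps the wire protocol to a single stream of `Event` messages, so adding new data plane operations later is a backwards-compatible schema change.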

Other notes

Testing: Re-architecting Nelson into a control plane may also bring benefits to testing and demonstrations. Right now it has been hard to showcase Nelson because the control plane and data plane are tied together and thus require things like a functional Kubernetes cluster at startup. If instead Nelson just emitted events, we could, say, have a dummy scheduler interpret those events for demo purposes, or even have something that interprets those events as D3.js actions, where "deploy X" renders a node, service dependencies become edges, traffic shifting moves edges, and garbage collection deletes nodes.
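A dummy data plane along those lines could be as small as a pure function from events to display actions. This is only a sketch of the demo idea; the event names are invented for illustration:

```scala
sealed trait DataPlaneEvent
final case class Deploy(unit: String)                   extends DataPlaneEvent
final case class ShiftTraffic(from: String, to: String) extends DataPlaneEvent
final case class GarbageCollect(unit: String)           extends DataPlaneEvent

object DummyDataPlane {
  // Interpret control-plane events as rendering actions instead of
  // real scheduler calls - e.g. instructions for a D3.js front end.
  def interpret(e: DataPlaneEvent): String = e match {
    case Deploy(u)          => s"render node $u"     // deploy => add a node
    case ShiftTraffic(a, b) => s"move edge $a -> $b" // shift  => move an edge
    case GarbageCollect(u)  => s"delete node $u"     // GC     => remove a node
  }

  def main(args: Array[String]): Unit =
    List(Deploy("howdy@1.0"), ShiftTraffic("howdy@1.0", "howdy@1.1"))
      .map(interpret)
      .foreach(println)
}
```

Because the interpreter never touches a scheduler, a demo needs nothing but the control plane and this one process.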

adelbertc commented 5 years ago

Meeting notes 2019-08-13

Participants: @adelbertc @drewgonzales360 @lu4nm3 @miranhpark @timperrett

Terminology

CP: Nelson Control Plane

DC: Logical datacenter - even if two schedulers are located in the same physical DC, each scheduler is treated as a separate DC (perhaps 'Domain' would be a better phrase (#53), but this is the terminology Nelson currently uses.)

Agents: The "data plane" component(s) that will pull messages from the Nelson Control Plane. It is assumed each individual scheduler/datacenter will have one agent in charge of it.

Notes

                 ------
                 | CP |
                 ------
Scheduler A     /      \   Scheduler B
---------------/--    --\---------------
|             /  |    |  \             |
|    ----------  |    | -----------    |
|    |Agent A |  |    | | Agent B |    |
|    ----------  |    | -----------    |
------------------    ------------------

High-level flow

  1. Agents connect to the CP and register themselves as the agent for a particular DC. This agent is assumed to be the singular agent in charge of that DC. If this is the first time Nelson has seen this DC it will provision an internal queue for events for that DC.
    • If another agent later tries to register for the same DC the CP will optimistically assume a new Agent has taken over.
    • This also means enabling a DC no longer requires a configuration change and bounce, it just needs an Agent to register itself.
  2. When a Deployment request comes in, Nelson will figure out what to deploy and insert that event to the corresponding queue.
  3. Concurrently, the routing cron and GC processes will periodically figure out what they need to do from the DB, but instead of actually reaching out to Consul or the scheduler, will insert these as events into the queue.
  4. Periodically the agent is expected to reach out to the CP to ask for work.
    • This is a conscious choice to make work distribution pull-based as opposed to push-based: a pull-based approach only requires scheduler security policies to allow dialing out, which is more commonly permitted than also allowing dialing in.
    • The CP will batch and serialize events in the order they appear in the queue and respond. The agent is expected to work on and ACK these events in the correct order.
  5. As the agent completes work it will send an ACK to the CP. This ACK message will also contain free-form diagnostic information associated with the completed work that can be used for debugging (e.g. displayed on the Nelson CLI).
  6. Until an event is ACK'd by the Agent the CP will not remove the event from the queue.
    • This has the effect that if for whatever reason the Agent does not ACK events quickly enough (e.g. the Agent goes down for a while, or processing is slow), the queue will back up. The current model bounds the size of the queue and drops messages once the maximum limit is hit. A similar approach will be taken here, dropping the oldest messages first.
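The queue semantics in steps 4-6 - pull batches in order, remove nothing until it is ACK'd, drop oldest at the bound - can be sketched as follows. This is purely illustrative and not Nelson code; `DcQueue` and its methods are invented names:

```scala
// One bounded, in-order event queue per registered DC.
final class DcQueue[A](maxSize: Int) {
  private var seq = 0L
  private val pending = scala.collection.mutable.TreeMap.empty[Long, A]

  // Control plane inserts events; the oldest are dropped at the bound.
  def offer(a: A): Unit = synchronized {
    if (pending.size >= maxSize) pending -= pending.firstKey
    pending += (seq -> a)
    seq += 1
  }

  // Agent pulls a batch in queue order; nothing is removed yet.
  def pull(max: Int): List[(Long, A)] =
    synchronized(pending.take(max).toList)

  // Events are only removed once the agent ACKs them; an un-ACK'd
  // batch simply remains pending and backs the queue up.
  def ack(upTo: Long): Unit = synchronized {
    pending --= pending.keys.takeWhile(_ <= upTo).toList
  }
}
```

Keying the map by a monotonically increasing sequence number gives the agent a single cursor to ACK, which matches the "work on and ACK these events in the correct order" requirement above.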

Background details

Open questions

  1. Does the CP expect an explicit heartbeat every N seconds from an Agent? Or does an Agent just need to register itself once in the beginning and the CP will expect the Agent is "active" until another Agent registers itself (and thus becomes the new active Agent)?
  2. We expect Agents to provide a snapshot summary of the runtime systems they manage - do we expect these summaries to be provided only when the Agent asks for work from the CP, only when it ACKs work, or both?
  3. Because Agents now push diagnostics to the CP, as opposed to the current model, which is closer to pull behavior, there will be some staleness - is this OK? This is related to the answer to the second question above.

goedelsoup commented 5 years ago

A few questions that arise immediately for me:

  1. The definition of an agent implies a cluster/LAN singleton. Can we better enforce this via a semaphore built on Consul or DynamoDB? Perhaps this should also be pluggable. I'd prefer the constraint that only one can run over advising to only ever run one.
  2. What prevents the queue from being durable? Could we offer configurations backed by both an in-memory fs2 implementation and an SQS-backed one?
  3. In regard to the open questions - is there a straightforward path to just using a known consensus protocol like Raft within the JVM? I think this could significantly narrow the conversation on what's acceptable and how to implement it.
  4. I think the security model on the CP channel might need some more investigation. If we are sending hydrated blueprints, we are not just potentially holding secrets in memory but also sending them over the wire. This is a compliance nit, for sure.