Feature Name: blueprints for scheduler API interfaces
Start Date: August 9, 2018

Summary

With the growing complexity of prevalent scheduling interfaces - Kubernetes being the obvious, bloated example - it simply cannot make sense for Nelson to continue to wrap and hide the details of those scheduling systems in entirety; those options exist for good reason, but in doing so, those interfaces overwhelm users with complexity and wind up making easy things hard, and hard things easier (note: not "easy"). This seems to be an unacceptable trade off in developer experience, and whilst the author recognizes that someone must deal with this complexity, cascading it through an organization under the flag of so-called DevOps is not the way to do things. Instead, we must look for better interfaces - more optiomal trade offs. In this frame, the author has been reflecting on the proposal in https://github.com/getnelson/nelson/issues/68. This proposal was the first step in considering how to support more sophisticated, broader use-cases, and thus enable adoption of Nelson by a greater number of users, but was in and of itself quite limited. The addition of sidecar routing would have enabled more options for routing topologies, but what about voluming or node selection for hardware specialization? Many cases would be left still unsolved, and the manifest would have adopted a range of additional boilerplate, providing indirection but not abstraction. The power of Nelson for users comes from the fact that it enables the developer to think solely about their application - this is a property we should strive to retain and prioritize as a north star.

The author would like to propose that no organization can succeed with zero people who understand the scheudling interfaces that they have elected to use. If we assume that the number of staff understanding said interfaces is greater than zero, then we can assume that there is an agent whom is able to configure a so-called "blueprints".

Blueprints

Every submission to a scheudler has to make certain assumptions: perhaps its something simple like what sidecar you might use for log extraction, or maybe its something more sophisticated around how you handle routing or security... whatever it is, there is an agent - an operator / admin / engineer - who understands and knowingly makes those trade-offs. What I would like to propose is that we leverage this fact, and provide the ability for a workflow to take a "Blueprint", where Nelson will essentially execute a transformation on the input template, substituing values and transposing in the Nelson values (more details below), then sending the tempate to the scheudler. An example might be (for a k8s config):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ stack_name }}
  labels:
    stack: {{ stack_name }}
spec:
  replicas: {{ instances.desired }}
  selector:
    matchLabels:
      stack: {{ stack_name }}
  template:
    metadata:
      labels:
        stack: {{ stack_name }}
    spec:
      containers:
      - name: {{ stack_name }}
        image: {{ deployable.container }}
        ports:
        - containerPort: {{ ports.default }}

Whilst this is only a potential example, it should hopefully illustrate the general idea. An administrator seeds Nelson with this blueprint, and then at runtime Nelson fuses that scheudler-specific template with a smattering of information from the manifest. This could mean that a nelson manifest then looks something like this:

[...]
units:
  - name: foobar
    description: description for the foo service
    ports:
      - default->9000/http
    dependencies:
      - ref: inventory@1.4
    workflow:
      kind: magentar
      blueprint: use-nvidia-1080ti@3
    meta:
      - foobar
      - buzzfiz
[...]

Here we see the ability to specify a blueprint, and a revision of that blueprint, to be used with the workflow loaded by name. This would mean that at execution time the workflow would load the blueprint and fuse the Nelson data before sending it to the scheudler defined in the workflow (specification of which is out of scope for this RFC).

Implementation

Everything in Nelson was very conciously versioned. We strived for this even from the early days of the project and it has helped immensely as it has meant we could largely use append-only storage, ignore mutability and a wide variety of race conditions. The one place that we do not do this is in the admin-operator supplied configuration which requires a reboot of the Nelson system. At the time of writing, this allowed the operator to essentially change things - mutably - about the way deployments happened. Prior to Nelson becoming open-source this actually tripped us up a few times when we broke the log exporters, and then would have a set of deployments that were not exporting logs, unbeknownst to their user who had not changed anything. This is a crappy experience. With this frame, we should consider the addition of a blueprints API where administrators can add blueprints by name, and we generate a revision against that blueprint. Submitting a blueprint might look something like this:

{
  "name": "use-nvidia-1080ti",
  "description": "only scheudle on nodes with nvida 1080ti hardware"
  "content": "<base64 encoded template>"
}

And Nelson might respond with something like:

{
  "name": "use-nvidia-1080ti",
  "sha256": "41242f08c7a0bdfcee03f23864ee096966ab7982384ea9482fd8484b9ba49256",
  "revision": 6
  "state": "pending"
  "created_at": "2018-08-09 20:28:05Z"
}

The response will indicate that Nelson accepted the blueprint and that it is pending validation. What does pending mean? Adding a new blueprint does not mean that it is ready for use - as the operator could have essentially changed / broken anything, before making it available it should be validated.

States

The author proposes the following blueprints states:

pending: the blueprint is new, and awaiting validation
validating: presently validating that this new blueprint does not result in error when submitted. Exactly how this should work is up for discussion, but it feels needed to prevent immediate breakages for end-users.
active: the normal state for a given blueprint revision
deprecated: the relative "dead" state for a given blueprint. As operators we can never guarentee that a given blueprint is never used anywhere in an ecosystem, this state is for book-keeping and indicating to users that they should upgrade.
invalid: the blueprint failed validation and is not available for use

Manifest Syntax

Additionally, the author also proposes the following options for manifest syntax in an effort to mittigate drift for various Nelson manifests accross repos:

Direct reference to the revision (for the most change sensitive users)

workflow:
  kind: magentar
  blueprint: use-nvidia-1080ti@3

For systems that need special vetting or have business-critical impact, fixing to a specific revision of the deployment blueprint could be highly advantagous.

HEAD reference to the revision (for the lead change sensitive users)

workflow:
  kind: magentar
  blueprint: use-nvidia-1080ti@HEAD

When using HEAD, the latest active state revision of a given blueprint will be used. In general, this should be fine for the majority of users working on non-edge systems.

Open Questions

When new blueprint is submitted, how do we go about checking that it works at all?
- seems like this will be required to prevent total breakages in the system
When Nelson first boots up, do we have zero blueprints? What is the on-boarding flow?
- What about shipping some (any?) stock blueprints?
Should users only ever depend on HEAD of a blueprint? What's the upgrade flow?
Audit API to say which blueprints are related to which units
- required for deprecating or upgrading blueprints

@adelbertc @stew this is a distillation of my thinking from lunch today. Also adding @okoye - he might have some thoughts on this from his work on Spinnaker.

As an additional data point, Spinnaker actually did go down the path of maintaining their own object model and wrapping everything: https://www.spinnaker.io/reference/providers/kubernetes-v2/ - what’s interesting here is that they end up with some very specific naming conventions and things they expect the deployment to look like. Essentially falling fowl of the problems this RFC looks to avoid. If we can learn from their experience at all, that would be great

This all looks great to me, and sounds much better than the alternative of proxying the gajillion different knobs Kubernetes gives you.

As an observation, it seems v1 of Spinnaker's K8s Provider was where v1 of Nelson's manifest was heading (see #69 and #78), and v2 is similar to this proposal. One difference is Spinnaker gives service owners essentially full access to their namespaces, allowing them to specify the K8s spec in full and also allowing them to delete workloads/objects. In contrast Nelson explicitly only allows creation (workflow) and querying (Nelson CLI), but restricts deletion since Nelson wants to control the lifecycle to make immutable deployments manageable.

paging @kaiserpelagic for his thoughts also

Some more thoughts:

Blueprints

In the above manifest you have the blueprint on a per-unit basis instead of a per-plan basis. In the context of Kubernetes, the JSON/YAML differs depending on if you're creating a Deployment vs. a Job vs. a CronJob. Since the plan between environments can differ (for instance at Target we have a dev plan that is schedule: once (Job) and a prod plan that is schedule: "0 6 * * *) (CronJob), the blueprint has to be tied to the plan.

One way around this is to require blueprints have definitions for all three of Deployments, Job, and CronJobs, and interpolate depending on the plan. This seems very cumbersome and perhaps too restrictive - I do not think it would be unreasonable to have a blueprint that only makes sense for certain workloads, such as a blueprint for workloads that need GPUs that only (Cron)Jobs can use.

Another way might be to only allow blueprints to control the Pod-specific bits of the configuration, since (I think) Deployments, Jobs, and CronJobs all have the Pod specification as a subset of their spec. This is flexible enough to allow for things like exposing ports, sidecars, healthchecks, but (cron) job specific stuff like backoff limits, completions, parallelism would be left out.. and those seem quite useful.

More thoughts to come..

States of the blueprint lifecycle

deprecated: the relative "dead" state for a given blueprint. As operators we can never guarentee that a given blueprint is never used anywhere in an ecosystem, this state is for book-keeping and indicating to users that they should upgrade.

Related to this, your comment in the Questions section about the Audit API suggests Nelson will do bookkeeping on blueprint states not unlike what it does for stack lifecycles. If we decide to move forward with such bookkeeping, and we track what stack deployments use which blueprints, we can do the same thing we do with deprecated stacks and disallow new deployments to depend on deprecated blueprints. Things are also slightly simplified here since no cleanup is needed.

One use case for wanting to do this bookkeeping is if a blueprint deploys a sidecar which is later deemed to have a vulnerability or something - it would be useful for admins to be able to not only deprecate the blueprint to prevent further deployments using it, but also figure out what deployments are currently using it.

Questions

When new blueprint is submitted, how do we go about checking that it works at all? Seems like this will be required to prevent total breakages in the system.
- kubectl ostensibly has a --validate flag that can be used with --dry-run to, dare I say, type check the object configuration. However looking at the K8s API docs there does not seem to be an API endpoint to hit to do validation.. unsure how kubectl itself is doing it. This of course is a Kubernetes-specific point, similar validation for say Nomad or Mesos would have to depend on the toolchain they provide.
When Nelson first boots up, do we have zero blueprints? What is the on-boarding flow? What about shipping some (any?) stock blueprints?
- I think shipping some stock blueprints would be useful both for demo and getting started purposes. I can imagine very vanilla deployments that are say, one instance, no sidecar, no healthchecks, no volumes, no ports exposed, etc.
Should users only ever depend on HEAD of a blueprint? What's the upgrade flow?
- I don't think this restriction is necessary, users should be able to depend on whatever they want. The only exception I can think of is forbidding depending on a deprecated blueprint, similar to what we do for service dependencies right now. More thoughts below..
Audit API to say which blueprints are related to which units are required for deprecating or upgrading blueprints
- See my above comment on blueprint lifecycles

Other questions I have

Do we want to have a similar blueprint for customizing load balancers (K8s Ingress/Service) and potentially other resources?

Disclaimer: Everything I've said above I've mostly come at from a Kubernetes angle, we should also think about these problems and decisions from other angles. One of Nelson's biggest strengths is being scheduler-agnostic and I think we should maintain that goal as much as possible.

P.S. Our K8s workflow is Canopus, Magnetar is Nomad! 😛

This is interesting but I see a couple of issues. Nelson works today by transforming the manifest in very specific ways, i.e. if port default->8080/http is declared in the manifest then the user can be sure it will be exposed on the container. If one uses a template this guarantee is void, because there is no guarantee that the teamplate properly handled it exposing the port. I think this is what you were trying to get at with validating the template, but I'm not convinced it can be done in practice. It's also not clear how or if templates need to be applied to different types of deployments, jobs or services.

Taking a step back, templates seem to be simliar to plans in that they describes the how of a deployment. I'm not sure yet how that's useful yet, but something that struck me while reading the RFC.

I'll continue to think about this one.

@kaiserpelagic @adelbertc Excellent thoughts, thanks for taking the time to read this.

By far and away the thing that stands out to me most is this impedance miss-match with the plan, as @kaiserpelagic logic mentioned, and @adelbertc expanded on with the point about Deployment vs Job vs CronJob (in k8s land). This is indeed something we will have to make a design trade off on and the solution will not be perfect. In all software development, we make trade offs between safety and something else (often performance or convenience). In this case, I think the long term viability of our current approach is not good, even though it is strictly the safest implementation where we know we have good guarantees about how things will be supplied to the runtime. At the moment i'm thinking @kaiserpelagic is correct that blueprints do not belong on units, and instead should be bound at plan time, but conversely, that the location we presently have workflow in the manifest is incorrect, as he points out (correctly) that the plan is the "how" of Nelson.

[...]
units:
  - name: foobar
    description: description for the foo service
    ports:
      - default->9000/http
    dependencies:
      - ref: inventory@1.4
    meta:
      - foobar
      - buzzfiz

plans:
  - name: dev-plan
    cpu: 0.25
    memory: 2048
    workflow:
      kind: magentar
      blueprint: use-nvidia-1080ti@HEAD
[...]

If we do this, we gain:

The ability to have separate blueprints for discrete types of task (job, cronjob, deployment, daemonset etc)
Ability to deploy units to different schedulers (stage vs prod); im not sure if this is even a good idea in practice, but its doable here.

However, we compromise on:

Ultimately we cannot validate templates for usage of our entire template gramer (whatever that is), because it might not be valid for the use-case to use all things. This does somewhat - potentially - subvert the nature of plans. However, given blueprints are not a "user" primitive, but, rather, an admin one, I would hope that admins would be sufficiently incentivized to provide the right knobs to users (e.g. the size of the deployment).

@adelbertc onto your specific items:

Another way might be to only allow blueprints to control the Pod-specific bits of the configuration, since (I think) Deployments, Jobs, and CronJobs all have the Pod specification as a subset of their spec. This is flexible enough to allow for things like exposing ports, sidecars, healthchecks, but (cron) job specific stuff like backoff limits, completions, parallelism would be left out.. and those seem quite useful.

I'm actually against this, because otherwise we're locked into a certain implementation details around pods. Perhaps you wanted to use just services for example. Need to think about this further, but there is a discrete and very deliberate trade off we need to make in the power we're exposing.

kubectl ostensibly has a --validate flag that can be used with --dry-run to, dare I say, type check the object configuration. However looking at the K8s API docs there does not seem to be an API endpoint to hit to do validation.. unsure how kubectl itself is doing it.

AFAIK, kubectl is part fat-client, and they do client-side validation that actually isn't that good :-) I've been using kubeval in place of it because its a better linter.

👍 💯 on having a clean separation of "what" (unit) and "how" (plan) in the manifest.

I've noticed the strangeness of having the workflow being per-unit - I think we've been getting away with it because we've only ever had one workflow (per scheduler) and so even if it were per-plan functionally there wouldn't be a difference. This of course is not a safe assumption and we should definitely move it in v2.

Re: the 2 methods I suggested on the impedance mismatch - I do not like either of them as well, but figured I'd throw them out there for discussion's sake 😄 Moving the blueprint and workflow under 'plan' is much nicer.

Re:

I think the long term viability of our current approach is not good, even though it is strictly the safest implementation where we know we have good guarantees about how things will be supplied to the runtime

To this point, at Target we have a job which has a dev plan of schedule: once and a prod plan of schedule: "0 6 * * *", the idea being when we first do a release we want to see it run immediately to see if it succeeds, at which point we can promote it to stage and prod where it is cron-scheduled.

Because of the strictness of the current manifest, we have some level of confidence that the deployment configuration of both the Job and CronJob is identical, modulo the schedule. I imagine the same argument can be made for most other workloads, where the dev and prod configuration is largely the same modulo some minor details. In the proposed, more flexible v2 manifest, we lose this confidence.

That being said, I present this use case just as an observation - applying the end-to-end argument suggests we should favor allowing it to be configured per-plan. If users want to re-gain this functionality it can be implemented with an external tool/lint step, whereas if we retain the stricter manifest there is no way out.

So it sounds like we're converging on a plan here:

workflow moves to plan and blueprint go to plan. The manifest will take a major revision hit, and we will remove the support for per-unit workflow specifications.
blueprints will be discrete for different job types. It's possible that people will screw up the intersection of behaviors, but ultimately end users have to rely on operators doing the right thing. We are OK with the power/complexity trade-off.
blueprints will have its own API which will be append-only and revisioned. Most users will depend on HEAD revision for a given blueprint.
Nelson will ship with a small handful of default blueprints (these will be extracted from the existing scheduler support implementations). These will be bare bones and only cover simplistic use cases.
When creating a new blueprint, we do not know how to validate it - its possible we could try to deploy something with it, but this is potentially fraught with problems. For the moment, we will transition directly from pending to active as a NoOp, allowing for enhancement later.

Does this sound accurate?

getnelson / nelson

RFC: blueprints for scheduler API interfaces #79

Summary

Blueprints

Implementation

States

Manifest Syntax

Open Questions

Blueprints

States of the blueprint lifecycle

Questions

Other questions I have