timperrett closed this issue 6 years ago
@adelbertc @stew this is a distillation of my thinking from lunch today. Also adding @okoye - he might have some thoughts on this from his work on Spinnaker.
As an additional data point, Spinnaker actually did go down the path of maintaining their own object model and wrapping everything: https://www.spinnaker.io/reference/providers/kubernetes-v2/ - what's interesting here is that they end up with some very specific naming conventions and things they expect the deployment to look like, essentially falling foul of the problems this RFC looks to avoid. If we can learn from their experience at all, that would be great.
This all looks great to me, and sounds much better than the alternative of proxying the gajillion different knobs Kubernetes gives you.
As an observation, it seems v1 of Spinnaker's K8s Provider was where v1 of Nelson's manifest was heading (see #69 and #78), and v2 is similar to this proposal. One difference is Spinnaker gives service owners essentially full access to their namespaces, allowing them to specify the K8s spec in full and also allowing them to delete workloads/objects. In contrast Nelson explicitly only allows creation (workflow) and querying (Nelson CLI), but restricts deletion since Nelson wants to control the lifecycle to make immutable deployments manageable.
paging @kaiserpelagic for his thoughts also
Some more thoughts:
In the above manifest you have the blueprint on a per-unit basis instead of a per-plan basis. In the context of Kubernetes, the JSON/YAML differs depending on whether you're creating a Deployment vs. a Job vs. a CronJob. Since the plan between environments can differ (for instance at Target we have a dev plan that is `schedule: once` (Job) and a prod plan that is `schedule: "0 6 * * *"` (CronJob)), the blueprint has to be tied to the plan.
One way around this is to require blueprints to have definitions for all three of Deployments, Jobs, and CronJobs, and interpolate depending on the plan. This seems very cumbersome and perhaps too restrictive - I do not think it would be unreasonable to have a blueprint that only makes sense for certain workloads, such as a blueprint for workloads that need GPUs that only (Cron)Jobs can use.
Another way might be to only allow blueprints to control the Pod-specific bits of the configuration, since (I think) Deployments, Jobs, and CronJobs all have the Pod specification as a subset of their spec. This is flexible enough to allow for things like exposing ports, sidecars, and healthchecks, but (cron) job specific stuff like backoff limits, completions, and parallelism would be left out, and those seem quite useful.
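To make the second option concrete, here is a rough sketch (the `{{...}}` placeholder syntax and names are purely illustrative): a Deployment embeds a pod template at `spec.template` and a CronJob at `spec.jobTemplate.spec.template`, so a blueprint scoped to just that fragment could be spliced into either kind of workload.

```yaml
# Hypothetical pod-template blueprint fragment; placeholder names are
# illustrative, not part of the RFC.
metadata:
  labels:
    stack: "{{stackName}}"
spec:
  containers:
    - name: "{{unitName}}"
      image: "{{image}}"
      ports:
        - containerPort: 9000
# A Deployment would splice this in at spec.template, a CronJob at
# spec.jobTemplate.spec.template - but Job/CronJob-only knobs such as
# schedule, backoffLimit, completions, and parallelism live outside the
# pod template, so a pod-scoped blueprint could never reach them.
```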
More thoughts to come..
> `deprecated`: the relative "dead" state for a given blueprint. As operators we can never guarantee that a given blueprint is never used anywhere in an ecosystem; this state is for book-keeping and indicating to users that they should upgrade.
Related to this, your comment in the Questions section about the Audit API suggests Nelson will do bookkeeping on blueprint states not unlike what it does for stack lifecycles. If we decide to move forward with such bookkeeping, and we track what stack deployments use which blueprints, we can do the same thing we do with deprecated stacks and disallow new deployments to depend on deprecated blueprints. Things are also slightly simplified here since no cleanup is needed.
One use case for wanting to do this bookkeeping is if a blueprint deploys a sidecar which is later deemed to have a vulnerability or something - it would be useful for admins to be able to not only deprecate the blueprint to prevent further deployments using it, but also figure out what deployments are currently using it.
`kubectl` ostensibly has a `--validate` flag that can be used with `--dry-run` to, dare I say, type check the object configuration. However, looking at the K8s API docs there does not seem to be an API endpoint to hit to do validation - unsure how `kubectl` itself is doing it. This of course is a Kubernetes-specific point; similar validation for say Nomad or Mesos would have to depend on the toolchain they provide.

Disclaimer: Everything I've said above I've mostly come at from a Kubernetes angle; we should also think about these problems and decisions from other angles. One of Nelson's biggest strengths is being scheduler-agnostic and I think we should maintain that goal as much as possible.
P.S. Our K8s workflow is Canopus, Magnetar is Nomad!
This is interesting but I see a couple of issues. Nelson works today by transforming the manifest in very specific ways, i.e. if port `default->8080/http` is declared in the manifest then the user can be sure it will be exposed on the container. If one uses a template this guarantee is void, because there is no guarantee that the template properly handled exposing the port. I think this is what you were trying to get at with validating the template, but I'm not convinced it can be done in practice. It's also not clear how or if templates need to be applied to different types of deployments, jobs, or services.
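A hypothetical pair of blueprint fragments (placeholder syntax illustrative) shows how the guarantee evaporates: nothing forces the template author to actually reference the declared ports.

```yaml
# Fragment A interpolates the port Nelson knows about, so the manifest's
# `default->8080/http` declaration is honored (placeholder illustrative):
ports:
  - containerPort: "{{defaultPort}}"
---
# Fragment B hard-codes a port, so the manifest declaration is silently
# ignored and the guarantee is void:
ports:
  - containerPort: 9999
```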
Taking a step back, templates seem to be similar to `plan`s in that they describe the how of a deployment. I'm not sure how that's useful yet, but it's something that struck me while reading the RFC.
I'll continue to think about this one.
@kaiserpelagic @adelbertc Excellent thoughts, thanks for taking the time to read this.
By far and away the thing that stands out to me most is this impedance mismatch with the `plan`, as @kaiserpelagic mentioned, and @adelbertc expanded on with the point about Deployment vs Job vs CronJob (in k8s land). This is indeed something we will have to make a design trade-off on, and the solution will not be perfect. In all software development, we make trade-offs between safety and something else (often performance or convenience). In this case, I think the long-term viability of our current approach is not good, even though it is strictly the safest implementation, where we know we have good guarantees about how things will be supplied to the runtime. At the moment I'm thinking @kaiserpelagic is correct that blueprints do not belong on units, and instead should be bound at plan time - but conversely, that the location we presently have `workflow` in the manifest is incorrect, as he points out (correctly) that the `plan` is the "how" of Nelson.
```yaml
[...]
units:
  - name: foobar
    description: description for the foo service
    ports:
      - default->9000/http
    dependencies:
      - ref: inventory@1.4
    meta:
      - foobar
      - buzzfiz

plans:
  - name: dev-plan
    cpu: 0.25
    memory: 2048
    workflow:
      kind: magnetar
      blueprint: use-nvidia-1080ti@HEAD
[...]
```
If we do this, we gain:
However, we compromise on:
@adelbertc onto your specific items:
> Another way might be to only allow blueprints to control the Pod-specific bits of the configuration, since (I think) Deployments, Jobs, and CronJobs all have the Pod specification as a subset of their spec. This is flexible enough to allow for things like exposing ports, sidecars, healthchecks, but (cron) job specific stuff like backoff limits, completions, parallelism would be left out, and those seem quite useful.
I'm actually against this, because otherwise we're locked into certain implementation details around pods. Perhaps you wanted to use just services, for example. I need to think about this further, but there is a discrete and very deliberate trade-off we need to make in the power we're exposing.
> `kubectl` ostensibly has a `--validate` flag that can be used with `--dry-run` to, dare I say, type check the object configuration. However looking at the K8s API docs there does not seem to be an API endpoint to hit to do validation - unsure how `kubectl` itself is doing it.
AFAIK, `kubectl` is part fat-client, and they do client-side validation that actually isn't that good :-) I've been using kubeval in place of it because it's a better linter.
Big +1 on having a clean separation of "what" (unit) and "how" (plan) in the manifest.
I've noticed the strangeness of having the workflow be per-unit - I think we've been getting away with it because we've only ever had one workflow (per scheduler), so even if it were per-plan there wouldn't functionally be a difference. This of course is not a safe assumption and we should definitely move it in v2.
Re: the 2 methods I suggested on the impedance mismatch - I do not like either of them either, but figured I'd throw them out there for discussion's sake. Moving the blueprint and workflow under `plan` is much nicer.
Re:
> I think the long term viability of our current approach is not good, even though it is strictly the safest implementation where we know we have good guarantees about how things will be supplied to the runtime
To this point, at Target we have a job which has a dev plan of `schedule: once` and a prod plan of `schedule: "0 6 * * *"`, the idea being that when we first do a release we want to see it run immediately to see if it succeeds, at which point we can promote it to stage and prod where it is cron-scheduled.
Because of the strictness of the current manifest, we have some level of confidence that the deployment configuration of both the Job and CronJob is identical, modulo the schedule. I imagine the same argument can be made for most other workloads, where the dev and prod configuration is largely the same modulo some minor details. In the proposed, more flexible v2 manifest, we lose this confidence.
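A sketch of what that looks like in manifest terms (field placement is illustrative of the shape described above, not exact v1 syntax):

```yaml
# Same unit, two plans: dev runs once as a Job so a release can be
# observed immediately; prod runs on a cron schedule as a CronJob.
plans:
  - name: dev
    schedule: once
  - name: prod
    schedule: "0 6 * * *"
```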
That being said, I present this use case just as an observation - applying the end-to-end argument suggests we should favor allowing it to be configured per-plan. If users want to re-gain this functionality it can be implemented with an external tool/lint step, whereas if we retain the stricter manifest there is no way out.
So it sounds like we're converging on a plan here:

- `workflow` and `blueprint` move to `plan`. The manifest will take a major revision hit, and we will remove the support for per-unit workflow specifications.
- Users can reference either a pinned revision or the `HEAD` revision for a given blueprint.
- Implement the transition from `pending` to `active` as a NoOp, allowing for enhancement later.

Does this sound accurate?
LGTM
LGTM
Summary
With the growing complexity of prevalent scheduling interfaces - Kubernetes being the obvious, bloated example - it no longer makes sense for Nelson to continue to wrap and hide the details of those scheduling systems in their entirety. Those options exist for good reason, but exposed wholesale those interfaces overwhelm users with complexity and wind up making easy things hard, and hard things easier (note: not "easy"). This is an unacceptable trade-off in developer experience, and whilst the author recognizes that someone must deal with this complexity, cascading it through an organization under the flag of so-called DevOps is not the way to do things. Instead, we must look for better interfaces - more optimal trade-offs.

In this frame, the author has been reflecting on the proposal in https://github.com/getnelson/nelson/issues/68. That proposal was the first step in considering how to support more sophisticated, broader use-cases, and thus enable adoption of Nelson by a greater number of users, but was in and of itself quite limited. The addition of sidecar routing would have enabled more options for routing topologies, but what about voluming, or node selection for hardware specialization? Many cases would still be left unsolved, and the manifest would have adopted a range of additional boilerplate, providing indirection but not abstraction. The power of Nelson for users comes from the fact that it enables the developer to think solely about their application - this is a property we should strive to retain and prioritize as a north star.
The author would like to propose that no organization can succeed with zero people who understand the scheduling interfaces that they have elected to use. If we assume that the number of staff who understand said interfaces is greater than zero, then we can assume that there is an agent who is able to configure so-called "blueprints".
Blueprints
Every submission to a scheduler has to make certain assumptions: perhaps it's something simple like what sidecar you might use for log extraction, or maybe it's something more sophisticated around how you handle routing or security... whatever it is, there is an agent - an operator / admin / engineer - who understands and knowingly makes those trade-offs. What I would like to propose is that we leverage this fact, and provide the ability for a workflow to take a "Blueprint", where Nelson will essentially execute a transformation on the input template, substituting values and transposing in the Nelson values (more details below), then sending the template to the scheduler. An example might be (for a k8s config):
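As a sketch of what such a blueprint might look like for a k8s Deployment (the `{{...}}` placeholder syntax and names here are illustrative assumptions, not part of the RFC):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  # Nelson would transpose stack naming at deployment time
  name: "{{stackName}}"
  namespace: "{{namespace}}"
spec:
  replicas: {{desiredInstances}}
  selector:
    matchLabels:
      stack: "{{stackName}}"
  template:
    metadata:
      labels:
        stack: "{{stackName}}"
    spec:
      containers:
        - name: "{{unitName}}"
          image: "{{image}}"
          ports:
            - containerPort: {{defaultPort}}
```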
Whilst this is only a potential example, it should hopefully illustrate the general idea. An administrator seeds Nelson with this blueprint, and then at runtime Nelson fuses that scheduler-specific template with a smattering of information from the manifest. This could mean that a Nelson manifest then looks something like this:
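A sketch of such a manifest, reusing the blueprint name from the discussion (and attaching the blueprint to the unit, as the RFC originally proposed; exact field placement is an assumption):

```yaml
units:
  - name: foobar
    description: description for the foo service
    ports:
      - default->9000/http
    blueprint: use-nvidia-1080ti@HEAD
```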
Here we see the ability to specify a blueprint, and a revision of that blueprint, to be used with the workflow loaded by name. This would mean that at execution time the workflow would load the blueprint and fuse the Nelson data before sending it to the scheduler defined in the workflow (specification of which is out of scope for this RFC).
Implementation
Everything in Nelson was very consciously versioned. We strove for this even from the early days of the project and it has helped immensely, as it has meant we could largely use append-only storage, ignoring mutability and a wide variety of race conditions. The one place that we do not do this is in the admin-operator supplied configuration, which requires a reboot of the Nelson system. At the time of writing, this allows the operator to essentially change things - mutably - about the way deployments happen. Prior to Nelson becoming open-source this actually tripped us up a few times when we broke the log exporters, and then had a set of deployments that were not exporting logs, unbeknownst to their users who had not changed anything. This is a crappy experience. With this frame, we should consider the addition of a blueprints API where administrators can add blueprints by name, and we generate a revision against that blueprint. Submitting a blueprint might look something like this:
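As a sketch of a submission payload (the field names and transport are assumptions; the RFC leaves the exact API shape open):

```yaml
name: use-nvidia-1080ti
description: run workloads on GPU-enabled nodes
template: |
  # scheduler-specific template body, with {{...}} placeholders
  # to be fused with manifest data at deployment time
```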
And Nelson might respond with something like:
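A sketch of the shape such a response might take (the fields are assumptions, aside from the `pending` state defined below):

```yaml
name: use-nvidia-1080ti
revision: 1
state: pending
```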
The response will indicate that Nelson accepted the blueprint and that it is pending validation. What does `pending` mean? Adding a new blueprint does not mean that it is ready for use - as the operator could have essentially changed / broken anything, it should be validated before being made available.

States
The author proposes the following blueprint states:

- `pending`: the blueprint is new, and awaiting validation
- `validating`: presently validating that this new blueprint does not result in error when submitted. Exactly how this should work is up for discussion, but it feels needed to prevent immediate breakages for end-users.
- `active`: the normal state for a given blueprint revision
- `deprecated`: the relative "dead" state for a given blueprint. As operators we can never guarantee that a given blueprint is never used anywhere in an ecosystem; this state is for book-keeping and indicating to users that they should upgrade.
- `invalid`: the blueprint failed validation and is not available for use

Manifest Syntax
Additionally, the author also proposes the following options for manifest syntax, in an effort to mitigate drift for the various Nelson manifests across repos:

- For systems that need special vetting or have business-critical impact, fixing to a specific revision of the deployment blueprint could be highly advantageous.
- A `HEAD` reference to the revision (for the less change-sensitive users). When using `HEAD`, the latest `active` state revision of a given blueprint will be used. In general, this should be fine for the majority of users working on non-edge systems.

Open Questions
- When a new blueprint is submitted, how do we go about checking that it works at all?
- When Nelson first boots up, do we have zero blueprints? What is the on-boarding flow?
- Should users only ever depend on `HEAD` of a blueprint? What's the upgrade flow?
- Audit API to say which blueprints are related to which units