att-comdev / promenade

This project has moved to OpenStack.
https://www.airshipit.org/
Apache License 2.0

Host Validation Doc #33

Closed. aric49 closed this 6 years ago

aric49 commented 7 years ago

A document outlining proposed host validation tests.

gardlt commented 7 years ago

Is this supposed to be a feature enhancement proposal?

aric49 commented 7 years ago

@gardlt Kinda. This came out of a conversation @mark-burnett and I had yesterday about the best way to tackle smoke testing to validate that hosts come up properly during a Promenade Kubernetes deployment. The goal of this document is to clarify what will be in scope for Promenade versus other tools like Armada, Prometheus, or other configuration management platforms that may run in conjunction with Promenade. The idea is to get the conversation started and iterate on it.

alanmeadows commented 7 years ago

One further thought.

We should evaluate some of the code in the Kubernetes CI/CD gating system for this purpose. In addition to tests for very specific code paths and complex tests that guard against breaking edge cases, they likely also have tests for some of the high-level elements we're attempting to validate here.

We should explore what they have and how easily our self-contained version could benefit from the tests they've already outlined.

At the very least, much of what they test can educate us on what we need to evaluate. For example, this document isn't low level enough to suggest, but eventually I expect this component (or external project) to articulate use cases along the lines of:

https://github.com/kubernetes/kubernetes/blob/master/test/e2e_node/kubelet_test.go#L40-L71

I am not suggesting that we recreate all of the integration tests Kubernetes already has, but tests like this help validate basic operations with the full stack (our k8s, our docker version, our operating system configuration, where we are storing docker logs, etc.) in running environments and, most importantly for us in these larger environments, on a per-node basis, to catch aberrant behavior earlier rather than later.

mark-burnett commented 7 years ago

@intlabs @wilkers-steve @aric49 @alanmeadows @v1k0d3n

CC: @bryan-strassner

I agree with a lot of the good points made here, especially regarding the focus of responsibility of particular services/code.

It feels like the three main categories of validation that have come up are:

  1. Host configuration validation
    • Presence/content of particular files on nodes
    • Versions of installed packages (in particular docker since Promenade is currently installing that directly, as called out by @alanmeadows)
  2. Overall cluster validation
    • Exercising the Kubernetes API/use some Kubernetes integration tests
    • Reachability among pods running on all nodes to one another
    • Ability to allocate cluster resources to all (appropriate) nodes, e.g. PVs
  3. Critical application validation
    • Primarily etcd cluster health?
    • Other applications Drydock, MaaS, Ceph, Airflow, etc.?
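To make category 1 a bit more concrete, here is a minimal sketch of a "presence/content of particular files" check. It is only an illustration, not anything Promenade does today: the function name `check_files` and the shape of the `expectations` mapping are assumptions made up for this example.

```python
from pathlib import Path


def check_files(expectations):
    """Return a list of failure messages (empty list means all checks passed).

    `expectations` is a hypothetical mapping of file path -> substring that
    must appear in the file, or None to require only that the file exists.
    """
    failures = []
    for path, needle in expectations.items():
        p = Path(path)
        if not p.is_file():
            failures.append(f"{path}: missing")
            continue
        # Read leniently so an unexpected encoding does not abort the run.
        content = p.read_text(errors="replace")
        if needle is not None and needle not in content:
            failures.append(f"{path}: expected content not found: {needle!r}")
    return failures
```

Returning a list of failures rather than raising on the first one means a single run reports every problem found on a node, which seems more useful when checking many hosts.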

It actually seems that each of these categories could reasonably be delegated to a different component. There's no real reason to expect a general cluster validation tool to check packages and files on nodes. Likewise, while tools like Ansible are pretty good at category 1 tasks, I don't really love the idea of writing an Ansible module to check etcd health/configuration.
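One way to picture that separation is a small registry where checks are grouped by category and each category can be run by whichever component owns it. This is a rough sketch only; the names `CHECKS`, `check`, and `run`, and the category labels used with them, are hypothetical and not part of Promenade.

```python
from collections import defaultdict

# Hypothetical registry: category name -> list of check functions.
CHECKS = defaultdict(list)


def check(category):
    """Decorator that registers a check function under a category."""
    def register(fn):
        CHECKS[category].append(fn)
        return fn
    return register


def run(category):
    """Run every check in one category and report per-check results.

    A check passes if it returns without raising; any exception is
    recorded as a failure so the remaining checks still run.
    """
    results = {}
    for fn in CHECKS[category]:
        try:
            fn()
            results[fn.__name__] = "ok"
        except Exception as exc:
            results[fn.__name__] = f"failed: {exc}"
    return results
```

With something like this, a host-level tool would only call `run("host")`, while a cluster-level tool would call `run("cluster")`, so neither needs to know about the other's checks.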

To me the questions become:

  1. How should we prioritize these different aspects of validation? It seems to me that category 2 is more likely to expose serious problems and probably should be implemented first. On the other hand, the story @aric49 was working on seemed to be more about category 1 (I might be misinterpreting the intent).
  2. How much should we worry about separating these concerns now vs. later?
  3. Will other applications need to initiate these various kinds of checks besides Promenade at cluster formation?

mark-burnett commented 6 years ago

Closing as we have moved to gerrithub.