MinaProtocol / mina

Mina is a cryptocurrency protocol with a constant size blockchain, improving scaling while maintaining decentralization and security.
https://minaprotocol.com
Apache License 2.0

Investigate an incremental deploy approach for sidecar-services #4946

Open bkase opened 4 years ago

bkase commented 4 years ago

Problem

We don't support incremental deployments in our current infrastructure setup. Concretely, this means that if we want to tweak the data we persist to Google Cloud Storage for our points service, for example, we have to tear down and redeploy the whole network. We've wanted to do hotfixes like the one in the example a few times this week, and it is wasting time for our protocol engineers.

We also don't want to invest in too many more changes to our current helm-template setup, as we intend to rewrite it in Dhall-helm.

Potential Solution

@nholland94 suggested a solution where we mostly change only terraform (and barely touch the helm) by taking advantage of Google Cloud's policy of auto-updating any `*:latest` container: essentially, what we usually put in the tag (service, commit hash, etc.) we could instead put in the image name, and then a hotfix is just pushing a new container to `:latest` under that image name.
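As a sketch of how that could look in a pod spec (the registry path and image name below are hypothetical, and this assumes `imagePullPolicy: Always` so the pod re-pulls `:latest` on restart):

```yaml
# Hypothetical sketch: the service name and commit hash move into the
# image *name*, while the tag is pinned to :latest, so re-pushing the
# image is a hotfix that needs no helm or terraform changes.
apiVersion: v1
kind: Pod
metadata:
  name: points-service
spec:
  containers:
    - name: points
      # image name encodes service + commit; the tag stays :latest
      image: gcr.io/example-project/points-service-abc1234:latest
      # always re-pull, so a hotfix pushed to :latest is picked up
      # on the next pod restart
      imagePullPolicy: Always
```

A hotfix would then be a push to that `:latest` image name plus a pod restart, with the manifest itself untouched.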

@yourbuddyconner's solution seems like a good approach!

Artifact

A small writeup suggesting a solution in the short term or rationale for waiting until we've fully ported everything to Dhall.

Clarification

We do not want to abuse this capability to change the sorts of networks we're testing on without doing a redeploy. During QA periods, it should only be used for sidecar services. In production networks, we may need to use this ability to roll a hotfix incrementally to prevent any downtime.

emberian commented 4 years ago

The philosophical problem with this is that then we're not actually testing what we're releasing anymore. We've introduced bugs before that are only present at network launch or during the first k blocks. Hotfixes invalidate the testing methodology. I guess that's not an argument for disallowing hotfixes, but instead for disallowing hotfixed networks to be considered a greenlight for releasing.

bkase commented 4 years ago

> The philosophical problem with this is that then we're not actually testing what we're releasing anymore. We've introduced bugs before that are only present at network launch or during the first k blocks. Hotfixes invalidate the testing methodology. I guess that's not an argument for disallowing hotfixes, but instead for disallowing hotfixed networks to be considered a greenlight for releasing.

100% agree for daemon-related hotfixes. Many hotfixes to sidecar services (for example, changing the mechanism by which the points service pushes to Google Cloud Storage buckets, or the rate at which our txn agents produce txns) are mostly independent of network correctness.

I will make the issue title and description more explicit to make it clear that this (deploying side-car services without restarting the network) is the problem we want to solve, and not updating daemons.

yourbuddyconner commented 4 years ago

> The philosophical problem with this is that then we're not actually testing what we're releasing anymore.

Big this.

I agree with the need to be able to support updates of sidecars and services. This would be easy if we hadn't shoehorned the faucet/bots as sidecars into the block producer chart, and had instead put them in their own helm charts.

I also agree with the need to hotfix production networks; however, this deployment model currently has no differentiation between QA and prod infrastructure, much less incremental deploys. As a result, I think any consideration as to the format and structure of the deployment should be deferred until the question of environments is answered.

That said, in a QA context, which is what the current deployment model was designed to be, hotfixing a network is functionally equivalent to redeploying it from genesis. And like @emberian said, it's very important that we do any validation on a hotfix using a fresh network.


In the short term, I disagree with the characterization that we shouldn't invest in our current helm architecture; in fact, there's probably an intermediate step here between vanilla-helm and when Dhall-helm lands.

We should do the following:

  • Refactor the current helm charts such that services exist as separate deployments and can be deployed separately to testnet infrastructure.
  • Helm charts should support rolling updates; this will involve refactoring to StatefulSet for at least the block producers, but maybe other coda deployables as well.
  • Terraform should just pass variables and glue things together, and helm will do the heavy lifting in regards to making the deployment match the manifest.

This unravelling of the helm charts will reduce overall chart complexity, and from there we will have a solid framework of functionality to build our Dhall libraries on. There will undoubtedly be some wrangling of go templates, but the overall reduction in complexity will be worth it.
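As a rough sketch of what pulling a sidecar out into its own deployment could look like (chart path, names, and values are all hypothetical), the points service would get its own chart and template, so it can be upgraded with a plain `helm upgrade` and no network teardown:

```yaml
# Hypothetical sketch: charts/points-service/templates/deployment.yaml
# With the points service in its own chart, a hotfix is just a
# `helm upgrade` of this one release; the block producers are untouched.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-points-service
spec:
  replicas: 1
  selector:
    matchLabels:
      app: points-service
  template:
    metadata:
      labels:
        app: points-service
    spec:
      containers:
        - name: points
          # image is supplied via values.yaml, so terraform only
          # passes variables through, per the plan above
          image: "{{ .Values.points.image }}"
```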

bkase commented 4 years ago

> This unravelling of the helm charts will reduce overall chart complexity, and from there we will have a solid framework of functionality to build our Dhall libraries on. There will undoubtedly be some wrangling of go templates, but the overall reduction in complexity will be worth it.

Great! I didn't realize we had an intermediate step here. This seems like a good approach. I'm glad we're not blocked by the dhall transition for this sort of solution.

> Refactor the current helm charts such that services exist as separate deployments and can be deployed separately to testnet infrastructure.

It seems like this is sufficient to solve our current iteration issues in the short term, and it is an atomic step towards the world in which we use StatefulSets for the block producers to enable prod hotfixes. In other words, these are two different tasks, and we could do a release with only the former, for example. Am I missing anything there?
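The StatefulSet step mentioned above could look roughly like this (a minimal sketch with hypothetical names; volumes, keys, and config are omitted):

```yaml
# Hypothetical sketch of a block producer StatefulSet supporting
# incremental updates: with a RollingUpdate strategy, pods are
# replaced one at a time, so the network never loses every
# producer at once during a prod hotfix.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: block-producer
spec:
  serviceName: block-producer
  replicas: 3
  updateStrategy:
    type: RollingUpdate   # replace pods one by one on spec change
  selector:
    matchLabels:
      app: block-producer
  template:
    metadata:
      labels:
        app: block-producer
    spec:
      containers:
        - name: coda
          # illustrative image reference only
          image: example-registry/coda-daemon:hotfix-1
```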

Of course, I imagine we won't want to include either of these changes in this upcoming release, and we'll want both the separated deployments and rolling-update support for block producers for the one afterward.

yourbuddyconner commented 4 years ago

> Am I missing anything there?

No, that's a correct assessment. If you're mostly concerned with deploying updates, you could split up this work, though in theory it'd be relatively easy to get everything done at once.

O1ahmad commented 4 years ago

> I agree with the need to be able to support updates of sidecars and services, this would be easy if we hadn't shoehorned the faucet/bots as sidecars in the block producer chart and instead put them in their own helm charts. ...

Totally understand the initial consolidated implementation here (due to "KISS", "YAGNI", etc. principles), but agreed for sure; this also fits in with the currently in-progress effort to create separate Helm charts for different daemon functions (seed-node vs. snark-worker vs. archive-node).

> We should do the following:
>
>   • Refactor the current helm charts such that services exist as separate deployments and can be deployed separately to testnet infrastructure.
>   • Helm charts should support rolling updates; this will involve refactoring to StatefulSet for at least the block producers, but maybe other coda deployables as well.
>   • Terraform should just pass variables and glue things together, and helm will do the heavy lifting in regards to making the deployment match the manifest.

Preach it! Agreed on all of the above, @yourbuddyconner.

O1ahmad commented 4 years ago

Also, it would be pretty sweet (from both a general usage and a product marketing and documentation perspective) to get our Helm charts released to Helm Hub, and perhaps to provide a more easily accessible/digestible package of Coda components for random interested parties to toy with or :hammer: on.