Improving deployment process for infrastructure #420

Open brianraymor opened 5 years ago

This is an informational epic to track work associated with improving the deployment process. Objectives described in this ticket have actionable epics associated with them.

The work described in this epic was prioritized in Q3 but slipped to Q4. The body of work was large in Q3 and, given the volume of other work, was perhaps too large for the quarter without focused, uninterrupted effort. Tickets are on track for completion during Q4 as MUST HAVE work, with the exception of objective 4 (stretch goal), which has been prioritized as NICE TO HAVE work in Q4.

User Story

As a DCP developer, I can promote a change through the full deployment process, which includes deployment across relevant environments, with minimal risk of causing downstream breakages so that day-to-day development is not a burden.

Demoable Criteria

GitLab contains a deployment for each DCP component; deployment runbooks exist for each component.

Success Criteria

(1) All deployments are done on GitLab. This will capture all deployment results in one tool. By virtue of having all deployments done on GitLab, we will have logged deployment results. dcp/77
(2) All configuration changes are in GitLab and no configuration is being done manually (product of dcp/77).
(3) We have runbooks describing the deployment process to anyone who would try to deploy any component. dcp/68
(4) (stretch) We have code coverage for our unit and integration tests so we can identify our holes in testing. This will be necessary if we want to start moving towards a more CI/CD setup. This work ~is a stretch goal for Q3, so may shift to Q4 work~ is now considered NICE TO HAVE in Q4 but is not prioritized. It may or may not be completed during Q4. dcp/524

I believe this issue requires considerable discussion in tech-arch as to what it practically means and its scope.

Well, let's start with: what are the issues with the current deployment process? I'm not saying there aren't any, but we do need a clear description of the areas that need improvement and the problems that we are trying to overcome. Perhaps whoever thought up this epic in the first place could take a stab at that.

For one, there do not appear to be any instructions on how to spin up my own, private test DCP. At least nothing that can be found in the "DCP Table of contents".

So the first issue is it is not clear how the system is deployed

make test-dcp

@diekhans that's a fun one. It sounds like a good idea, but I don't think the cost-benefit analysis bears out. It would take an immense amount of work and may not even be feasible with our current PaaSes and 3rd party integrations (think Cromwell). I have created instructions that allow you to create your own Upload Service, but even that alone requires a fair number of manual steps (registering domain names, creating new SSL certificates, etc.), and Upload is probably the simplest service. If we had a homogeneous architecture (say, every component was just a bunch of Docker images) then this might be possible, but our reality today is very complex.

I think a good goal might be: any sufficiently technically savvy person (who is not on the component team in question), with sufficient authorizations in the appropriate PaaSes, can follow a set of instructions to bring up a component, as a new environment within the *.data.humancellatlas.org domain. This is a first step in basic disaster recovery / business continuance.

I find this ticket confusing. The title says 'deployment', which to me meant deploying a full environment (as also said in the success criteria). Yes, there could be a really long doc with a huge number of manual steps, but I don't think this is going to be that useful except as the first necessary step to something more automatic: it's very quickly going to fall out of date and nobody will ever bother updating it. An individual developer being able to deploy everything would be a dream, but it is almost certainly not achievable in Q3. I think this is a good argument for spending time working towards a more homogeneous architecture. This will require a lot of CI/CD work for some components (e.g. ingest) but frankly is something that needs to be done sooner rather than later.

On the other hand, other parts of the description can be interpreted as just talking about deployment of a particular change.

I think we need a discussion in tech-arch with Sarah of what we want in this area in Q3 and alter the ticket to reflect that, rather than trying to interpret some Bible text which I think Sarah told me she didn't write herself :). Conveniently she has already scheduled a slice of time for at least agreeing an approach to properly discuss in the Friday tech-arch.... I think this is also another strong argument for having a dedicated system team or at least dedicated people time.

Maybe we have circled back to the shared machine room where you sign up for exclusive computer time?

So we could have a set of test DCPs that we reserve for a period of time and a program to reset the state.

I think even being able to test ingest to data browser without analysis would be helpful. Whether it is cost-effective is another matter.
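
A rough sketch of what such a reservation scheme might look like, purely for illustration. The environment names, user names, and the `ReservationPool` class are hypothetical, and the state reset is only marked by a comment since how to reset a DCP is exactly the open question here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict, Optional, Tuple


@dataclass
class ReservationPool:
    """Tracks which shared test environment is held by whom, and until when."""

    environments: Tuple[str, ...]
    _held: Dict[str, Tuple[str, datetime]] = field(default_factory=dict)

    def reserve(
        self, user: str, now: datetime, duration: timedelta
    ) -> Optional[str]:
        """Return the name of a free environment, or None if all are taken."""
        for env in self.environments:
            holder = self._held.get(env)
            if holder is None or holder[1] <= now:
                # Free or expired. A real tool would reset the
                # environment's state at this point before handing it over.
                self._held[env] = (user, now + duration)
                return env
        return None


# Illustration with two hypothetical test environments.
pool = ReservationPool(("test-1", "test-2"))
start = datetime(2000, 1, 1, 9, 0)
first = pool.reserve("alice", start, timedelta(hours=2))
second = pool.reserve("bob", start, timedelta(hours=2))
third = pool.reserve("carol", start, timedelta(hours=2))  # pool is full
later = pool.reserve("carol", start + timedelta(hours=3), timedelta(hours=1))
print(first, second, third, later)
```

Passing `now` in explicitly keeps the sketch deterministic; a real tool would presumably read the clock and persist reservations somewhere shared.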

I think there are two parts to the deployment process. This ticket seems to specifically only mention infrastructure. Is there a separate ticket for deployment of application code? Anyways, I'll include both in the breakdown below.

Deployment of infrastructure
1) (baseline) Infrastructure as code practice is followed in explicitly defining all required resources for a component in Terraform, AWS CloudFormation templates, Google Cloud templates, etc.
2) (baseline) Process for deploying a new environment or updating existing environment infrastructure within an HCA account is documented.
3) (stretch) Any developer can clone the repository and follow the instructions to deploy into their own cloud account (AWS/GCP).

Deployment of application code
1) (baseline) All application deployments for all environments should run within the HCA GitLab box for centralized deployments.
2) (baseline) All secrets should be stored in AWS Secrets Manager, AWS Parameter Store, or the cloud secrets store of your choice and injected during application deployment. No local copies of secrets should be passed around.
3) (baseline) Each component should have documented release instructions in GitLab with a running list of FAQs to help the release engineer resolve issues that arise.
4) (stretch) Component can be promoted to the next environment with the click of a button. This will eventually allow the release manager to deploy all components without the active presence of a deployment engineer (other than to be on call if something goes wrong).
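
To make the secrets item concrete, here is a minimal sketch of fetching secrets at deploy time and injecting them into the deployment environment. It is not the DCP's actual deployment code: the secret identifier, the variable names, the `build_deploy_env` helper, and the in-memory `fake_store` are all hypothetical. `fetch_from_aws` shows how the same callable could wrap boto3's Secrets Manager client, assuming boto3 and credentials are available.

```python
from typing import Callable, Dict, Mapping


def fetch_from_aws(secret_id: str) -> str:
    """Fetch a secret string from AWS Secrets Manager."""
    import boto3  # imported lazily so the sketch runs without AWS access

    client = boto3.client("secretsmanager")
    return client.get_secret_value(SecretId=secret_id)["SecretString"]


def build_deploy_env(
    base_env: Mapping[str, str],
    secret_ids: Mapping[str, str],
    fetch: Callable[[str], str],
) -> Dict[str, str]:
    """Return a copy of base_env with each secret fetched and added.

    secret_ids maps an environment variable name to a secret identifier
    (both hypothetical here). Nothing is written to disk.
    """
    env = dict(base_env)
    for var_name, secret_id in secret_ids.items():
        env[var_name] = fetch(secret_id)
    return env


# Stand-in for a real secrets store, for illustration only.
fake_store = {"dcp/dev/upload/api_key": "not-a-real-key"}

env = build_deploy_env(
    {"DEPLOYMENT_STAGE": "dev"},
    {"UPLOAD_API_KEY": "dcp/dev/upload/api_key"},
    fake_store.__getitem__,
)
print(sorted(env))  # variable names only; never print secret values
```

In a pipeline job, `fetch_from_aws` would be passed in place of `fake_store.__getitem__`, so the secret only ever exists in the job's process environment.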

@justincc @parthshahva I agree this can be confusing. As written in the description, this epic is about automated turnkey deployment of entire personal DCP environments, which I think is the wrong goal for Q3. Instead, for Q3 we should work on making deployment of incremental application code and infrastructure changes as smooth and automatic as possible.

I agree with @parthshahva's breakdown, except I would like Deployment of application code - 4 (one-click promotion+deployment) to be a requirement.

We currently have too much manual work involved in promoting components. Engineers from each component are required to be present and to personally execute the promotion commands. That's not a sustainable model: it makes for a very burdensome coordination exercise that takes time away from engineers who are already spread too thin.
Instead, the release manager should be able to perform the one-click release for each component on auto-pilot unless that component is indicated in the release notes as requiring a manual step.
Release engineers for each component then only need to be on call, and not actively participating in the process.
As implemented by the Data Store and Query Service (and others I may not be aware of), the component-specific integration tests that follow a release of a component indicate whether the release was successful. The release manager's responsibility should be to monitor the integration tests, follow up with the component release engineer and roll back the deployment if the integration tests fail.
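
The flow described above could be sketched roughly as follows. This is an illustration under stated assumptions, not how any DCP component implements it: the `Component` type, the `promote_all` function, the component names, and the stub callables are hypothetical, and real `deploy`, `run_integration_tests`, and `rollback` steps would be GitLab pipeline jobs rather than Python lambdas.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Component:
    name: str
    needs_manual_step: bool  # as flagged in the release notes


def promote_all(
    components: List[Component],
    deploy: Callable[[str], None],
    run_integration_tests: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> Dict[str, str]:
    """Promote each component, rolling back any whose integration tests fail.

    Returns a mapping of component name to one of
    "deployed", "rolled_back", or "manual".
    """
    results = {}
    for component in components:
        if component.needs_manual_step:
            # Leave this one for the component's release engineer.
            results[component.name] = "manual"
            continue
        deploy(component.name)
        if run_integration_tests(component.name):
            results[component.name] = "deployed"
        else:
            rollback(component.name)
            results[component.name] = "rolled_back"
    return results


# Illustration with stub callables; "query" fails its tests here.
log = []
results = promote_all(
    [
        Component("data-store", needs_manual_step=False),
        Component("query", needs_manual_step=False),
        Component("ingest", needs_manual_step=True),
    ],
    deploy=lambda name: log.append(("deploy", name)),
    run_integration_tests=lambda name: name != "query",
    rollback=lambda name: log.append(("rollback", name)),
)
print(results)
```

The point of the shape is that the release manager only needs the returned summary: anything marked "manual" or "rolled_back" is a follow-up with that component's on-call release engineer.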

I agree with @sampierson that the calculus does not work out in support of on-demand turnkey complete DCP environments. While we want to make the infrastructure bring-up as automated as possible (and Terraform takes us 90%+ of the way there), it still requires significant time, resources, and manual work to bring up multiple services together in this way. We should work on reducing that overhead, but I'm skeptical about the value of achieving this in Q3, and this work requires significant investment in making the DCP services' configuration and deployment practices more uniform, bringing us right back to automating incremental deployments of application code.

@diekhans what is the specific test scenario where you require a complete private test environment? I suggest we figure out the testing needs attendant in that scenario, and formulate a test strategy using our existing dev and integration environments.

> As implemented by the Data Store and Query Service (and others I may not be aware of), the component-specific integration tests that follow a release of a component indicate whether the release was successful. The release manager's responsibility should be to monitor the integration tests, follow up with the component release engineer and roll back the deployment if the integration tests fail.

Ingest needs significant work to get to the same place, not least because we have many interacting microservices. In an ideal world I would love to have that as part of this Q3 epic. However, strictly speaking it may be separable from the infrastructure and application code deployment breakdown outlined by @parthshahva.

> @diekhans what is the specific test scenario where you require a complete private test environment? I suggest we figure out the testing needs attendant in that scenario, and formulate a test strategy using our existing dev and integration environments.

Specifically testing during development. How does one develop something that requires multiple components without impacting other developers?

> How does one develop something that requires multiple components without impacting other developers?

By defining an API between the components
By pointing the in-development version of a component at the integration or staging versions of other components
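
The second option can be sketched as a small endpoint map with local overrides. The URLs below are placeholders rather than real DCP endpoints, and `SHARED_ENDPOINTS` and `resolve_endpoints` are hypothetical names used only for this example.

```python
from typing import Dict, Optional

# Hypothetical shared-environment endpoints; placeholders, not real DCP URLs.
SHARED_ENDPOINTS = {
    "integration": {
        "ingest": "https://ingest.integration.example.org",
        "data-store": "https://dss.integration.example.org",
        "upload": "https://upload.integration.example.org",
    },
}


def resolve_endpoints(
    shared_env: str,
    local_overrides: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Use shared-environment endpoints except for components run locally."""
    endpoints = dict(SHARED_ENDPOINTS[shared_env])  # copy; leave the map intact
    endpoints.update(local_overrides or {})
    return endpoints


# A developer working on Upload runs it locally and borrows everything else.
endpoints = resolve_endpoints(
    "integration", {"upload": "http://localhost:8080"}
)
print(endpoints["upload"])      # the in-development copy
print(endpoints["data-store"])  # the shared integration copy
```

Only the component under development is replaced, so other developers using the shared integration services are unaffected as long as the in-development component does not write incompatible data to them.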

To reiterate - I agree that we should work on reducing the overhead of standing up a personal DCP, but that is a hard goal, it will require a lot of integration and maintenance, it will be overkill in many developers' cases, and it won't be achievable as soon or as easily as other CI/CD objectives under discussion.

I think we must get to the point where incremental CI/CD in existing environments can be fully automated. By not doing so we rob our developers of precious time and focus. If Ingest needs work to get there, then we all need to go help Ingest. With that said, other components (both K8S based and on other stacks) have successfully leveraged GitLab and Terraform to make it happen, and I think the lift for the remaining components is comparatively much easier.

I completely agree that CI/CD is very important and achievable. These kinds of environments tend to be time-consuming to set up if one doesn't have experience. It would be great to have experienced people assist with GitLab/Terraform and provide best-practices guidance.

Can we create a tiger team to assist groups in adding CI/CD?

The spike blocking this objective has completed and resulted in the following goals for this Q3 objective:

(1) All deployments are done in GitLab. This will capture all deployment results in one tool. By virtue of having all deployments done on GitLab, we will have logged deployment results.
(2) All configuration changes are in GitLab; no configuration is being done manually.
(3) We have runbooks describing the deployment process for each component to anyone who would try to deploy any component.
(4) (stretch) We have code coverage for our unit and integration tests so we can identify our holes in testing. This will be necessary if we want to start moving towards a more CI/CD setup. This work is a stretch goal for Q3, so may shift to Q4 work.

I've updated the description in this ticket to reflect this.

Given that we succeed or fail as a team, support from the entire team and cross-team collaboration are expected to achieve this goal.

Linking to spreadsheet to track progress: https://docs.google.com/spreadsheets/d/1W1ecrFP7NTny7Fm553nuW4JrIb6jM3bEgNafX-eIddM/edit#gid=0

This work is expected to slip into Q4 as carryover work. This information has been added to the epic description and release + milestones adjusted accordingly.

This epic has been re-scoped as an informational epic to track work associated with improvement to deployment. Individual epics are open for each objective. Tickets associated with code coverage and capturing code coverage metrics have been re-aligned to a separate, iceboxed epic since this is a Q4 stretch goal and the DCP is not ready to stretch.

I am meeting with @prabh-t about the remaining issue in moving all deployments to gitlab today and will close that ticket when this is complete.

HumanCellAtlas / dcp