NASA-IMPACT / admg-casei

ADMG Inventory
https://impact.earthdata.nasa.gov/casei/
Apache License 2.0

Discussion Issue: Post-mortem: CI/CD Pipeline Issue (5/11/23 - 5/22/23) #522

Closed · naomatheus closed this 1 year ago

naomatheus commented 1 year ago

Post-mortem: CI/CD Pipeline Issue (5/11 - 5/22)

Summary

This post-mortem outlines a significant issue we encountered with our CI/CD pipeline between 5/11 and 5/22, which led to substantial disruption in our workflow. We'll discuss the solutions we've implemented to resolve this issue and how we plan to prevent similar incidents in the future.

Incident

Date

5/11 - 5/22

Duration

5 days (half sprint)

Impact

This incident caused a delay in feature deployment to the CASEI production site. The production deployment delay totaled around a half sprint (or five working days). As a result, multiple features completed over the past two sprints have yet to be deployed; as of writing, they are cleared for deployment.

The incident also surfaced necessary CI/CD pipeline configuration changes that had been overlooked, as well as differences between the local development, staging CI, and production CI environments. Finally, at least three team members spent time on CI instead of their own feature priorities.

Root Cause

The primary cause of the incident was a discrepancy between our CI/CD pipeline and local development environments, leading to inconsistencies in the test outcomes.

Mitigation and Resolution

Short Term Mitigation

Long Term Solutions

Lessons Learned

What went well

Where we got lucky

Future Actions

Conversations

edkeeble commented 1 year ago

to avoid scenarios like https://github.com/NASA-IMPACT/admg-casei/pull/516 where multiple features become part of a deployment history.

This is just a side effect of gitflow. I've never seen a gitflow implementation where some features didn't pile up in staging and get deployed together. We could look at alternative git workflows, but if we maintain a single staging environment, we'll either end up deploying multiple features to production at once or blocking features from going to staging until the current staging feature has been deployed to prod. We could implement multiple development environments, where features are deployed and tested in isolation, and then pushed to production independently. Then it becomes a question of where we want to test our code integration: staging or production?

edkeeble commented 1 year ago

Add pre-commit hooks so that commits to the repository raise linting issues locally.

This is not a bad idea, but in most cases will only save a few minutes of developer time. Linting in CI should fail quickly enough that it reminds the developer to go back and run/fix linting locally, anyway.
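
For reference, a minimal sketch of what that could look like with husky and lint-staged (neither is currently a dependency here as far as this thread goes; the versions and globs below are illustrative):

```json
{
  "scripts": {
    "prepare": "husky install"
  },
  "devDependencies": {
    "husky": "8.0.3",
    "lint-staged": "13.2.2"
  },
  "lint-staged": {
    "*.{js,jsx}": ["eslint --fix", "prettier --write"]
  }
}
```

The generated `.husky/pre-commit` hook would just run `npx lint-staged`, so only staged files are linted and a bad commit fails locally rather than minutes later in Actions.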

Pin specific versions of all packages

The yarn.lock file does this automatically. We should not be deleting this file unless strictly necessary.

Implement a rollback threshold

This sounds like a good idea. How would you envision it working? Deploy to production > run e2e tests on prod > catch any raised exceptions and trigger a rollback?
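
One possible shape for this in Actions, sketched below; the job name, deploy/test scripts, and rollback ref are placeholders, not our actual workflow:

```yaml
# Hypothetical deploy-verify-rollback job; script names and refs are placeholders.
jobs:
  deploy-production:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: yarn install --frozen-lockfile
      - run: yarn deploy:prod              # placeholder deploy script
      - name: Run e2e tests against production
        id: e2e
        run: yarn test:e2e                 # placeholder e2e script
        continue-on-error: true
      - name: Roll back on e2e failure
        if: steps.e2e.outcome == 'failure'
        run: yarn deploy:prod --ref ${{ github.event.before }}  # redeploy previous ref (placeholder)
```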

It should be possible to push to staging with ease and have linting issues raised, but not block deployment to non-production environments.

Are you saying we should allow merging to staging even if linting fails?

naomatheus commented 1 year ago

to avoid scenarios like #516 where multiple features become part of a deployment history.

This is just a side effect of gitflow. I've never seen a gitflow implementation where some features didn't pile up in staging and get deployed together. We could look at alternative git workflows, but if we maintain a single staging environment, we'll either end up deploying multiple features to production at once or blocking features from going to staging until the current staging feature has been deployed to prod. We could implement multiple development environments, where features are deployed and tested in isolation, and then pushed to production independently. Then it becomes a question of where we want to test our code integration: staging or production?

Yeah, I think you're right that this is to be expected. What do you think of just making sure any new feature is deployed "all the way" - as in, it's in dev, in staging, and in prod - so to speak? In this case I was surprised to see that spatial bounds changes and other things from before my specific work on CASEI weren't in production. So, what I'm really advocating for is more frequent deployments to production. I definitely like gitflow more than the alternatives. It does get backed up sometimes, but that's to be expected.

naomatheus commented 1 year ago

The yarn.lock file does this automatically. We should not be deleting this file unless strictly necessary.

I see. I'm not as familiar with the npm and yarn toolkits. What I want to suggest is pinning versions without the caret, for example "gatsby-plugin-image": "2.8.1" instead of "gatsby-plugin-image": "^2.8.1". I was trying to figure out why the "conflicting playwright versions" error was surfacing when our package.json contained no two diverging versions of playwright. I think a package dependency used a lower sub-version of playwright, and this led to conflicts; the conflicts were in sub-dependencies explicitly defined in the lock files. If we "hard pin" the versions without the caret, this should reduce the likelihood of version conflicts that cause CI failures. The configuration changes needed to get playwright out the door with this feature were substantial, and included things I would not expect and did not have to perform in my local dev environment, such as upgrading npm versions and adding gatsby postcss. Would removing the carets help?

An added benefit would be the ability to isolate the work of upgrading versions and adjusting to new package releases.
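
A rough sketch of the package.json shape being suggested: an exact pin for a direct dependency, plus a yarn resolutions entry to force a single transitive playwright version (the resolutions version below is illustrative):

```json
{
  "dependencies": {
    "gatsby-plugin-image": "2.8.1"
  },
  "resolutions": {
    "playwright": "1.32.3"
  }
}
```

resolutions is yarn classic's mechanism for overriding sub-dependency versions, which is where the conflicting playwright versions showed up; exact pins alone only constrain our direct dependencies.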

naomatheus commented 1 year ago

Are you saying we should allow merging to staging even if linting fails?

Yes. I am aware that the suggestion might seem a bit risky, but I believe it's a conversation worth having. In this development cycle I unfortunately had to roll back some thoughtful code enhancements proposed by another dev (removed works in progress in this commit). I decided to roll back because of issues with the window object that are specific to our GitHub Actions CI environments. It's a WARN locally, but it causes a failure in the CI pipeline.

It's unfortunate to roll back improvements, and it made me realize that we might be missing out on valuable testing opportunities in real browser environments. ESLint/Jest play a role in maintaining the quality of our codebase by identifying unused variables, line spacing, etc., and that's undeniably important. However, passing ESLint isn't necessarily indicative of a feature's functionality and value for the client.

So, I propose that we allow functional features to progress into the staging environment despite minor linting errors. The benefit would be additional insight into features post-deployment in the very specific S3 subpath environment, and it could perhaps even expedite our development process. I believe it would be beneficial for works in progress, and for improvements that are stylistic or aimed at enhancing code quality, to be tested in the staging environment.

naomatheus commented 1 year ago

This sounds like a good idea. How would you envision it working? Deploy to production > run e2e tests on prod > catch any raised exceptions and trigger a rollback?

For this, honestly, I was just thinking of doing it manually within staging. Let's say a change breaks in S3 for some reason, and it didn't break locally: check out an older commit, push, and we redeploy the older working code to staging.
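
Roughly like the sketch below, assuming CI redeploys whatever lands on the staging branch (branch name and SHA are placeholders):

```sh
# Point staging back at a known-good commit and let CI redeploy it.
git fetch origin
git checkout -B staging <known-good-sha>
git push --force-with-lease origin staging
```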

I'm all for having a very high standard for prod deployments, with all the tests and all the things.

edkeeble commented 1 year ago

I decided to roll back because of issues with the window object that are specific to our GitHub Actions CI environments. It's a WARN locally, but it causes a failure in the CI pipeline.

This feels like a configuration problem more than an issue with requiring linting to pass. Even if the solution is to ignore a specific line of code when linting, that seems preferable to no longer enforcing linting.
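
For example, something like the pattern below: guard window access so the SSR/CI build never touches it, and scope any remaining lint exception to a single line (the helper and rule names here are illustrative, not from our codebase):

```js
// Hypothetical helper: Gatsby builds run in Node, where window is undefined.
const getViewportWidth = () => {
  if (typeof window === "undefined") return 0
  // eslint-disable-next-line no-restricted-globals -- illustrative rule name
  return window.innerWidth
}
```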

naomatheus commented 1 year ago

Hmm... true. There's a way forward without changing the linting step in the CI pipeline. Let's still discuss E2E testing, though. I like the overall strategy, but requiring 100% E2E passage before deploying to staging seems like an impediment. I'd like to get features into an S3 subpath environment, in actual usage, in a browser, etc., more quickly.

I have ideas; I'll just paste them here:

I actually think pre-commit could address a lot of these. It'd speed up my workflow personally, too. I do think debugging in Actions is much more difficult and time-consuming. That being said, I'm concerned about situations where E2E, linting, and unit tests pass locally but then do not pass in CI.

naomatheus commented 1 year ago

Idea Drop: let's say we close an Issue; the feature is done and merged in, but it hasn't been deployed to production yet. Should we track that somehow within the Issue, or separately in our sprint planning? For example (as of writing this), Issue #511 is done, reviewed, and merged, but I'm still in the process of deploying it to prod.

naomatheus commented 1 year ago

Closing this issue with the following action items, which will be added to the project mgmt docs:

naomatheus commented 1 year ago

Hey @edkeeble, did you have any concerns with pinning versions of dependencies? The ^1.1.1 syntax does ensure that a package won't automatically update to a new MAJOR version, but it can update to minor or patch versions, which could themselves have dependencies with a different major version.
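
Concretely (the package names and versions here are only for illustration):

```json
{
  "dependencies": {
    "example-lib": "^1.1.1",
    "example-lib-pinned": "1.1.1"
  }
}
```

On a fresh install the caret entry may resolve to 1.2.0 or 1.9.4, while the pinned entry resolves only to 1.1.1; in both cases transitive dependencies can still float unless they're also constrained (e.g. via resolutions or the lockfile).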