NASA-IMPACT / admg-casei

ADMG Inventory
https://impact.earthdata.nasa.gov/casei/
Apache License 2.0

Discussion Issue: Post-mortem: CI/CD Pipeline Issue (5/11/23 - 5/22/23) #522

Closed · naomatheus closed this 1 year ago

naomatheus commented 1 year ago

Post-mortem: CI/CD Pipeline Issue (5/11 - 5/22)

Summary

This post-mortem outlines a significant issue we encountered with our CI/CD pipeline between 5/11 and 5/22, which led to substantial disruption in our workflow. We'll discuss the solutions we've implemented to resolve this issue and how we plan to prevent similar incidents in the future.

Incident

Date

5/11 - 5/22

Duration

5 days (half sprint)

Impact

This incident caused a delay in feature deployment to the CASEI production site. The production deployment delay totaled around a half sprint (or five working days). As a result, multiple features completed over the past two sprints have yet to be deployed; as of writing, they are cleared for deployment.

The incident also surfaced necessary CI/CD pipeline configuration changes that had been overlooked, as well as differences between the local development, staging CI, and production CI environments. Finally, at least three team members spent time on CI instead of their own feature priorities.

Root Cause

The primary cause of the incident was a discrepancy between our CI/CD pipeline and local development environments, leading to inconsistencies in the test outcomes.

Mitigation and Resolution

Short Term Mitigation

Long Term Solutions

Lessons Learned

What went well

Where we got lucky

Future Actions

Conversations

edkeeble commented 1 year ago

to avoid scenarios like https://github.com/NASA-IMPACT/admg-casei/pull/516 where multiple features become part of a deployment history.

This is just a side effect of gitflow. I've never seen a gitflow implementation where some features didn't pile up in staging and get deployed together. We could look at alternative git workflows, but if we maintain a single staging environment, we'll either end up deploying multiple features to production at once or blocking features from going to staging until the current staging feature has been deployed to prod. We could implement multiple development environments, where features are deployed and tested in isolation, and then pushed to production independently. Then it becomes a question of where we want to test our code integration: staging or production?

edkeeble commented 1 year ago

Add pre-commit hooks so that commits to the repository raise linting issues locally.

This is not a bad idea, but in most cases will only save a few minutes of developer time. Linting in CI should fail quickly enough that it reminds the developer to go back and run/fix linting locally, anyway.
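
For reference, a minimal sketch of what that could look like with husky and lint-staged (neither is currently a dependency here as far as this thread goes; the versions and globs below are illustrative):

```json
{
  "scripts": {
    "prepare": "husky install"
  },
  "devDependencies": {
    "husky": "8.0.3",
    "lint-staged": "13.2.2"
  },
  "lint-staged": {
    "*.{js,jsx}": ["eslint --fix", "prettier --write"]
  }
}
```

The generated `.husky/pre-commit` hook would just run `npx lint-staged`, so only staged files are linted and a bad commit fails locally rather than minutes later in Actions.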

Pin specific versions of all packages

The yarn.lock file does this automatically. We should not be deleting this file unless strictly necessary.

Implement a rollback threshold

This sounds like a good idea. How would you envision it working? Deploy to production > run e2e tests on prod > catch any raised exceptions and trigger a rollback?
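
One possible shape for this in Actions, sketched below; the job name, deploy/test scripts, and rollback ref are placeholders, not our actual workflow:

```yaml
# Hypothetical deploy-verify-rollback job; script names and refs are placeholders.
jobs:
  deploy-production:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: yarn install --frozen-lockfile
      - run: yarn deploy:prod              # placeholder deploy script
      - name: Run e2e tests against production
        id: e2e
        run: yarn test:e2e                 # placeholder e2e script
        continue-on-error: true
      - name: Roll back on e2e failure
        if: steps.e2e.outcome == 'failure'
        run: yarn deploy:prod --ref ${{ github.event.before }}  # redeploy previous ref (placeholder)
```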

It should be possible to push to staging with ease and have linting issues raised, but not block deployment to non-production environments.

Are you saying we should allow merging to staging even if linting fails?

naomatheus commented 1 year ago

to avoid scenarios like #516 where multiple features become part of a deployment history.

This is just a side effect of gitflow. I've never seen a gitflow implementation where some features didn't pile up in staging and get deployed together. We could look at alternative git workflows, but if we maintain a single staging environment, we'll either end up deploying multiple features to production at once or blocking features from going to staging until the current staging feature has been deployed to prod. We could implement multiple development environments, where features are deployed and tested in isolation, and then pushed to production independently. Then it becomes a question of where we want to test our code integration: staging or production?

Yeah, I think you're right that this is to be expected. What do you think of just making sure any new feature is deployed "all the way" - as in, it's in dev, in staging, and in prod - so to speak? In this case I was surprised to see that spatial bounds changes and other things from before my specific work on CASEI weren't in production. So, what I'm really advocating for is more frequent deployments to production. I definitely like gitflow more than the alternatives. It does get backed up sometimes, but that's to be expected.

naomatheus commented 1 year ago

The yarn.lock file does this automatically. We should not be deleting this file unless strictly necessary.

I see. I'm not as familiar with the npm and yarn toolkits. What I want to suggest is pinning versions without the caret, for example "gatsby-plugin-image": "2.8.1" instead of "gatsby-plugin-image": "^2.8.1". I was trying to figure out why the "conflicting playwright versions" error was surfacing when our package.json contained no two diverging versions of playwright. I think a package dependency used a lower sub-version of playwright, and this led to conflicts; the conflicts were in sub-dependencies explicitly defined in the lock files. If we "hard pin" the versions without the caret, this should reduce the likelihood of version conflicts that cause CI failures. The configuration changes needed to get playwright out the door with this feature were substantial, and included things I would not expect and did not have to perform in my local dev environment, such as upgrading npm versions and adding gatsby postcss. Would removing the carets help?

An added benefit would be the ability to isolate the work of upgrading versions and adjusting to new package releases.
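
A rough sketch of the package.json shape being suggested: an exact pin for a direct dependency, plus a yarn resolutions entry to force a single transitive playwright version (the resolutions version below is illustrative):

```json
{
  "dependencies": {
    "gatsby-plugin-image": "2.8.1"
  },
  "resolutions": {
    "playwright": "1.32.3"
  }
}
```

resolutions is yarn classic's mechanism for overriding sub-dependency versions, which is where the conflicting playwright versions showed up; exact pins alone only constrain our direct dependencies.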

naomatheus commented 1 year ago

Are you saying we should allow merging to staging even if linting fails?

Yes. I am aware that the suggestion might seem a bit risky, but I believe it's a conversation worth having. In this development cycle I unfortunately had to roll back some thoughtful code enhancements proposed by another dev (removed works in progress in this commit). I decided to roll back because of issues with the window object that are specific to our GitHub Actions CI environments. It's a WARN locally, but it causes a failure in the CI pipeline.

It's unfortunate to roll back improvements, and it made me realize that we might be missing out on valuable testing opportunities in real browser environments. ESLint/Jest play a role in maintaining the quality of our codebase by identifying unused variables, line spacing, etc., and that's undeniably important. However, passing ESLint isn't necessarily indicative of a feature's functionality and value for the client.

So, I propose that we allow functional features to progress into the staging environment despite minor linting errors. The benefit would be additional insight into features post-deployment in the very specific S3 subpath environment, and it could perhaps even expedite our development process. I believe it would be beneficial for works in progress, and for improvements that are stylistic or aimed at enhancing code quality, to be tested in the staging environment.

naomatheus commented 1 year ago

This sounds like a good idea. How would you envision it working? Deploy to production > run e2e tests on prod > catch any raised exceptions and trigger a rollback?

For this, honestly, I was just thinking of doing it manually within staging. Let's say a change breaks in S3 for some reason, and it didn't break locally: check out an older commit, push, and we redeploy the older working code to staging.
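
Roughly like the sketch below, assuming CI redeploys whatever lands on the staging branch (branch name and SHA are placeholders):

```sh
# Point staging back at a known-good commit and let CI redeploy it.
git fetch origin
git checkout -B staging <known-good-sha>
git push --force-with-lease origin staging
```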

I'm all for having a very high standard for prod deployments, with all the tests and all the things.

edkeeble commented 1 year ago

I decided to roll back because of issues with the window object that are specific to our GitHub Actions CI environments. It's a WARN locally, but it causes a failure in the CI pipeline.

This feels like a configuration problem more than an issue with requiring linting to pass. Even if the solution is to ignore a specific line of code when linting, that seems preferable to no longer enforcing linting.
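
For example, something like the pattern below: guard window access so the SSR/CI build never touches it, and scope any remaining lint exception to a single line (the helper and rule names here are illustrative, not from our codebase):

```js
// Hypothetical helper: Gatsby builds run in Node, where window is undefined.
const getViewportWidth = () => {
  if (typeof window === "undefined") return 0
  // eslint-disable-next-line no-restricted-globals -- illustrative rule name
  return window.innerWidth
}
```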

naomatheus commented 1 year ago

Hmm... true. There's a way forward without changing the linting step in the CI pipeline. Let's still discuss E2E testing, though. I like the overall strategy, but requiring 100% E2E passage before deploying to staging seems like an impediment. I'd like to get features into an S3 subpath environment, in actual usage, in a browser, etc., more quickly.

I have ideas; I'll just paste them here:

I actually think pre-commit could address a lot of these. It'd speed up my workflow personally, too. I do think debugging in Actions is much more difficult and time-consuming. That being said, I'm concerned about situations where E2E, linting, and unit tests pass locally but then do not pass in CI.

naomatheus commented 1 year ago

Idea Drop: let's say we close an Issue; the feature is done and merged in, but it hasn't been deployed to production yet. Should we track that somehow within the Issue, or separately in our sprint planning? For example (as of writing this), Issue #511 is done, reviewed, and merged, but I'm still in the process of deploying it to prod.

naomatheus commented 1 year ago

Closing this issue with the following action items, which will be added to the project mgmt docs:

naomatheus commented 1 year ago

Hey @edkeeble, did you have any concerns with pinning versions of dependencies? The ^1.1.1 syntax does ensure that a package won't automatically update to a new MAJOR version, but it can update to minor or patch versions, which could themselves have dependencies with a different major version.
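
Concretely (the package names and versions here are only for illustration):

```json
{
  "dependencies": {
    "example-lib": "^1.1.1",
    "example-lib-pinned": "1.1.1"
  }
}
```

On a fresh install the caret entry may resolve to 1.2.0 or 1.9.4, while the pinned entry resolves only to 1.1.1; in both cases transitive dependencies can still float unless they're also constrained (e.g. via resolutions or the lockfile).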