Closed naomatheus closed 1 year ago
> to avoid scenarios like https://github.com/NASA-IMPACT/admg-casei/pull/516 where multiple features become part of a deployment history.
This is just a side effect of gitflow. I've never seen a gitflow implementation where some features didn't pile up in staging and get deployed together. We could look at alternative git workflows, but if we maintain a single staging environment, we'll either end up deploying multiple features to production at once or blocking features from going to staging until the current staging feature has been deployed to prod. We could implement multiple development environments, where features are deployed and tested in isolation, and then pushed to production independently. Then it becomes a question of where we want to test our code integration: staging or production?
> Add pre-commit hooks so that commits to the repository raise linting issues locally.
This is not a bad idea, but in most cases will only save a few minutes of developer time. Linting in CI should fail quickly enough that it reminds the developer to go back and run/fix linting locally, anyway.
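As a sketch of what that could look like in this repo, assuming husky and lint-staged (neither tool is named in the thread, so treat these as placeholders):

```json
{
  "scripts": {
    "prepare": "husky install"
  },
  "lint-staged": {
    "*.{js,jsx}": "eslint --max-warnings=0"
  }
}
```

A `.husky/pre-commit` file would then run `npx lint-staged`, so only staged files are linted on each commit.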
> Pin specific versions of all packages
The `yarn.lock` file does this automatically. We should not be deleting this file unless strictly necessary.
> Implement a rollback threshold
This sounds like a good idea. How would you envision it working? Deploy to production > run e2e tests on prod > catch any raised exceptions and trigger a rollback?
> It should be possible to push to staging with ease and have linting issues raised, but not block deployment to non-production environments.
Are you saying we should allow merging to staging even if linting fails?
> to avoid scenarios like #516 where multiple features become part of a deployment history.
>
> This is just a side effect of gitflow. I've never seen a gitflow implementation where some features didn't pile up in staging and get deployed together. We could look at alternative git workflows, but if we maintain a single staging environment, we'll either end up deploying multiple features to production at once or blocking features from going to staging until the current staging feature has been deployed to prod. We could implement multiple development environments, where features are deployed and tested in isolation, and then pushed to production independently. Then it becomes a question of where we want to test our code integration: staging or production?
Yeah, I think you're right that this is to be expected. What do you think of just making sure any new feature is deployed "all the way", as in it's in dev, in staging, and in prod, so to speak? In this case I was surprised to see that spatial bounds changes and other things from before my specific work on CASEI weren't in production. So, what I'm really advocating for is more frequent deployments to production. I definitely like gitflow more than the alternatives. It does get backed up sometimes, but that's to be expected.
> The `yarn.lock` file does this automatically. We should not be deleting this file unless strictly necessary.
I see. I'm not as familiar with the npm and yarn toolkits. What I want to suggest is pinning the versions without the caret: for example, `"gatsby-plugin-image": "2.8.1"` instead of `...: "^2.8.1"`.
I was trying to figure out why the "conflicting playwright versions" error was surfacing when we had no two diverging versions of playwright in our `package.json`. I think that there was a package dependency that used a lower sub-version of playwright, and this led to conflicts.
The conflicts were in sub-dependencies explicitly defined in the lock files. If we "hard pin" the versions without the caret, this should reduce the likelihood of version conflicts that cause CI failures.
The configuration changes needed to get playwright out the door with this feature were substantial, and included things I would not expect and did not have to perform in my local dev environment, such as upgrading npm versions and adding gatsby postcss.
Would removing the carets help?
An added benefit could be the ability to isolate the work of upgrading versions and adjusting to new package releases, etc.
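Concretely, the proposal amounts to dropping the caret in `package.json` (the version number here is the one mentioned above):

```json
{
  "dependencies": {
    "gatsby-plugin-image": "2.8.1"
  }
}
```

Note that a hard pin in `package.json` constrains only the direct dependency's own version; transitive dependencies are still resolved and pinned by `yarn.lock`, which is where the conflicting sub-dependency versions were defined.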
> Are you saying we should allow merging to staging even if linting fails?
Yes. I am aware that the suggestion might seem a bit risky, but I believe it's a conversation worth having. In this development cycle I unfortunately had to roll back some thoughtful code enhancements proposed by another dev (removed works in progress in this commit). I decided to roll back because of specific issues associated with the `window` object that are specific to our GitHub Actions CI environments. It's a WARN locally, but causes failure in a CI pipeline.
It's unfortunate to roll back improvements, and it made me realize that we might be missing out on valuable testing opportunities in real browser environments. ESLint and Jest play a role in maintaining the quality of our codebase by identifying unused variables, line spacing, etc., and that's undeniably important. However, passing ESLint isn't necessarily indicative of a feature's functionality and value for the client.
So, I propose that we allow functional features to progress into the staging environment despite minor linting errors. The benefit would be additional insight into deployed features in the very specific S3 subpath environment, and this could perhaps even expedite our development process. I believe it would be beneficial for works in progress, and for improvements that are stylistic or aimed at enhancing code quality, to be tested in the staging environment.
> This sounds like a good idea. How would you envision it working? Deploy to production > run e2e tests on prod > catch any raised exceptions and trigger a rollback?
For this, honestly, I was just thinking of doing it manually within staging. Let's say a change breaks in S3 for some reason, and it didn't break locally. We check out an older commit, push, and redeploy the older working code to staging.
I'm all for having a very high standard for prod deployments. With all the tests and all the things.
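As a sketch, the manual rollback described above could be wired up as a `workflow_dispatch` job so redeploying a known-good commit is a one-click operation. All names here (workflow name, bucket, build commands) are placeholders, not taken from the CASEI repo:

```yaml
name: rollback-staging
on:
  workflow_dispatch:
    inputs:
      ref:
        description: "Known-good commit or tag to redeploy"
        required: true
jobs:
  redeploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.inputs.ref }}
      - run: yarn install --frozen-lockfile
      - run: yarn build
      # Sync the rebuilt site to the staging S3 path (bucket is a placeholder)
      - run: aws s3 sync public/ s3://casei-staging-bucket --delete
```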
> I decided to roll back because of specific issues associated with the `window` object that are specific to our GitHub Actions CI environments. It's a WARN locally, but causes failure in a CI pipeline.
This feels like a configuration problem more than an issue with requiring linting to pass. Even if the solution is to ignore a specific line of code when linting, that seems preferable to no longer enforcing linting.
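For the `window` case specifically, either fix can be small and targeted. A sketch (the helper name is hypothetical, not from the CASEI codebase):

```javascript
// Guard browser-only globals so the same code runs under Node (Jest/CI),
// where `window` is undefined, without throwing or tripping lint.
function currentPath(fallback = "/") {
  if (typeof window === "undefined") return fallback;
  return window.location.pathname;
}

// Alternatively, suppress lint for a single line instead of relaxing it repo-wide:
// eslint-disable-next-line no-undef
// const loc = window.location;

module.exports = { currentPath };
```

The guard pattern also matters for Gatsby's server-side rendering, where browser globals are unavailable at build time.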
Hmm... true. There's a way forward without changing the linting as part of the CI pipeline. Let's still discuss E2E testing. I like the overall strategy, but requiring 100% passage of E2E as a step before deploying to staging seems like an impediment. I'd like to have features in an S3 subpath environment, in actual usage, in a browser, etc., more quickly.
I have ideas, I'll just paste them here:

- Separate the `develop` branch from the `deploy to staging` pipeline workflow. Right now, `deploy to staging` runs when E2E testing passes. We create a branch for the `deploy to staging` pipeline to be triggered after a merge to that branch. `staging` as a branch?
- Dev containers: `ubuntu:jammy`, a gatsby GraphQL image? This would mitigate differences between running CI commands locally (like `yarn lint`, `yarn test`) and when they're run in CI.

I actually think pre-commit could address a lot of these. It'd speed up my workflow personally too. I do think debugging in Actions is much more difficult and time consuming. That being said, I'm concerned about situations where E2E, Linting, and Unit Tests are passing locally, but then do not pass in CI.
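The dev-container idea could start from something like the following; `ubuntu:jammy` is the tag mentioned above, but the rest (name, command) is illustrative, and a Node base image may be more practical in practice:

```json
{
  "name": "casei-dev",
  "image": "ubuntu:jammy",
  "postCreateCommand": "yarn install --frozen-lockfile"
}
```

Placed at `.devcontainer/devcontainer.json`, this would let `yarn lint` and `yarn test` run locally in an environment closer to the GitHub Actions runners.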
Idea Drop: Let's say we close an Issue, the feature is done and merged in. But it hasn't been deployed (to production) yet. Should we track that somehow within the Issue, or separately in our sprint planning? For example (as of writing this), Issue #511 is done, reviewed, and merged, but I'm still currently deploying it to prod.
Closing this issue with the following action items, which will be added to project mgmt docs:

- Local development environment for CASEI can more closely resemble GitHub Actions runners by using dev containers.
- Configure a threshold for Jest unit tests.
- Add pre-commit hooks for linting procedures in CI.
- Pin versions in `package.json`.
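For the Jest threshold action item, one common shape is `coverageThreshold` in the Jest config; the numbers below are illustrative, not agreed values:

```javascript
// jest.config.js (excerpt): fail the test run if coverage drops below these levels
module.exports = {
  coverageThreshold: {
    global: { branches: 70, functions: 70, lines: 75, statements: 75 },
  },
};
```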
Hey @edkeeble, did you have any concerns with pinning versions of dependencies?
The `^1.1.1` syntax does ensure that a package won't automatically update to a new MAJOR version, but it can update to minor or patch versions, which could themselves have dependencies on a different major version of another package.
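To make the concern concrete, here is a toy model of the caret rule for versions at or above 1.0.0. This is a simplified teaching sketch, not npm's actual semver implementation (which also handles `0.x` ranges and pre-release tags):

```javascript
// "^MAJOR.MINOR.PATCH" allows any version with the same MAJOR
// that is not lower than the base version.
function caretSatisfies(range, version) {
  const base = range.replace(/^\^/, "").split(".").map(Number);
  const v = version.split(".").map(Number);
  if (v[0] !== base[0]) return false; // major must match exactly
  if (v[1] !== base[1]) return v[1] > base[1]; // higher minor is allowed
  return v[2] >= base[2]; // same minor: patch must not go backwards
}

module.exports = { caretSatisfies };
```

So with `"^2.8.1"` in `package.json`, a fresh install (or a deleted lock file) may pull in `2.9.0`, whose own dependency tree can differ from the one tested against `2.8.1`.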
Post-mortem: CI/CD Pipeline Issue (5/11 - 5/22)
Summary
This post-mortem outlines a significant issue we encountered with our CI/CD pipeline between 5/11 and 5/22, which led to substantial disruption in our workflow. We'll discuss the solutions we've implemented to resolve this issue and how we plan to prevent similar incidents in the future.
Incident
Date
5/11 - 5/22
Duration
5 days (half sprint)
Impact
This incident caused a delay in feature deployment to the CASEI production site. The production deployment delay totaled around a half sprint (or five working days). As a result, multiple features completed in the past two sprints have yet to be deployed; as of writing, they are clear for deployment.
The incident also surfaced necessary configuration changes to the CI/CD pipeline that had been overlooked, as well as differences between local development, staging CI, and production CI. Finally, at least three team members spent time on CI instead of their own feature priorities.
Root Cause
The primary cause of the incident was a discrepancy between our CI/CD pipeline and local development environments, leading to inconsistencies in the test outcomes.
- `Lint and Test` passed (unit tests), but `Test E2E` (integration tests) was not passing. This went unnoticed, and we placed selective attention on passing tests.
- Code in `production` that was not in `develop`, evidenced in merge conflicts that arose. This almost resulted in a "merging from prod to develop" scenario, which should not happen in a feature branch deployment management scheme. (link to this issue)
- Overlooked configuration changes across the `deploy to staging`, `deploy to production`, and `e2e` testing YAML configurations.

Mitigation and Resolution
Short Term Mitigation
Long Term Solutions
Lessons Learned
What went well
Where we got lucky
Future Actions
Conversations