Open sastels opened 11 months ago
Completed the Terraform migration. Karpenter needs the same change, but our Karpenter version is completely out of date, so I will have to update Karpenter before proceeding.
Need to make changes to Karpenter and write up the steps for the migration. Also need to start debugging Fluent Bit with IMDSv2.
The helmfile migration steps were done yesterday and the PR is ready for review; I still need a reviewer on it. Moving on to Fluent Bit tuning.
Ben created the documentation yesterday. We still need to implement the actual changes.
Ben will supervise Pond on some of this implementation work in staging.
Maybe we can slot this into Thursday's group session.
The first steps were completed in staging during the group session. The next steps require production releases, so we'll do those on Monday.
Pond isn't in today, but Ben will proceed with the next steps.
Description
As an operator of GC Notify, I want our K8s nodes to run on IMDSv2 so that the system is free of this tech debt while also gaining security and reliability.
As a developer of Notify, I want our infrastructure to follow the latest AWS best practices. We are currently using version 1 of the Instance Metadata Service (IMDS), but we need to be on version 2 so that we are running the most secure configuration possible.
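For context on what changes between the two versions: IMDSv1 answers a plain GET, while IMDSv2 first requires a PUT that returns a session token, and every later GET must present that token. The sketch below illustrates that two-step flow with only the Python standard library. It is an illustration of the protocol, not the code that Fluent Bit or Karpenter actually run, and the 6-hour token TTL (21600 seconds, the maximum AWS allows) is an arbitrary choice.

```python
import urllib.request

# Link-local IMDS endpoint and header names as documented by AWS.
IMDS_BASE = "http://169.254.169.254"
TOKEN_PATH = "/latest/api/token"
TTL_HEADER = "X-aws-ec2-metadata-token-ttl-seconds"
TOKEN_HEADER = "X-aws-ec2-metadata-token"


def build_token_request(ttl_seconds: int = 21600) -> urllib.request.Request:
    """Step 1 of IMDSv2: a PUT request asking for a session token."""
    return urllib.request.Request(
        IMDS_BASE + TOKEN_PATH,
        method="PUT",
        headers={TTL_HEADER: str(ttl_seconds)},
    )


def build_metadata_request(path: str, token: str) -> urllib.request.Request:
    """Step 2 of IMDSv2: a GET request that presents the session token."""
    return urllib.request.Request(
        IMDS_BASE + "/latest/meta-data/" + path.lstrip("/"),
        method="GET",
        headers={TOKEN_HEADER: token},
    )


def fetch_metadata(path: str, timeout: float = 2.0) -> str:
    """Fetch one metadata item using the IMDSv2 token flow.

    Only works from an EC2 instance (or from a pod whose node allows
    enough network hops); anywhere else the requests time out.
    """
    with urllib.request.urlopen(build_token_request(), timeout=timeout) as resp:
        token = resp.read().decode()
    with urllib.request.urlopen(build_metadata_request(path, token), timeout=timeout) as resp:
        return resp.read().decode()
```

On a node, `fetch_metadata("instance-id")` would return the instance ID. Once a node requires tokens, a token-less GET (the IMDSv1 style) is rejected with HTTP 401, which is why every component that reads instance metadata has to move to this flow together.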
WHY are we building?
More security. Better maintenance of our system. Less friction with different components of our system.
AWS will be deprecating IMDSv1 in mid-2024, which will force us to move to IMDSv2. It would be best to migrate before then to ensure there are no issues during the migration.
This will increase security and has been recommended to us by the AWS TAM for the last two years or so.
WHAT are we building?
Our system initially used IMDSv2 with Karpenter, but the rest of the system was not compatible with it. We therefore fell back to version 1 with an explicit configuration, but we want to move to version 2 for the extra security.
- Creating an EC2 launch template with IMDSv2 enabled in Terraform and referencing it in the EKS cluster (Pat has been working on this).
- Modifying the Fluent Bit log pipelines to use IMDSv2 (this will need to be coordinated with the release of the above).
- Modifying Karpenter to either use IMDSv2 or the launch template from Terraform (this will also need to be coordinated with the above).
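The launch template change essentially comes down to the instance metadata options. As a minimal sketch, the settings are expressed here in Python using the field names of the EC2 `CreateLaunchTemplate` API's `MetadataOptions` structure; the Terraform `aws_launch_template` resource exposes the same settings as `http_endpoint`, `http_tokens` and `http_put_response_hop_limit` in its `metadata_options` block. The hop limit of 2 is an assumption based on AWS's general guidance for containerized workloads: with a limit of 1, the token response may not reach pods such as Fluent Bit that sit one network hop behind the node.

```python
from typing import Optional


def imdsv2_metadata_options(hop_limit: int = 2) -> dict:
    """Return launch-template MetadataOptions that enforce IMDSv2.

    hop_limit=2 is an assumption (AWS guidance for containers), not a
    value taken from our Terraform.
    """
    if not 1 <= hop_limit <= 64:  # range accepted by the EC2 API
        raise ValueError("HttpPutResponseHopLimit must be between 1 and 64")
    return {
        "HttpEndpoint": "enabled",   # keep IMDS reachable at all
        "HttpTokens": "required",    # reject token-less (IMDSv1) requests
        "HttpPutResponseHopLimit": hop_limit,
    }


def launch_template_data(base: Optional[dict] = None, hop_limit: int = 2) -> dict:
    """Merge the IMDSv2 options into an existing LaunchTemplateData dict."""
    data = dict(base or {})
    data["MetadataOptions"] = imdsv2_metadata_options(hop_limit)
    return data
```

Passing `launch_template_data(...)` as `LaunchTemplateData` to boto3's `ec2.create_launch_template` would create an equivalent template; in this project the same values belong in the Terraform definition, and Karpenter's node configuration would need matching metadata options if it does not reuse that template.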
VALUE created by our solution
More security. Better maintenance of our system. Less friction with different components of our system.
Acceptance Criteria
[x] Primary EKS nodes are using IMDSv2
[x] Karpenter is configured to use IMDSv2
[x] The application still works as expected, with all regression tests (smoke tests, unit tests, manual soak test) passing. Perform the rollouts and tests in dev, staging, and then production, in that order.
[x] The success metrics below are met
[x] Fluent Bit pipelines have been updated to IMDSv2
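One way to check the first two criteria is to inspect the metadata options EC2 reports for each node. The sketch below assumes DescribeInstances-shaped pages (as yielded by boto3's `get_paginator("describe_instances").paginate(...)`) and flags any instance whose `HttpTokens` is not `required`. How the pages are narrowed down to the cluster's nodes is left open and would need to match our own tagging.

```python
from typing import Iterable, List


def instances_allowing_imdsv1(pages: Iterable[dict]) -> List[str]:
    """Return IDs of instances that still accept token-less (IMDSv1) calls.

    Each page has the DescribeInstances response shape:
    {"Reservations": [{"Instances": [{"InstanceId": ..., "MetadataOptions": {...}}]}]}
    An instance with no MetadataOptions is treated as non-compliant.
    """
    offenders = []
    for page in pages:
        for reservation in page.get("Reservations", []):
            for instance in reservation.get("Instances", []):
                options = instance.get("MetadataOptions", {})
                if options.get("HttpTokens") != "required":
                    offenders.append(instance["InstanceId"])
    return offenders
```

For example, `instances_allowing_imdsv1(boto3.client("ec2").get_paginator("describe_instances").paginate(Filters=[{"Name": "instance-state-name", "Values": ["running"]}]))` should return an empty list once every running instance in the account and region enforces IMDSv2; a cluster-specific filter would narrow it to the EKS and Karpenter nodes.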
Measuring success and metrics
[x] Deployments of new nodes are 100% successful with no related errors in the logs.
[x] Deployments of new pods are 100% successful with no related errors in the logs.
[x] Spot instance deployments are 100% successful with no related errors in the logs.
[x] Fluent Bit is working for on-demand and spot instances; this is the component that previously broke in our incident.
QA Steps
Additional Information