department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

[DISCOVERY][Packer] Amazon CloudWatch Agent Implementation - Policies and Roles #85703

Open hgbarreto opened 3 weeks ago

hgbarreto commented 3 weeks ago

Description

To resolve the outstanding issue of EKS node volumes running out of space, we would like to implement the Amazon CloudWatch agent to collect logs and send them to CloudWatch instead of storing them on the node itself.

Need to add new policies to current and new roles:

Possible solutions:

Resources

Acceptance Criteria

Refinement Guidance - Check the following before working on this issue:

Efe-Oddball commented 1 week ago

I am going through the GitHub comments in the closed PR and the Slack discussions about this topic to understand what needs to be done.

Efe-Oddball commented 1 week ago

The two images that need to be taken into account here are:

  1. EKS node image
  2. Al2-hardened

Approach_1

Approach_2

Approach_3

Efe-Oddball commented 1 week ago

Updated some of the EC2 roles with CloudWatch policies. I will test functionality next week before updating the code and creating a PR.

Efe-Oddball commented 5 days ago

I have updated all EC2 IAM profile roles within the AWS console. This includes the roles connected to the legacy forward proxy, the new reverse proxy, and the new forward proxy in dev, updated with the "ec2:describetags" policy. It also includes adding the CloudWatch logging policy to the EKS node role IAM profile for all tiers. I am now running the CloudWatch scripts to set up the CloudWatch agent and will then test functionality.
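For reference, a minimal sketch of what a policy document along these lines might look like, built as plain JSON in Python. The helper name, the statement IDs, and the exact action list are assumptions modeled loosely on the AWS managed policy CloudWatchAgentServerPolicy; the ticket does not state which actions were actually attached.

```python
import json


def build_cloudwatch_agent_policy() -> dict:
    """Return an illustrative IAM policy document for an EC2 instance role.

    Hypothetical helper: the action list below is an assumption, not the
    policy that was applied in the console for this ticket.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the agent create log groups/streams and write events.
                "Sid": "AgentLogging",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                    "logs:DescribeLogGroups",
                    "logs:DescribeLogStreams",
                ],
                "Resource": "*",
            },
            {
                # Lets the agent read the instance's EC2 tags.
                "Sid": "AgentDescribeTags",
                "Effect": "Allow",
                "Action": ["ec2:DescribeTags"],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    # Print the document so it can be reviewed or pasted elsewhere.
    print(json.dumps(build_cloudwatch_agent_policy(), indent=2))
```

Keeping the logging and tag-lookup permissions in separate statements makes it easier to attach only the part a given role needs, for example only the tag lookup on the proxy roles.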

Efe-Oddball commented 5 days ago

I am also working on updating the Terraform code to reflect all the changes I made in the console.

Efe-Oddball commented 3 days ago

I am working on creating a cluster for the updated EKS image with the CloudWatch agent installed. I will then deploy nodes to the cluster and confirm that the logs are going to CloudWatch correctly. Most of the Terraform code that provisions the policies has been updated. This ticket will also have to roll over to the next sprint.
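As a rough illustration of the agent side of this, the sketch below generates a CloudWatch agent configuration with a single log file entry. The helper name, the file path, and the log group name are hypothetical placeholders, not values from the actual image; only the overall `logs` / `logs_collected` / `files` / `collect_list` layout and the `{instance_id}` placeholder follow the agent's documented configuration format.

```python
import json


def build_agent_log_config(file_path: str, log_group: str) -> dict:
    """Return a CloudWatch agent config that ships one log file.

    Hypothetical helper: the caller supplies the path and log group,
    since the ticket does not say which files will be collected.
    """
    return {
        "logs": {
            "logs_collected": {
                "files": {
                    "collect_list": [
                        {
                            "file_path": file_path,
                            "log_group_name": log_group,
                            # The agent substitutes the EC2 instance ID here.
                            "log_stream_name": "{instance_id}",
                        }
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    # Example values only; both are assumptions for illustration.
    config = build_agent_log_config("/var/log/messages", "/example/eks/nodes")
    print(json.dumps(config, indent=2))
```

Using the instance ID as the stream name gives one stream per node, which should make it straightforward to confirm that a newly deployed node is sending logs.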