aws-ia / terraform-aws-control_tower_account_factory

AWS Control Tower Account Factory
Apache License 2.0

Costly DynamoDB backup vault configuration #295

Closed smokentar closed 8 months ago

smokentar commented 1 year ago

AFT Version: 1.6.6

Terraform Version & Provider Versions: Not applicable

Bug Description

The configuration of the backup vault for the account hosting the vending pipeline is accruing unnecessary cost.

Config reference: aft-account-request-framework/backup.tf

This creates a backup vault aft-controltower-backup-vault and a backup plan aft_controltower_backup_rule. These are responsible for backing up the DynamoDB tables that trigger the account-vending pipeline.

The rule backs up the tables every hour; however, it doesn't set an expiration date for the recovery points. This results in recovery points piling up in the backup vault. Additionally, it increases the number of items in scope for AWS Config.

Because of this, AWS charges steadily increase every single day, which can grow into a beautiful bill if it goes unnoticed.

Expected behavior

The recovery points have an expiration date.

Additional context

Recovery points (screenshot)

Backup rule (screenshot)

AWS Config (screenshot)
smokentar commented 1 year ago

I'm currently deleting recovery points older than 10 days using the below script:

aws backup list-recovery-points-by-backup-vault --backup-vault-name aft-controltower-backup-vault --by-resource-type "DynamoDB" --by-created-before $(date -d "-10 days" +%Y-%m-%d) | jq -r '.RecoveryPoints[].RecoveryPointArn' | xargs -I {} aws backup delete-recovery-point --backup-vault-name aft-controltower-backup-vault --recovery-point-arn {}

This is very slow but it's better than nothing - I haven't found a way or API to delete in bulk and adding concurrency just makes things worse due to rate limiting (I presume).

adam-daily commented 1 year ago

Hey Dimitar, thanks for the detail; I'll add this info to our internal tracking for ways to reduce the cost of AFT.

rikturnbull commented 1 year ago

We also encountered an unusual spike in costs on 4 Jan 2023. AWS Config suddenly registered a change to all our recovery points and generated costs against the EU-ConfigurationItemRecorded usage type. We didn't make any changes. This is currently with AWS Support. The number of recovery points does seem a bit excessive; an automatic tidy-up would be great.

CalvinRodo commented 1 year ago

We had this happen to us as well, our config bill spiked by about 80 bucks yesterday when it's normally pennies a day.

Not quite sure why it spiked yesterday and not earlier.

CalvinRodo commented 1 year ago

I opened a PR, understanding that you are not accepting contributions to this product, but this is a big pain in the butt to fix, and we'd rather not fork your module and deal with managing drift while we wait for you to fix it. I'm just demonstrating how small a fix it is.

CalvinRodo commented 1 year ago

It would also be nice if the period were adjustable; I personally don't need a backup every hour.

Flydiverny commented 1 year ago

Interesting! We also recently saw a spike in EUN1-ConfigurationItemRecorded in our AFT account which we don't really understand either. Our spike occurred on Jan 11th and landed on $80.33 where typical daily charge is ~$0.30. Not sure if this is related to this issue tho? 🤔

Regarding the backup vault config, it does indeed seem bad to be ever-growing, but looking into our costs it seems to be a very small charge even with the current 27,000 recovery points. It should still be fixed tho 😄

Menahem1 commented 1 year ago

I also have the x10 billing increase on the AFT account for Config; based on the news, Recovery Point support for AWS Backup in AWS Config was activated in ...2021

o6uoq commented 1 year ago

I have the same issue. Bill spiked. Contacted AWS Support. I don't think they have context/understanding of AWS AFT for Terraform. He asked me to delete the recorder, which I've done via:

❯ aws configservice delete-configuration-recorder --configuration-recorder-name aws-controltower-BaselineConfigRecorder

o6uoq commented 1 year ago

I'm currently deleting recovery points older than 10 days using the below script:

aws backup list-recovery-points-by-backup-vault --backup-vault-name aft-controltower-backup-vault --by-resource-type "DynamoDB" --by-created-before $(date -d "-10 days" +%Y-%m-%d) | jq -r '.RecoveryPoints[].RecoveryPointArn' | xargs -I {} aws backup delete-recovery-point --backup-vault-name aft-controltower-backup-vault --recovery-point-arn {}

This didn't work for me. Something something to do with macOS and date. This is in a PoC environment so I just nuked all the recovery points by running:

for i in `aws backup list-recovery-points-by-backup-vault --backup-vault-name aft-controltower-backup-vault --by-resource-type "DynamoDB" | jq -r '.RecoveryPoints[].RecoveryPointArn'` ; do aws backup delete-recovery-point --backup-vault-name aft-controltower-backup-vault --recovery-point-arn $i ; done
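The macOS failure with the earlier one-liner is most likely because BSD `date` doesn't support `-d` the way GNU `date` does; BSD uses `-v` instead. A small portable sketch (the `cutoff_date` helper name is my own invention, not part of AFT or the AWS CLI):

```sh
#!/usr/bin/env bash
# Print the date N days ago as YYYY-MM-DD on both GNU (Linux) and BSD (macOS) date.
cutoff_date() {
  local days="$1"
  if date --version >/dev/null 2>&1; then
    date -d "-${days} days" +%Y-%m-%d   # GNU date supports --version and -d
  else
    date -v "-${days}d" +%Y-%m-%d       # BSD date uses -v for date adjustment
  fi
}

cutoff_date 10
```

With this, the `--by-created-before` argument in the deletion one-liner becomes `--by-created-before "$(cutoff_date 10)"` on either OS.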

HTH! 🙏🏼

CalvinRodo commented 1 year ago

@balltrev Any update on when this will be addressed? It's a pretty trivial fix to just put a limit on the number of backups. This would save us the hassle of having to go in regularly and delete the old entries.

morganrowse commented 1 year ago

Hi all,

A quick warning on another foot-gun introduced by the AFT team.

Please make sure that when you run the above scripts to delete the recovery points, you disable AWS Config recording, or you will be billed per deletion API operation. Our AWS Config cost to delete 18,000 recovery events from the backup vault came out to $58.

This is on top of the $20 a month for AWS config to record these DynamoDB backups.

Edit: We followed the guide linked below to stop AWS Config recording details about recovery events across all of our accounts. There are other ways to do this, but this is how we stopped backup vault recovery points from being tracked by the AWS Control Tower managed AWS Config.

Removing AWS::Backup::RecoveryPoint from the CloudFormation parameters disables recording of such deletion events, avoiding the potential of a $58 lesson brought to you by the AFT team.

https://aws.amazon.com/blogs/mt/customize-aws-config-resource-tracking-in-aws-control-tower-environment/

Here is a slightly better command with output so that you are sure it's deleting things.

for i in `aws backup list-recovery-points-by-backup-vault --backup-vault-name aft-controltower-backup-vault --by-resource-type "DynamoDB" | jq -r '.RecoveryPoints[].RecoveryPointArn'` ; do echo "Deleting ${i}"; aws backup delete-recovery-point --backup-vault-name aft-controltower-backup-vault --recovery-point-arn $i ; done

We did, however, find it faster to use the AWS Console to delete many at a time: set rows per page to 100, click the next page, select 100 more, and so on, then select the delete option. Doing it with the UI seems to delete 3 to 4 at a time, which is much faster than the suggested script (if you need things done quickly).

o6uoq commented 1 year ago

@morganrowse it might be helpful/valuable to add steps on how to disable AWS Config recording for AFT, and the subsequent steps, as a reference for those who are in similar situations and find this GitHub issue.

Sebelino commented 1 year ago

I ended up writing a script that helps me toggle the recorder on/off in the AFT management account. There is a service control policy managed by Control Tower which prevents stopping the recorder directly, so we have to temporarily detach the relevant SCP in the management account, then stop the recorder in the AFT management account, then re-attach the SCP in the management account.

#!/usr/bin/env bash

set -Eeuo pipefail

OU_NAME="Infrastructure"  # Organizational Unit of the AFT management account
MANAGEMENT_ACCOUNT_PROFILE="my-management-account-aws-profile"
TARGET_ACCOUNT_PROFILE="my-aft-management-account-aws-profile"

set -x

export AWS_PROFILE="$MANAGEMENT_ACCOUNT_PROFILE"
aws sso login

root_id="$(aws organizations list-roots | jq -r '.Roots[].Id')"
ou_id="$(aws organizations list-organizational-units-for-parent --parent-id "$root_id" | jq -r ".OrganizationalUnits[] | select(.Name == \"${OU_NAME}\").Id")"
ou_policies="$(aws organizations list-policies-for-target --filter SERVICE_CONTROL_POLICY --target-id "$ou_id" | jq -r '.Policies[].Id')"

set +e # Needed since we are checking exit code
while IFS= read -r policy_id ; do
    aws organizations describe-policy --policy-id "$policy_id" | jq -r '.Policy.Content' | grep -q "config:StopConfigurationRecorder"
    if [ $? -eq 0 ]; then
        sought_policy_id="$policy_id"
        break
    fi
done <<< "$ou_policies"
set -e

aws organizations detach-policy --policy-id "$sought_policy_id" --target-id "$ou_id"

export AWS_PROFILE="$TARGET_ACCOUNT_PROFILE"
aws sso login

recorder="$(aws configservice describe-configuration-recorders | jq -r '.ConfigurationRecorders[].name')"
is_recorder_enabled="$(aws configservice describe-configuration-recorder-status | jq '.ConfigurationRecordersStatus[].recording')"

if [ "$is_recorder_enabled" = "true" ]; then
    aws configservice stop-configuration-recorder --configuration-recorder-name "$recorder"
else
    aws configservice start-configuration-recorder --configuration-recorder-name "$recorder"
fi

export AWS_PROFILE="$MANAGEMENT_ACCOUNT_PROFILE"
aws sso login

aws organizations attach-policy --policy-id "$sought_policy_id" --target-id "$ou_id"

After toggling the recorder off, I was able to delete the recovery points per @morganrowse's suggestion without accumulating extra costs.

ahkai86 commented 11 months ago

I'm currently deleting recovery points older than 10 days using the below script:

aws backup list-recovery-points-by-backup-vault --backup-vault-name aft-controltower-backup-vault --by-resource-type "DynamoDB" --by-created-before $(date -d "-10 days" +%Y-%m-%d) | jq -r '.RecoveryPoints[].RecoveryPointArn' | xargs -I {} aws backup delete-recovery-point --backup-vault-name aft-controltower-backup-vault --recovery-point-arn {}

This is very slow but it's better than nothing - I haven't found a way or API to delete in bulk and adding concurrency just makes things worse due to rate limiting (I presume).

recovery_point_arns=$(aws backup list-recovery-points-by-backup-vault --backup-vault-name aft-controltower-backup-vault --by-resource-type "DynamoDB" --by-created-before $(date -d "-20 days" +%Y-%m-%d) | jq -r '.RecoveryPoints[].RecoveryPointArn')

for recovery_point_arn in $recovery_point_arns; do

    aws backup delete-recovery-point --backup-vault-name aft-controltower-backup-vault --recovery-point-arn $recovery_point_arn

done

I ran this bash script in the AWS CloudShell terminal and it works as well; though slow, it at least deleted the recovery points in the backup vault.

simon97k commented 11 months ago

Our AWS Config bill also spiked due to this. Interestingly, it also happened when we reached about 27k backups, similar to @Flydiverny.

Did anyone else have a spike in Config costs when this number of backups was reached?

PeterBengtson commented 11 months ago

We also have 27000 backup points. This surely must be a bug.


mikeantonelli commented 10 months ago

This issue occurred for us as well, and I caught it because of a billing alert that showed a charge for ~$75. In AWS Cost Explorer we have hourly granularity, and all charges were associated with AWS Config in the AFT Management account.

In my case, `Configuration.VaultType` was added for all AWS Backup recovery point resources being monitored by AWS Config. If this project's defaults are left in place, that is 96 recovery points per day and ~35,000 per year. In us-east-2, each continuous configuration item delivered costs $0.003. This totals $105 for an attribute change to a year's worth of DynamoDB backups for this project. The recommendations in this thread to change your retention policy or backup frequency should be considered with this in mind.
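As a back-of-the-envelope check of those figures: 96 recovery points per day works out to four DynamoDB tables backed up hourly (the table count is my inference from the numbers, not something stated here):

```sh
#!/usr/bin/env bash
# Sanity-check the figures above: 96 recovery points/day implies four tables
# backed up hourly (inferred, not confirmed), at $0.003 per configuration
# item recorded by AWS Config in us-east-2.
tables=4
per_day=$((tables * 24))        # recovery points created per day
per_year=$((per_day * 365))     # recovery points created per year
yearly_cost=$(awk "BEGIN { printf \"%.2f\", $per_year * 0.003 }")
echo "${per_day}/day, ${per_year}/year, ~\$${yearly_cost}/year in Config items"
```

This lines up with the ~35,000 recovery points per year and the ~$105 figure quoted above.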

Looking at the documentation history for AWS Backup, my hypothesis is that the team responsible for AWS Backup pushed a change that caught my account and caused all my historical backups to update this attribute. I hope the team responsible for this project considers adding input attributes supporting customization of the retention and frequency of backups.

In case this happens to others I'll include more details below on how I found that this new attribute was added -- I'd be interested if others saw the same in their environments and/or if other dates in the AWS Backup documentation history correlate to experiences of others.

Investigation Details

In the AFT Management account, the AWS Config dashboard showed a spike in Configuration Items Recorded on 2024-01-10 between 19:00 and 23:00 UTC. The total number of items was ~25,000, and we have a similar number of AWS Backup recovery points in the aft-controltower-backup-vault. Reviewing the AWS Backup dashboard, there was no change to the number of jobs over the last week, so I decided to take a look at the AWS Config history. A Configuration Item is recorded each time a resource is modified, so the number of items being so close to the number of recovery points made me think a change had been pushed for the entire AWS Backup vault.

In the log-archive account, there is an AWS S3 bucket with AWS Control Tower logs - it follows the convention of aws-controltower-logs-xxx-yyy, where xxx is the log-archive account id and yyy is the region of interest. If you navigate into the directory structure, you'll discover a set of AWS Config history logs:

s3://aws-controltower-logs-``-``/``/AWSLogs/``/Config/``/

After review, I first noticed that the file sizes were quite large (> 1 MB) compared to previous days (< 5 KB), confirming the influx of data. After granting my user permissions to the KMS key used to encrypt the log files, I was able to download, decompress, and investigate with `jq`. There were several chunks available; I picked only the largest log file.

First, I wanted to confirm that activity in the logs matched the timeline I saw in Cost Explorer:

```sh
jq '[.configurationItems | map({hour: .configurationItemCaptureTime[11:13]}) | group_by(.hour)[] | {hour: .[0].hour, count: length}]' sample.json
```

I wanted to see a few of the `CreationDate` values to check whether I was getting updates to old AWS Backup recovery points:

```sh
jq '.configurationItems | map(.configuration.CreationDate)' sample.json
```

In order to view a consumable amount of data, I piped data from the start of my influx (19:55 UTC) into a file I could pretty-print and view in an editor:

```sh
jq '.configurationItems | map(select(.configurationItemCaptureTime | startswith("2024-01-10T19:55")))' sample.json >> output-1955.json
```

Looking at this file, I saw that the `CreationDate` in the configuration was from months ago, but none of the attributes jumped out at me, so I grabbed the `resourceId` and went back to the AWS Console in the AFT Management account. If you navigate to AWS Config > Resources, you can put the `resourceId` into the Resource Identifier form field. Note, a few times I got the response "No resources pass your filter", and even saw "AWS Config is currently experiencing unusually high traffic. Try your request again or contact AWS support." I confirmed the resource's existence in AWS Backup, refreshed a few times, scoped `resourceType` to "AWS::Backup::RecoveryPoint", and even toggled "Include deleted resources". After some time, I saw my result. Select it and click the Resource Timeline button in the top right. Here I could see that `VaultType` was added to `Configuration`, and in reviewing several different `resourceId` values I found that something seemed to have triggered a single attribute update for all existing recovery points. At $0.003 per update, for ~25,000 recovery points, this totaled the near-$75 charge I saw in AWS Cost Explorer.

Looking at the documentation history for AWS Backup, there was a documentation update on January 10th, so I'm assuming that update triggered a whole bunch of charges for people using AWS Config and AWS Backup, either with or without this project.
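The hour-bucketing step in that investigation can also be sketched without `jq`, for environments where it isn't installed (the timestamps here are invented for illustration):

```sh
#!/usr/bin/env bash
# Count configuration items per capture hour: extract the HH field (chars
# 12-13 of an ISO-8601 timestamp) and tally with sort | uniq -c.
times='2024-01-10T19:55:01Z
2024-01-10T19:58:12Z
2024-01-10T20:03:44Z'
printf '%s\n' "$times" | cut -c12-13 | sort | uniq -c
```

For the sample above this prints a count of 2 for hour 19 and 1 for hour 20, mirroring the `group_by(.hour)` output shape.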
hadesbox commented 9 months ago

Hi @mikeantonelli

Thanks for your detailed investigation; we had the same problem on 09/01/2024.

(screenshot)

The AWS Backup team added this attribute to AWS::Backup::RecoveryPoint, and we had 40,200 recovery points re-evaluated. Since there is no retention on the recovery points, this caused a spike in the billing that some managers were really concerned about.

My suggestion is that this is fixed ASAP, first of all by adding a retention to the backup plan deployed by AFT in the following resource:

backup.tf

resource "aws_backup_plan" "aft_controltower_backup_plan" {
  name = "aft-controltower-backup-plan"
  rule {
    rule_name         = "aft_controltower_backup_rule"
    target_vault_name = aws_backup_vault.aft_controltower_backup_vault.name
    schedule          = "cron(0 * * * ? *)"
  }
}

by adding the lifecycle attribute

resource "aws_backup_plan" "aft_controltower_backup_plan" {
  name = "aft-controltower-backup-plan"
  rule {
    rule_name         = "aft_controltower_backup_rule"
    target_vault_name = aws_backup_vault.aft_controltower_backup_vault.name
    schedule          = "cron(0 * * * ? *)"

    lifecycle {
      delete_after = 14
    }

  }
}

This caused a spike of $100 for Config rule evaluations for that day.

hadesbox commented 9 months ago

If the lifecycle is added to /modules/aft-account-request-framework/backup.tf, this should trigger the automatic deletion of all older backups when the next AFT update comes, so it should be noted in the changelog/release notes of the next AFT version... and will probably cause a spike in Config again.

Maybe 14 days of retention is too low; there should be a default value in AFT for this that can be configured externally to aft-account-request-framework, so each AWS customer can set it according to their needs.

PeterBengtson commented 9 months ago

Has this been fixed? It's a pretty scandalous bug, as it incurs high costs without warning.

Sanjan611 commented 8 months ago

Hey everyone, we've added a feature to configure the Backup recovery point retention period in the latest release of AFT!

https://github.com/aws-ia/terraform-aws-control_tower_account_factory/releases/tag/1.12.0