crossplane-contrib / provider-aws

Crossplane AWS Provider
Apache License 2.0
441 stars 375 forks source link

Unnecessary update requests of rdsinstances #960

Closed adamkirk closed 2 years ago

adamkirk commented 3 years ago

What happened?

After updating to crossplane 1.5.0, and provider-aws 0.21.0, I could see in logs (when i enable --debug on this provider) that update requests were being sent every time my rdsinstances were reconciled. I looked into cloudtrail to see what was being updated and could see that the requestParameters for the ModifyDBInstance calls being made by my crossplane user looked like this:

"requestParameters": {
    "allowMajorVersionUpgrade": false,
    "dBInstanceIdentifier": "xxxxxxxx",
    "multiAZ": false,
    "applyImmediately": false,
    "publiclyAccessible": false
},

The thing that stood out to me was that the parameters being sent in the update request were all booleans that were set to false, and nothing else (other than the id). I checked these values against my rds instances using the aws rds describe-db-instances --db-instance-identifier xxxxxxx command and could see that all of these values in the request were already set to false on the instances in AWS; so these ModifyDBInstance requests shouldn't be getting sent at all as far i can tell.

This doesn't seem to cause any problems in the rdsinstance itself, but the constant update requests seemed a bit worrying.

How can we reproduce it?

I think this can be recreated by creating an rdsinstance and setting publiclyAccesible or multiAZ to false. It's unclear exactly which fields this may happen with.

This is the spec i'm using for the rdsinstance (with some fields redacted):


spec:
  deletionPolicy: Orphan
  forProvider:
    allocatedStorage: 100
    allowMajorVersionUpgrade: false
    autoMinorVersionUpgrade: true
    availabilityZone: eu-west-2b
    backupRetentionPeriod: 1
    caCertificateIdentifier: rds-ca-2019
    dbInstanceClass: db.t2.micro
    dbName: xxxxxxxx
    dbParameterGroupName: default.mysql5.7
    dbSubnetGroupName: dbsubnet-group-name
    deletionProtection: true
    enablePerformanceInsights: false
    engine: mysql
    engineVersion: 5.7.33
    licenseModel: general-public-license
    masterUsername: xxxxxxxx
    multiAZ: false
    port: 3306
    preferredBackupWindow: 03:00-04:00
    preferredMaintenanceWindow: thu:00:26-thu:00:56
    publiclyAccessible: false
    region: eu-west-2
    skipFinalSnapshotBeforeDeletion: false
    storageEncrypted: true
    storageType: gp2
    tags:
    - key: crossplane-kind
      value: rdsinstance.database.aws.crossplane.io
    - key: crossplane-name
      value: xxxxxxxxxxx
    - key: crossplane-providerconfig
      value: default
    - key: myorg.tag1
      value: xxxxxxxx
    - key: myorg.tag2
      value: xxxxxxxx
    - key: myorg.tag3
      value: xxxxxxxx
    vpcSecurityGroupIds:
    - sg-xxxxxxxxx
  providerConfigRef:
    name: default
  writeConnectionSecretToRef:
    name: xxxxxxxxxxxxx
    namespace: crossplane

What environment did it happen in?

Crossplane version: 1.5.0 Kubernetes v1.21.2 EKS v1.21.2-eks-06eac09

Extra notes

I did some investigation into the provider myself, and it seems that there is a bug in the IsUpToDate function of the rds client. When the rdsinstance has a boolean setting that is false (only tested with the PubliclyAccessible field), and the CRD also has that setting equal to false, it returns that the external is NOT up to date when it should be.

It looks like this may be due to how booleans can be returned as a nil pointer when they're false rather than a pointer to a false value. See func Bool in pkg/clients/aws.go.

I've opened a draft PR with some tests that seem to highlight the problems with the IsUpToDate and CreatePatch functions in the rds client. Hopefully this helps: https://github.com/crossplane/provider-aws/pull/959

haarchri commented 3 years ago

I wonder if we have the same Issue in dynamo db like https://github.com/crossplane/provider-aws/issues/953

eloo commented 3 years ago

Maybe #909 is also affected by this bug?

negz commented 3 years ago

I don't have bandwidth to look into this just now, but I wanted to thank you for the high quality issue, complete with failing tests @adamkirk!

chlunde commented 3 years ago

yeah, I see the same thing:

crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws 2021/11/23 20:59:08 recovery-test-6hpgd-9pzhn   &v1beta1.RDSInstanceParameters{
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      ... // 1 ignored field
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      AllocatedStorage:        nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws -    AutoMinorVersionUpgrade: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws +    AutoMinorVersionUpgrade: &false,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      AvailabilityZone:        nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      BackupRetentionPeriod:   nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      ... // 2 ignored and 8 identical fields
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws -    DeletionProtection:              nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws +    DeletionProtection:              &false,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      EnableCloudwatchLogsExports:     nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      EnableIAMDatabaseAuthentication: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      ... // 3 ignored and 16 identical fields
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      ProcessorFeatures:    nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      PromotionTier:        nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws -    PubliclyAccessible:   nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws +    PubliclyAccessible:   &false,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      ScalingConfiguration: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      StorageEncrypted:     nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws      ... // 10 ignored and 8 identical fields
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws   }

We're running with a patch like this at the moment:

iff --git a/pkg/clients/rds/rds.go b/pkg/clients/rds/rds.go
index a8248571..e435a106 100644
--- a/pkg/clients/rds/rds.go
+++ b/pkg/clients/rds/rds.go
@@ -19,6 +19,7 @@ package rds
 import (
        "context"
        "encoding/json"
+       "log"
        "strconv"
        "strings"

@@ -452,7 +453,7 @@ func IsUpToDate(ctx context.Context, kube client.Client, r *v1beta1.RDSInstance,
        if err != nil {
                return false, err
        }
-       return cmp.Equal(&v1beta1.RDSInstanceParameters{}, patch, cmpopts.EquateEmpty(),
+       diff := cmp.Diff(&v1beta1.RDSInstanceParameters{}, patch, cmpopts.EquateEmpty(),
                cmpopts.IgnoreTypes(&xpv1.Reference{}, &xpv1.Selector{}, []xpv1.Reference{}),
                cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "Region"),
                cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "Tags"),
@@ -461,7 +462,12 @@ func IsUpToDate(ctx context.Context, kube client.Client, r *v1beta1.RDSInstance,
                cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "ApplyModificationsImmediately"),
                cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "AllowMajorVersionUpgrade"),
                cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "MasterPasswordSecretRef"),
-       ) && !pwdChanged, nil
+       )
+       if diff != "" {
+               log.Println(r.ObjectMeta.Name, diff)
+       }
+
+       return diff == "" && !pwdChanged, nil
 }

 // GetPassword fetches the referenced input password for an RDSInstance CRD and determines whether it has changed or not
chlunde commented 3 years ago

So I think this patch:

func LateInitialize(in *v1beta1.RDSInstanceParameters,...
-       in.BackupRetentionPeriod = awsclients.LateInitializeIntPtr(in.BackupRetentionPeriod, db.BackupRetentionPeriod)
+       in.BackupRetentionPeriod = awsclients.LateInitializeIntFrom32Ptr(in.BackupRetentionPeriod, &db.BackupRetentionPeriod)
        in.CACertificateIdentifier = awsclients.LateInitializeStringPtr(in.CACertificateIdentifier, db.CACertificateIdentifier)
        in.CharacterSetName = awsclients.LateInitializeStringPtr(in.CharacterSetName, db.CharacterSetName)
-       in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, db.CopyTagsToSnapshot)
+       in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, awsclients.Bool(db.CopyTagsToSnapshot))
...

caused the issue during the SDK upgrade. Trying out a patch tonight:

commit 04827b0049aee68152c971d2bcdc17d4b098625e (HEAD -> release-0.20-sb1-3)
Author: Carl Henrik Lunde <chlunde@ifi.uio.no>
Date:   Tue Nov 23 22:20:15 2021 +0100

    rds.rdsinstance: Compare bool correctly via LateInitialize

    False booleans were converted to nil after the SDK upgrade. Ensure false
    is represented as a pointer to false instead by settings
    awsclients.FieldRequired.

    Fixes #960

    Signed-off-by: Carl Henrik Lunde <chlunde@ifi.uio.no>

diff --git a/pkg/clients/rds/rds.go b/pkg/clients/rds/rds.go
index b1221733..2f85e612 100644
--- a/pkg/clients/rds/rds.go
+++ b/pkg/clients/rds/rds.go
@@ -419,16 +419,16 @@ func LateInitialize(in *v1beta1.RDSInstanceParameters, db *rdstypes.DBInstance)
        in.Engine = awsclients.LateInitializeString(in.Engine, db.Engine)

        in.AllocatedStorage = awsclients.LateInitializeIntFrom32Ptr(in.AllocatedStorage, &db.AllocatedStorage)
-       in.AutoMinorVersionUpgrade = awsclients.LateInitializeBoolPtr(in.AutoMinorVersionUpgrade, awsclients.Bool(db.AutoMinorVersionUpgrade))
+       in.AutoMinorVersionUpgrade = awsclients.LateInitializeBoolPtr(in.AutoMinorVersionUpgrade, awsclients.Bool(db.AutoMinorVersionUpgrade, awsclients.FieldRequired))
        in.AvailabilityZone = awsclients.LateInitializeStringPtr(in.AvailabilityZone, db.AvailabilityZone)
        in.BackupRetentionPeriod = awsclients.LateInitializeIntFrom32Ptr(in.BackupRetentionPeriod, &db.BackupRetentionPeriod)
        in.CACertificateIdentifier = awsclients.LateInitializeStringPtr(in.CACertificateIdentifier, db.CACertificateIdentifier)
        in.CharacterSetName = awsclients.LateInitializeStringPtr(in.CharacterSetName, db.CharacterSetName)
-       in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, awsclients.Bool(db.CopyTagsToSnapshot))
+       in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, awsclients.Bool(db.CopyTagsToSnapshot, awsclients.FieldRequired))
        in.DBClusterIdentifier = awsclients.LateInitializeStringPtr(in.DBClusterIdentifier, db.DBClusterIdentifier)
        in.DBName = awsclients.LateInitializeStringPtr(in.DBName, db.DBName)
-       in.DeletionProtection = awsclients.LateInitializeBoolPtr(in.DeletionProtection, awsclients.Bool(db.DeletionProtection))
-       in.EnableIAMDatabaseAuthentication = awsclients.LateInitializeBoolPtr(in.EnableIAMDatabaseAuthentication, awsclients.Bool(db.IAMDatabaseAuthenticationEnabled))
+       in.DeletionProtection = awsclients.LateInitializeBoolPtr(in.DeletionProtection, awsclients.Bool(db.DeletionProtection, awsclients.FieldRequired))
+       in.EnableIAMDatabaseAuthentication = awsclients.LateInitializeBoolPtr(in.EnableIAMDatabaseAuthentication, awsclients.Bool(db.IAMDatabaseAuthenticationEnabled, awsclients.FieldRequired))
        in.EnablePerformanceInsights = awsclients.LateInitializeBoolPtr(in.EnablePerformanceInsights, db.PerformanceInsightsEnabled)
        in.IOPS = awsclients.LateInitializeIntFrom32Ptr(in.IOPS, db.Iops)
        in.KMSKeyID = awsclients.LateInitializeStringPtr(in.KMSKeyID, db.KmsKeyId)
@@ -437,14 +437,14 @@ func LateInitialize(in *v1beta1.RDSInstanceParameters, db *rdstypes.DBInstance)
        in.MaxAllocatedStorage = awsclients.LateInitializeIntFrom32Ptr(in.MaxAllocatedStorage, db.MaxAllocatedStorage)
        in.MonitoringInterval = awsclients.LateInitializeIntFrom32Ptr(in.MonitoringInterval, db.MonitoringInterval)
        in.MonitoringRoleARN = awsclients.LateInitializeStringPtr(in.MonitoringRoleARN, db.MonitoringRoleArn)
-       in.MultiAZ = awsclients.LateInitializeBoolPtr(in.MultiAZ, awsclients.Bool(db.MultiAZ))
+       in.MultiAZ = awsclients.LateInitializeBoolPtr(in.MultiAZ, awsclients.Bool(db.MultiAZ, awsclients.FieldRequired))
        in.PerformanceInsightsKMSKeyID = awsclients.LateInitializeStringPtr(in.PerformanceInsightsKMSKeyID, db.PerformanceInsightsKMSKeyId)
        in.PerformanceInsightsRetentionPeriod = awsclients.LateInitializeIntFrom32Ptr(in.PerformanceInsightsRetentionPeriod, db.PerformanceInsightsRetentionPeriod)
        in.PreferredBackupWindow = awsclients.LateInitializeStringPtr(in.PreferredBackupWindow, db.PreferredBackupWindow)
        in.PreferredMaintenanceWindow = awsclients.LateInitializeStringPtr(in.PreferredMaintenanceWindow, db.PreferredMaintenanceWindow)
        in.PromotionTier = awsclients.LateInitializeIntFrom32Ptr(in.PromotionTier, db.PromotionTier)
-       in.PubliclyAccessible = awsclients.LateInitializeBoolPtr(in.PubliclyAccessible, awsclients.Bool(db.PubliclyAccessible))
-       in.StorageEncrypted = awsclients.LateInitializeBoolPtr(in.StorageEncrypted, awsclients.Bool(db.StorageEncrypted))
+       in.PubliclyAccessible = awsclients.LateInitializeBoolPtr(in.PubliclyAccessible, awsclients.Bool(db.PubliclyAccessible, awsclients.FieldRequired))
+       in.StorageEncrypted = awsclients.LateInitializeBoolPtr(in.StorageEncrypted, awsclients.Bool(db.StorageEncrypted, awsclients.FieldRequired))
        in.StorageType = awsclients.LateInitializeStringPtr(in.StorageType, db.StorageType)
        in.Timezone = awsclients.LateInitializeStringPtr(in.Timezone, db.Timezone)
chlunde commented 3 years ago

apparently not that trivial, tests now fail like this, and the tests were not changed for SDK v2. Too late for me too look more into this at this moment, so feel free to take a look someone else.

=== RUN   TestLateInitialize/ProcessorFeaturesNotOverwritten
    rds_test.go:1009: r: -want, +got:
          v1beta1.RDSInstanceParameters{
            Region:                  nil,
            AllocatedStorage:        nil,
        -   AutoMinorVersionUpgrade: nil,
        +   AutoMinorVersionUpgrade: &false,
            AvailabilityZone:        nil,
            BackupRetentionPeriod:   nil,
            CACertificateIdentifier: nil,
            CharacterSetName:        nil,
        -   CopyTagsToSnapshot:      nil,
        +   CopyTagsToSnapshot:      &false,
            DBClusterIdentifier:     nil,
            DBInstanceClass:         "",
            ... // 3 identical fields
            DBSubnetGroupNameRef:            nil,
            DBSubnetGroupNameSelector:       nil,
        -   DeletionProtection:              nil,
        +   DeletionProtection:              &false,
            EnableCloudwatchLogsExports:     nil,
        -   EnableIAMDatabaseAuthentication: nil,
        +   EnableIAMDatabaseAuthentication: &false,
            EnablePerformanceInsights:       nil,
            Engine:                          "",
            ... // 8 identical fields
            MonitoringRoleARNRef:               nil,
            MonitoringRoleARNSelector:          nil,
        -   MultiAZ:                            nil,
        +   MultiAZ:                            &false,
            PerformanceInsightsKMSKeyID:        nil,
            PerformanceInsightsRetentionPeriod: nil,
            ... // 3 identical fields
            ProcessorFeatures:    {{Name: "existing", Value: "existing"}},
            PromotionTier:        nil,
        -   PubliclyAccessible:   nil,
        +   PubliclyAccessible:   &false,
            ScalingConfiguration: nil,
        -   StorageEncrypted:     nil,
        +   StorageEncrypted:     &false,
            StorageType:          nil,
            Tags:                 nil,
            ... // 16 identical fields
          }
adamkirk commented 3 years ago

Maybe #909 is also affected by this bug?

Yeah I've just tried to spin up some new replicationgroups and they never become ready. Not sure how related it is to this, but I think these issues are mostly happening since the change from aws sdk v1 to v2, which was quite a big change so it seems to make sense.

I've reverted the provider back to 0.19.0 at this point (last known stable version for us).

muvaf commented 3 years ago

Thank you all for debugging here!

It seems like SDK started to use bool instead of *bool for some fields but not for all of them and we accounted for that change by taking address of such fields via awsclients.Bool(bool) *bool, which returns nil for false values. However, the correct thing to do would be as implemented in https://github.com/crossplane/provider-aws/pull/971 because if SDK returns false, it's simply false and that's how it should be treated instead of nil.

@eloo In replication group code, I couldn't see such a change. Maybe automaticFailoverEnabled(af elasticachetypes.AutomaticFailoverStatus) *bool could be a good candidate to check, i.e. return &false instead of nil depending on what it's compared against.

adamkirk commented 3 years ago

Thank you all for debugging here!

It seems like SDK started to use bool instead of *bool for some fields but not for all of them and we accounted for that change by taking address of such fields via awsclients.Bool(bool) *bool, which returns nil for false values. However, the correct thing to do would be as implemented in #971 because if SDK returns false, it's simply false and that's how it should be treated instead of nil.

This is what I've been thinking, I've tried all sorts of things tonight and it seems every time i fix one scenario it breaks another :')

I did try the fix from that PR earlier during my experiments, and saw the same test failures @chlunde mentioned in this comment. But I do feel like the tests can just be updated to reflect the new 'default' values for these boolean fields; it feels like they should default to (*bool)(false) as that reflects the aws sdk.

I'm not sure what side affects it may have if we change those values from (*bool)(nil) to (*bool)(false), but I'd be happy to try out a fix on our staging server to be fair. I'll try find some time tomorrow to deploy a patch and see if anything blows up :')

eloo commented 3 years ago

@muvaf i'm not still not what the real root cause is but i have teared down the problem to the following points:

further its nearly impossible to debug this without changing the code. so it would be great if every resource could log the reason of an update, like the diff we see in iamrole.