Closed adamkirk closed 2 years ago
I wonder if we have the same Issue in dynamo db like https://github.com/crossplane/provider-aws/issues/953
Maybe #909 is also affected by this bug?
I don't have bandwidth to look into this just now, but I wanted to thank you for the high quality issue, complete with failing tests @adamkirk!
yeah, I see the same thing:
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws 2021/11/23 20:59:08 recovery-test-6hpgd-9pzhn &v1beta1.RDSInstanceParameters{
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws ... // 1 ignored field
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws AllocatedStorage: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws - AutoMinorVersionUpgrade: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws + AutoMinorVersionUpgrade: &false,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws AvailabilityZone: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws BackupRetentionPeriod: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws ... // 2 ignored and 8 identical fields
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws - DeletionProtection: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws + DeletionProtection: &false,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws EnableCloudwatchLogsExports: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws EnableIAMDatabaseAuthentication: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws ... // 3 ignored and 16 identical fields
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws ProcessorFeatures: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws PromotionTier: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws - PubliclyAccessible: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws + PubliclyAccessible: &false,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws ScalingConfiguration: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws StorageEncrypted: nil,
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws ... // 10 ignored and 8 identical fields
crossplane-provider-aws-37927c5027e5-659877bbfc-z2gxk provider-aws }
We're running with a patch like this at the moment:
iff --git a/pkg/clients/rds/rds.go b/pkg/clients/rds/rds.go
index a8248571..e435a106 100644
--- a/pkg/clients/rds/rds.go
+++ b/pkg/clients/rds/rds.go
@@ -19,6 +19,7 @@ package rds
import (
"context"
"encoding/json"
+ "log"
"strconv"
"strings"
@@ -452,7 +453,7 @@ func IsUpToDate(ctx context.Context, kube client.Client, r *v1beta1.RDSInstance,
if err != nil {
return false, err
}
- return cmp.Equal(&v1beta1.RDSInstanceParameters{}, patch, cmpopts.EquateEmpty(),
+ diff := cmp.Diff(&v1beta1.RDSInstanceParameters{}, patch, cmpopts.EquateEmpty(),
cmpopts.IgnoreTypes(&xpv1.Reference{}, &xpv1.Selector{}, []xpv1.Reference{}),
cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "Region"),
cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "Tags"),
@@ -461,7 +462,12 @@ func IsUpToDate(ctx context.Context, kube client.Client, r *v1beta1.RDSInstance,
cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "ApplyModificationsImmediately"),
cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "AllowMajorVersionUpgrade"),
cmpopts.IgnoreFields(v1beta1.RDSInstanceParameters{}, "MasterPasswordSecretRef"),
- ) && !pwdChanged, nil
+ )
+ if diff != "" {
+ log.Println(r.ObjectMeta.Name, diff)
+ }
+
+ return diff == "" && !pwdChanged, nil
}
// GetPassword fetches the referenced input password for an RDSInstance CRD and determines whether it has changed or not
So I think this patch:
func LateInitialize(in *v1beta1.RDSInstanceParameters,...
- in.BackupRetentionPeriod = awsclients.LateInitializeIntPtr(in.BackupRetentionPeriod, db.BackupRetentionPeriod)
+ in.BackupRetentionPeriod = awsclients.LateInitializeIntFrom32Ptr(in.BackupRetentionPeriod, &db.BackupRetentionPeriod)
in.CACertificateIdentifier = awsclients.LateInitializeStringPtr(in.CACertificateIdentifier, db.CACertificateIdentifier)
in.CharacterSetName = awsclients.LateInitializeStringPtr(in.CharacterSetName, db.CharacterSetName)
- in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, db.CopyTagsToSnapshot)
+ in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, awsclients.Bool(db.CopyTagsToSnapshot))
...
caused the issue during the SDK upgrade. Trying out a patch tonight:
commit 04827b0049aee68152c971d2bcdc17d4b098625e (HEAD -> release-0.20-sb1-3)
Author: Carl Henrik Lunde <chlunde@ifi.uio.no>
Date: Tue Nov 23 22:20:15 2021 +0100
rds.rdsinstance: Compare bool correctly via LateInitialize
False booleans were converted to nil after the SDK upgrade. Ensure false
is represented as a pointer to false instead by settings
awsclients.FieldRequired.
Fixes #960
Signed-off-by: Carl Henrik Lunde <chlunde@ifi.uio.no>
diff --git a/pkg/clients/rds/rds.go b/pkg/clients/rds/rds.go
index b1221733..2f85e612 100644
--- a/pkg/clients/rds/rds.go
+++ b/pkg/clients/rds/rds.go
@@ -419,16 +419,16 @@ func LateInitialize(in *v1beta1.RDSInstanceParameters, db *rdstypes.DBInstance)
in.Engine = awsclients.LateInitializeString(in.Engine, db.Engine)
in.AllocatedStorage = awsclients.LateInitializeIntFrom32Ptr(in.AllocatedStorage, &db.AllocatedStorage)
- in.AutoMinorVersionUpgrade = awsclients.LateInitializeBoolPtr(in.AutoMinorVersionUpgrade, awsclients.Bool(db.AutoMinorVersionUpgrade))
+ in.AutoMinorVersionUpgrade = awsclients.LateInitializeBoolPtr(in.AutoMinorVersionUpgrade, awsclients.Bool(db.AutoMinorVersionUpgrade, awsclients.FieldRequired))
in.AvailabilityZone = awsclients.LateInitializeStringPtr(in.AvailabilityZone, db.AvailabilityZone)
in.BackupRetentionPeriod = awsclients.LateInitializeIntFrom32Ptr(in.BackupRetentionPeriod, &db.BackupRetentionPeriod)
in.CACertificateIdentifier = awsclients.LateInitializeStringPtr(in.CACertificateIdentifier, db.CACertificateIdentifier)
in.CharacterSetName = awsclients.LateInitializeStringPtr(in.CharacterSetName, db.CharacterSetName)
- in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, awsclients.Bool(db.CopyTagsToSnapshot))
+ in.CopyTagsToSnapshot = awsclients.LateInitializeBoolPtr(in.CopyTagsToSnapshot, awsclients.Bool(db.CopyTagsToSnapshot, awsclients.FieldRequired))
in.DBClusterIdentifier = awsclients.LateInitializeStringPtr(in.DBClusterIdentifier, db.DBClusterIdentifier)
in.DBName = awsclients.LateInitializeStringPtr(in.DBName, db.DBName)
- in.DeletionProtection = awsclients.LateInitializeBoolPtr(in.DeletionProtection, awsclients.Bool(db.DeletionProtection))
- in.EnableIAMDatabaseAuthentication = awsclients.LateInitializeBoolPtr(in.EnableIAMDatabaseAuthentication, awsclients.Bool(db.IAMDatabaseAuthenticationEnabled))
+ in.DeletionProtection = awsclients.LateInitializeBoolPtr(in.DeletionProtection, awsclients.Bool(db.DeletionProtection, awsclients.FieldRequired))
+ in.EnableIAMDatabaseAuthentication = awsclients.LateInitializeBoolPtr(in.EnableIAMDatabaseAuthentication, awsclients.Bool(db.IAMDatabaseAuthenticationEnabled, awsclients.FieldRequired))
in.EnablePerformanceInsights = awsclients.LateInitializeBoolPtr(in.EnablePerformanceInsights, db.PerformanceInsightsEnabled)
in.IOPS = awsclients.LateInitializeIntFrom32Ptr(in.IOPS, db.Iops)
in.KMSKeyID = awsclients.LateInitializeStringPtr(in.KMSKeyID, db.KmsKeyId)
@@ -437,14 +437,14 @@ func LateInitialize(in *v1beta1.RDSInstanceParameters, db *rdstypes.DBInstance)
in.MaxAllocatedStorage = awsclients.LateInitializeIntFrom32Ptr(in.MaxAllocatedStorage, db.MaxAllocatedStorage)
in.MonitoringInterval = awsclients.LateInitializeIntFrom32Ptr(in.MonitoringInterval, db.MonitoringInterval)
in.MonitoringRoleARN = awsclients.LateInitializeStringPtr(in.MonitoringRoleARN, db.MonitoringRoleArn)
- in.MultiAZ = awsclients.LateInitializeBoolPtr(in.MultiAZ, awsclients.Bool(db.MultiAZ))
+ in.MultiAZ = awsclients.LateInitializeBoolPtr(in.MultiAZ, awsclients.Bool(db.MultiAZ, awsclients.FieldRequired))
in.PerformanceInsightsKMSKeyID = awsclients.LateInitializeStringPtr(in.PerformanceInsightsKMSKeyID, db.PerformanceInsightsKMSKeyId)
in.PerformanceInsightsRetentionPeriod = awsclients.LateInitializeIntFrom32Ptr(in.PerformanceInsightsRetentionPeriod, db.PerformanceInsightsRetentionPeriod)
in.PreferredBackupWindow = awsclients.LateInitializeStringPtr(in.PreferredBackupWindow, db.PreferredBackupWindow)
in.PreferredMaintenanceWindow = awsclients.LateInitializeStringPtr(in.PreferredMaintenanceWindow, db.PreferredMaintenanceWindow)
in.PromotionTier = awsclients.LateInitializeIntFrom32Ptr(in.PromotionTier, db.PromotionTier)
- in.PubliclyAccessible = awsclients.LateInitializeBoolPtr(in.PubliclyAccessible, awsclients.Bool(db.PubliclyAccessible))
- in.StorageEncrypted = awsclients.LateInitializeBoolPtr(in.StorageEncrypted, awsclients.Bool(db.StorageEncrypted))
+ in.PubliclyAccessible = awsclients.LateInitializeBoolPtr(in.PubliclyAccessible, awsclients.Bool(db.PubliclyAccessible, awsclients.FieldRequired))
+ in.StorageEncrypted = awsclients.LateInitializeBoolPtr(in.StorageEncrypted, awsclients.Bool(db.StorageEncrypted, awsclients.FieldRequired))
in.StorageType = awsclients.LateInitializeStringPtr(in.StorageType, db.StorageType)
in.Timezone = awsclients.LateInitializeStringPtr(in.Timezone, db.Timezone)
apparently not that trivial, tests now fail like this, and the tests were not changed for SDK v2. Too late for me too look more into this at this moment, so feel free to take a look someone else.
=== RUN TestLateInitialize/ProcessorFeaturesNotOverwritten
rds_test.go:1009: r: -want, +got:
v1beta1.RDSInstanceParameters{
Region: nil,
AllocatedStorage: nil,
- AutoMinorVersionUpgrade: nil,
+ AutoMinorVersionUpgrade: &false,
AvailabilityZone: nil,
BackupRetentionPeriod: nil,
CACertificateIdentifier: nil,
CharacterSetName: nil,
- CopyTagsToSnapshot: nil,
+ CopyTagsToSnapshot: &false,
DBClusterIdentifier: nil,
DBInstanceClass: "",
... // 3 identical fields
DBSubnetGroupNameRef: nil,
DBSubnetGroupNameSelector: nil,
- DeletionProtection: nil,
+ DeletionProtection: &false,
EnableCloudwatchLogsExports: nil,
- EnableIAMDatabaseAuthentication: nil,
+ EnableIAMDatabaseAuthentication: &false,
EnablePerformanceInsights: nil,
Engine: "",
... // 8 identical fields
MonitoringRoleARNRef: nil,
MonitoringRoleARNSelector: nil,
- MultiAZ: nil,
+ MultiAZ: &false,
PerformanceInsightsKMSKeyID: nil,
PerformanceInsightsRetentionPeriod: nil,
... // 3 identical fields
ProcessorFeatures: {{Name: "existing", Value: "existing"}},
PromotionTier: nil,
- PubliclyAccessible: nil,
+ PubliclyAccessible: &false,
ScalingConfiguration: nil,
- StorageEncrypted: nil,
+ StorageEncrypted: &false,
StorageType: nil,
Tags: nil,
... // 16 identical fields
}
Maybe #909 is also affected by this bug?
Yeah I've just tried to spin up some new replicationgroups and they never become ready. Not sure how related it is to this, but I think these issues are mostly happening since the change from aws sdk v1 to v2, which was quite a big change so it seems to make sense.
I've reverted the provider back to 0.19.0 at this point (last known stable version for us).
Thank you all for debugging here!
It seems like SDK started to use bool
instead of *bool
for some fields but not for all of them and we accounted for that change by taking address of such fields via awsclients.Bool(bool) *bool
, which returns nil
for false
values. However, the correct thing to do would be as implemented in https://github.com/crossplane/provider-aws/pull/971 because if SDK returns false
, it's simply false
and that's how it should be treated instead of nil
.
@eloo In replication group code, I couldn't see such a change. Maybe automaticFailoverEnabled(af elasticachetypes.AutomaticFailoverStatus) *bool
could be a good candidate to check, i.e. return &false
instead of nil
depending on what it's compared against.
Thank you all for debugging here!
It seems like SDK started to use
bool
instead of*bool
for some fields but not for all of them and we accounted for that change by taking address of such fields viaawsclients.Bool(bool) *bool
, which returnsnil
forfalse
values. However, the correct thing to do would be as implemented in #971 because if SDK returnsfalse
, it's simplyfalse
and that's how it should be treated instead ofnil
.
This is what I've been thinking, I've tried all sorts of things tonight and it seems every time i fix one scenario it breaks another :')
I did try the fix from that PR earlier during my experiments, and saw the same test failures @chlunde mentioned in this comment. But I do feel like the tests can just be updated to reflect the new 'default' values for these boolean fields; it feels like they should default to (*bool)(false)
as that reflects the aws sdk.
I'm not sure what side affects it may have if we change those values from (*bool)(nil)
to (*bool)(false)
, but I'd be happy to try out a fix on our staging server to be fair. I'll try find some time tomorrow to deploy a patch and see if anything blows up :')
@muvaf i'm not still not what the real root cause is but i have teared down the problem to the following points:
further its nearly impossible to debug this without changing the code. so it would be great if every resource could log the reason of an update, like the diff we see in iamrole.
What happened?
After updating to crossplane 1.5.0, and provider-aws 0.21.0, I could see in logs (when i enable --debug on this provider) that update requests were being sent every time my rdsinstances were reconciled. I looked into cloudtrail to see what was being updated and could see that the requestParameters for the ModifyDBInstance calls being made by my crossplane user looked like this:
The thing that stood out to me was that the parameters being sent in the update request were all booleans that were set to false, and nothing else (other than the id). I checked these values against my rds instances using the
aws rds describe-db-instances --db-instance-identifier xxxxxxx
command and could see that all of these values in the request were already set to false on the instances in AWS; so these ModifyDBInstance requests shouldn't be getting sent at all as far i can tell.This doesn't seem to cause any problems in the rdsinstance itself, but the constant update requests seemed a bit worrying.
How can we reproduce it?
I think this can be recreated by creating an rdsinstance and setting publiclyAccesible or multiAZ to false. It's unclear exactly which fields this may happen with.
This is the spec i'm using for the rdsinstance (with some fields redacted):
What environment did it happen in?
Crossplane version: 1.5.0 Kubernetes v1.21.2 EKS v1.21.2-eks-06eac09
Extra notes
I did some investigation into the provider myself, and it seems that there is a bug in the
IsUpToDate
function of the rds client. When the rdsinstance has a boolean setting that is false (only tested with the PubliclyAccessible field), and the CRD also has that setting equal to false, it returns that the external is NOT up to date when it should be.It looks like this may be due to how booleans can be returned as a nil pointer when they're false rather than a pointer to a false value. See func
Bool
inpkg/clients/aws.go
.I've opened a draft PR with some tests that seem to highlight the problems with the IsUpToDate and CreatePatch functions in the rds client. Hopefully this helps: https://github.com/crossplane/provider-aws/pull/959