Might be related to #828, but I don't understand why you have so many AssumeRole and so few AssumeRoleWithWebIdentity calls. What kind of ProviderConfig do you use?
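(For reference, a ProviderConfig that authenticates via IRSA, which is what shows up as AssumeRoleWithWebIdentity in CloudTrail, looks roughly like this; only a sketch, assuming a provider version with IRSA support:)

apiVersion: aws.crossplane.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    # IRSA exchanges the pod's service account token for AWS credentials,
    # which produces AssumeRoleWithWebIdentity events rather than AssumeRole
    source: IRSA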
Could you also take a look into what kind of resource is tagged with AddTagsToResource? Can you then run kubectl get -w -o yaml on the managed resource and see what changes?
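For example (the resource kind and name here are placeholders):

$ kubectl get rdsinstance.database.aws.crossplane.io my-instance -o yaml -w

Saving successive dumps and diffing them makes it easy to spot which fields keep changing.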
Re. RDSInstance, I wonder if you could upgrade to 0.29 to check whether the situation has improved. The only issue I know of right now is that if you use upper case in the maintenance/backup window, AWS returns it in lower case.
Hello, thanks for your answer; let me explain more of our setup.
We have a multi-tenant cluster, and to avoid any configuration mistakes from our clients, we made our own kinds available as Compositions backed by a CompositeResourceDefinition: there is one for MySQL (kind: MysqlInstance) and one for PostgreSQL (kind: PostgresInstance).
You can find the MySQL ones below (Postgres is basically the same, only the values change); note that I replaced our company name with "company".
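(For illustration, a CompositeResourceDefinition exposing a MysqlInstance claim has roughly this shape; the group, names, and schema below are placeholders, not our real definition:)

apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xmysqlinstances.company.example.org
spec:
  group: company.example.org
  names:
    kind: XMysqlInstance
    plural: xmysqlinstances
  claimNames:
    # the namespaced claim kind our clients create
    kind: MysqlInstance
    plural: mysqlinstances
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                storageGB:
                  type: integer
              required:
                - storageGB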
While checking our setup I found this after describing an instance; it seems like the resource keeps being rewritten, and that may be the trigger for all of our issues:
Events:
  Type    Reason                      Age                     From                                                              Message
  ----    ------                      ----                    ----                                                              -------
  Normal  BindCompositeResource       29m (x6808 over 20d)    offered/compositeresourcedefinition.apiextensions.crossplane.io  Successfully bound composite resource
  Normal  ConfigureCompositeResource  6m59s (x6834 over 20d)  offered/compositeresourcedefinition.apiextensions.crossplane.io  Successfully applied composite resource
All of our objects are impacted, and the resourceVersion increases each time I run the command.
$ k get mysql,pgsql -A -o=custom-columns='RESOURCE_VERSION:.metadata.resourceVersion'
RESOURCE_VERSION
92920052
92920175
92919353
92919299
[...]
Regarding AddTagsToResource: I gathered all the events from the last 10 hours and the targets are only the RDS DB objects; it doesn't impact any other object (parameter group, security group, ...).
However, I found out that not all the instances are on the list, and I found why: the instances with an empty rdsInstance.forProvider.dbName are the ones being tagged over and over.
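A quick way to spot the affected instances (assuming the standard database.aws.crossplane.io RDSInstance kind) is to print the dbName column; unset values show up as <none>:

$ kubectl get rdsinstances.database.aws.crossplane.io -o custom-columns='NAME:.metadata.name,DBNAME:.spec.forProvider.dbName'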
That's something we already made mandatory in our latest templates; people may be using an outdated version, and we are going to fix this ASAP.
We need to test the impact of a live upgrade on another cluster; that could happen during this week.
As for the maintenance/backup windows, we do not specify those values, so the defaults are used.
Thanks again for your help
Hello @chlunde ,
The situation is worsening: we are being throttled by AWS as the number of calls increases. We requested a quota extension, but it's difficult to explain to AWS why there are nearly 50 calls per second for only 30 deployed databases.
'AddTagsToResource': 435,
'DescribeDBClusterParameterGroups': 17042,
'DescribeDBClusterParameters': 7925,
'DescribeDBClusters': 18020,
'DescribeDBInstances': 49925,
'DescribeDBParameterGroups': 16505,
'DescribeDBParameters': 15496,
'DescribeDBSubnetGroups': 28931,
'ListTagsForResource': 19208,
'ModifyDBInstance': 558,
'ModifyDBParameterGroup': 67,
'Processed Events': 174112,
'Timeframe': '- 60 minutes'
We reduced the number of databases without rdsInstance.forProvider.dbName; not sure why it's related, but AddTagsToResource has decreased.
We are also planning to upgrade to 0.29.
EDIT: We suspect something: the patches in the compositions that copy the status from each component to the composite:
- type: ToCompositeFieldPath
fromFieldPath: status.conditions
toFieldPath: status.components.dbSubnetGroup
We are going to test disabling it and, if it does not break anything, deploy it in production and re-assess the call numbers.
Do we have another way to mitigate these issues?
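A less chatty variant (a sketch only, assuming status.atProvider.dbInstanceStatus exists on the RDSInstance) would be to patch a single scalar field instead of the whole conditions array; conditions carry lastTransitionTime and message, so copying them forces a write to the composite on almost every reconcile:

- type: ToCompositeFieldPath
  # copy one stable scalar instead of the whole conditions array
  fromFieldPath: status.atProvider.dbInstanceStatus
  toFieldPath: status.components.rdsInstance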
Update: after some back-and-forth with AWS and a lot of digging on our side, we found out that when the provider fails to sync an object, it keeps retrying endlessly.
As the retries pile up, more and more calls get throttled, the rate limit is hit, and it stays stuck. AWS temporarily increased our rate limit, the provider managed to sync, and then the calls dropped to a more normal rate.
From AWS :
It is important to highlight that the call rate dropped to 1/3 (from 50 req/s to 16 req/s) after the update,
this is due to the fact that you are constantly retrying when you get throttled.
Suggested to review the automation process to avoid those retry storms.
Those limits will be removed in two weeks from now, unless you have a valid business case to keep them.
Regarding our setup: we upgraded to 0.29.0 (no change) and removed the components status patches (no change). We are still wondering how to cap the number of calls and, if possible, enable exponential backoff so this situation cannot happen again.
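For completeness, the two flags already mentioned in this thread can be set on the provider through a ControllerConfig; a minimal sketch (the names and package tag are placeholders, and as far as we can tell there is no exponential-backoff flag at this version):

apiVersion: pkg.crossplane.io/v1alpha1
kind: ControllerConfig
metadata:
  name: aws-throttle
spec:
  args:
    - --poll=5m                # how often managed resources are re-observed for drift
    - --max-reconcile-rate=1   # global rate limit on reconciles per second
---
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: crossplane/provider-aws:v0.29.0   # package reference assumed
  controllerConfigRef:
    name: aws-throttle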
Was anyone able to fix it? We are having the same problem: too many API calls resulting in throttling, especially for CloudFront. Crossplane version: 1.10, aws-provider: 0.32, EKS: 1.22.
Thanks
The changes in https://github.com/crossplane-contrib/provider-aws/pull/1705 may have solved much of this issue; they are in the latest release, 0.39.0.
Crossplane does not currently have enough maintainers to address every issue and pull request. This issue has been automatically marked as stale
because it has had no activity in the last 90 days. It will be closed in 14 days if no further activity occurs. Leaving a comment starting with /fresh
will mark this issue as not stale.
What happened?
Hello, we found out that most of the calls logged by CloudTrail are from provider-aws, and it significantly impacts our GuardDuty bill.
After some digging we tried the solution from #847 and added --poll=5m, which drastically reduced the number of calls overall. However, we think the call rate is still very high considering our setup. Here is the summary of the last hour of calls: about 80% of AssumeRole, and nearly 100% of ModifyDBInstance and AddTagsToResource, come from crossplane-provider-aws. We have an EKS cluster and are starting to migrate our clients from bare metal; we have only created a small number of RDS instances so far (we expect at least 3x this number), so we are afraid of an exponential increase in calls to AWS in the coming weeks:
- Playing around with max-reconcile-rate didn't change anything, and the sync default of 1 hour shouldn't have much impact.
- We could increase the poll argument further, but we are not sure it is a good idea to reduce Crossplane's reactivity.
- Why so many AddTagsToResource calls? (there are no changes to the tags once the instance is created)
- The number of ModifyDBInstance calls per hour for 26 RDS instances seems a bit intense IMO.
Thanks in advance and let me know if you need more information.
How can we reproduce it?
Standard install of Crossplane + provider-aws; create a bunch of RDS instances and check CloudTrail.
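For a concrete starting point, a minimal RDSInstance roughly like the one below (values are illustrative, field names per the database.aws.crossplane.io/v1beta1 API) is enough to observe the call pattern in CloudTrail; leaving dbName unset was correlated with the AddTagsToResource loop discussed above:

apiVersion: database.aws.crossplane.io/v1beta1
kind: RDSInstance
metadata:
  name: repro-db
spec:
  forProvider:
    region: eu-west-1
    dbInstanceClass: db.t3.micro
    masterUsername: adminuser
    allocatedStorage: 20
    engine: mysql
    engineVersion: "8.0"
    # leaving dbName empty was correlated with the endless tagging calls above
    dbName: repro
    skipFinalSnapshotBeforeDeletion: true
  writeConnectionSecretToRef:
    name: repro-db-conn
    namespace: crossplane-system
  providerConfigRef:
    name: default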
What environment did it happen in?
- Crossplane version: 1.7.0
- provider-aws version: 0.26.0
- Kubernetes version (kubectl version): EKS 1.21.5