Closed dee0 closed 3 years ago
So the data that I provided when I opened this defect was collected from logs from a CI/CD pipeline.
A short while ago I performed a manual test where I I had better ability to collect logs.
I deleted a claim and all of the API objects except the composition were deleted from k8s. The last conditions for the composition are conditions:
However at least at the moment if I try to delete the VPC in the AWS console the dialog indicates that it will be deleted and these resources will also be deleted rtb-07856e5a4ef84b858 rtb-0d6436f64d2c69c98 Note I have seen the case where there are dependent resources that cannot actually be deleted. In that case the AWS console tells you upfront you won't be able to delete the VPC while the dependencies exist. So my current case is really not the same as that.
That said, CloudTrails shows that Crossplane is still trying to delete the VPC and the delete attempts are failing with Client.DependencyViolation.
Those two dependent resources above are a couple of un-named route tables. Not sure why they are there.
While I am not familiar with the AWS api, at the moment I am thinking that there is a Crossplane bug where either
I examined a teammate's deploy and I see there are 3 named route tables, clearly from the composition, and one nameless one.
At the moment I don't understand where these nameless route tables are coming from or why my deployment had two while my teammate's had only one.
In any case, it seems these nameless routetables are some how related to the problem we are seeing.
The application of the claim happens with in k8s job. I just thought to check the logs of this job for both my teammate's deployment and mine.
I see in my teammate's deployment the job ran once. However in my deployment it ran twice. The first run failed while the second succeeded. The log for my deployment's first run ends with
+ /bin/kubectl apply -f /deployment-data/claim.yaml
vpc.aws.platform.dmi.sap.com/vpc-mc-i540621-dmi created
+ /bin/kubectl wait '--timeout=900s' '--for=condition=Ready' -n mc-provisioner VPC vpc-mc-i540621-dmi
Error from server (InternalError): an error on the server ("") has prevented the request from succeeding
The log for the second run ends with
+ /bin/kubectl apply -f /deployment-data/claim.yaml
vpc.aws.platform.dmi.sap.com/vpc-mc-i540621-dmi unchanged
+ /bin/kubectl wait '--timeout=900s' '--for=condition=Ready' -n mc-provisioner VPC vpc-mc-i540621-dmi
vpc.aws.platform.dmi.sap.com/vpc-mc-i540621-dmi condition met
That failure of the first run is something we have been seeing since we started using Crossplane but we have been ignoring since k8s always re-runs the job until there is success and it usually takes at most 3 tries to achieve this.
Am wondering now if the fact that my deployment has two unexpected, at least to me, nameless route tables associated with the VPC while my teammate's deployment just has one is somehow related to the circumstances around the job running multiple times. Say possibly the error from the API server also causes Crossplane to do something that causes the extra nameless route table to be created or us applying the claim multiple times causes it even though kubectl is reporting that vpc.aws.platform.dmi.sap.com/vpc-mc-i540621-dmi unchanged on the second run.
Hmm, just looked at CloudTrails events for vpc-14449-d66-dm ( the ci/cd case ) and vpc-0623d044576ddef33 ( my manual case ) and I think either there are two different problems or the nameless route tables are irrevevant.
In the case of vpc-14449-d66-dm only 3 route tables ( the expected number given the xrd ) were created while in the case of my manual those nameless route tables are hanging around.
@dee0 are the "nameless" route tables being created by another process? provider-aws
not force deleting dependent resources is by design -- we encourage management of all resources via Crossplane, and we do not want to delete resources that are not under Crossplane's management in any scenario.
@hasheddan The nameless routes are not being created by another process, they are being created by Crossplane. The timing, user id, client ip etc for the create event in CloudTrails indicate this.
Wonder if this could be related at all to either https://github.com/upbound/platform-ref-aws/issues/3#issuecomment-713025391 or the various issues linked from the discussion of that issue.
Aws provider log for my manual test ( the one involving vpc-mc-i540621-dmi ) i540621-dmi-aws-provider.log
Btw the problem of leaked resources, whatever the cause(s), happens often but not always. A cluster I was comparing against yesterday was successfully cleaned up this morning.
It would be interesting to understand more about the unexpected / "nameless" RouteTables before any attempt is made to delete resources and clean the cluster up. I want to see if we can find early on, before deletions come into play, if there are any leaked or untracked resources.
For any unexpected RouteTable, can you confirm:
kubectl get RouteTable -A
@jbw976 I'll try to get the answers as soon as I can. Btw in regards to the 1st question, is running kubectl get managed -A equivalent? I ask because I can say for sure that the nameless route tables don't show up in that output.
Spent the last while reviewing about VPC in AWS, 7 orphaned and the rest in use. All 7 seem to match the pattern I reported when I opened this defect. That is
None of the VPC, orphaned otherwise, have nameless route tables associated with them.
So I am still of the opinion that we have two, at least, ways that VPC are being orphaned. One where we run out of things to wait on and yet the VPC hasn't been removed from AWS and one where nameless route tables prevent deletion.
A word on the nameless routebables. Looks like there should be at least one associated with each VPC https://docs.aws.amazon.com/vpc/latest/userguide/VPC_Route_Tables.html https://docs.aws.amazon.com/vpc/latest/userguide/WorkWithRouteTables.html
Not sure how we ended up with two nameless route tables though. I did just performed a deploy where there was no problems and I saw that when I looked at the details of the single nameless route table associated with my VPC there something that said it was the 'main' route table. If/when I see the case of multiple nameless routetables I'll look at the both more carefully to try and understand what the diff between them is.
That said, I would like to re-iterate that I think there are two different cases we are dealing with
I would also like to ask again, why does Crossplane make so many failed attempts to delete the VPC even in the success case?
We have made some changes so that we can capture debug logs from Crossplane. The spreadsheet in the attached zip contains log data from kibana - This has the debug logging from Crossplane jenkins - This contains messages showing the clean up activity cloudtrail - This shows the events from AWS perspective
Each of the above is in its own worksheet and there is also a worksheet that combines the logs from all 3 sources.
The clocks are a little out of sync so the messages in the combined tab aren't 100% in the correct order. e.g. The clocks for the cluster and jenkins are about 1/10 second out of sync.
In the combined worksheet the first deletion message is at line 24, 2021-07-22 02:33:27.457 cluster time. Here's where you can see clocks are out of sync cause it proceeds the delete from the Jenkins log.
By line 554, 2021-07-22 02:35:29.990 jenkins time, our attempts to perform a controlled cleanup have completed.
As I mentioned near the beginning of this, it seems to me the problem is down to our clean up code not being able to accurately see what is happening with the VPC.
Looking at the 'combined' worksheet :
At line 33, 2021-07-22 02:33:27.522 jenkins time, we delete the claim vpc-mc-14814-2e1-dmi.
At line 34, 2021-07-22 02:33:27.522 jenkins time, we try to wait for the managed resource VPC vpc-mc-14814-2e1-dmi-ftm9p-9k8gb.
At line 36, 2021-07-22 02:33:28.000 jenkins time, you can see this failed with a 'resource not found' error.
However we can see from log messages a couple of minutes later the VPC still exists in AWS and Crossplane is still trying to delete it. e.g. Line 504, 2021-07-22 02:35:24.000 AWS time
So I think our problems begins with kubernetes/crossplane returning that 'not found' error when we tried to wait on vpc-mc-14814-2e1-dmi-ftm9p-9k8gb. <!-- @hasheddan Think this is the bug
Some other things of note in the combined tab
One last thing. By line 554, 2021-07-22 02:35:29.990 jenkins time, our attempts clean things up nicely have ended and clean up is turned over to Gardner. As mentioned before Gardener tries to delete everything it can find via the cluster api server. So after this you point you will see chaos.
Just checked another case of leaked resource and I see pattern.
Btw for each of these cases we are only creating one VPC. So it isn't the case that get managed is returning some VPC other than the one we deleted.
@hasheddan @jbw976 Have a couple more pieces of info
So as I said a while ago, seems like we are running into multiple problems. While my team can address the name collision problem easily enough I am not so sure about the extraneous route table.
2021-07-29-extra-route-table.zip
In the attached zip are
The extra route table was created at about July 29, 2021, 09:58:23 (UTC-07:00)
grep -Pi route.*created kibana.csv Turns up 4 events indicating successful creation of route tables ( expect only 3 ). There are two events for vpc-mc-i540621-dmi-ds67r-k6xxz, maybe that is a clue?
2nd-and-3rd-routetable-messages.zip
In the attached zip are log messages that I have extracted from kibana.csv ( see above ) and normalized for comparison
I normalized contents of these files by removing the timestamps, replacing the last portion of the route table name with 'tableid' and by replacing the values for resourceVersion with v1, v2, ....
Comparing the files I see
I'm thinking the spurious RouteTable creation could be a race in the relevant controller. RouteTables are one of the resources where we rely on the crossplane.io/external-name
being set to determine whether we've created the desired RouteTable or not per the below code:
The flow is:
Observe
is called. It reports that the RouteTable does not exist because the external-name
annotation is unset.Create
is called. It sends a CreateRouteTableRequest
then sets the external-name
annotation.You could imagine a race where 1. is called twice before 2. succeeds. I would imagine in this case we would:
external-name
is set, so calls CreateRouteTableRequest
.external-name
is set, so calls CreateRouteTableRequest
.external-name
.external-name
, but is rejected by the API server because the resource has changed.That said, I'm pretty sure we're running with MaxConcurrentReconciles = 1
so it's not clear to me how we'd even have two reconciles happening at once.
Fwiw we haven't done anything to increase the concurrency.
Have been tracing through the code to see where an extra event might come from or where they might be allowed through. I up to the the point where the rate limiting queue receives events. Haven't seen anything yet.
Btw my background is mostly with C++ and Java and with those languages I am used to having a thread id in log messages. That takes away a lot of the mystery when diagnosing problems of concurrency. Is there a way we can enable that for goroutines? Just thinking of how we might confirm your theory.
@negz If you haven't already would you please use a visual diff tool ( e.g. winmerge on Windows ) to compare vpc-mc-i540621-dmi-ds67r-k6xxz-events.txt and vpc-mc-i540621-dmi-ds67r-pxd9x-events.txt? ( See my post a bit earlier today )
Am hoping you will recognize something I do not, something that confirms your theory.
@dee0 The fact that k6xxz
has two CreatedExternalResource
calls, each at a different resourceVersion
, supports my theory above. That said, I would also expect to have seen a "the object has been modified; please apply your changes to the latest version and try again" error immediately after the second create. I note that we do have one of those in the logs but it appears to pertain to updating the resource's status - i.e. it says "cannot update managed resource status" - where I'd instead expect the error to instead be prefixed with "cannot update managed resource".
Checked the log and confirmed the workcount for the routetable controler is 1
"2021-07-29T16:58:02.288Z","2021-07-29T16:58:02.288Z INFO controller-runtime.manager.controller.managed/routetable.ec2.aws.crossplane.io Starting workers {""reconciler group"": ""ec2.aws.crossplane.io"", ""reconciler kind"": ""RouteTable"", ""worker count"": 1}"
@hasheddan @negz I think I might have a clue about how the scenario Nic described is happening. vpc-mc-i540621-dmi-ds67r-k6xxz-events-with-times-order-fixed.zip
In the attached zip is a file where I have taken the log messages related to the problematic managed resource vpc-mc-i540621-dmi-ds67r-k6xxz- and
Note: The order of the messages logs collected from Kibana aren't exactly in the right order because messages are sent as datagrams to Kibana. I was able to fix the order by sorting the messages on the timestamp that was part of the original log message.
Anyhoo with the order corrected I see that when the first route table is created for vpc-mc-i540621-dmi-ds67r-k6xxz Reconcile first reports a resource version id of 34683 and then 35015.
However when the second route table for vpc-mc-i540621-dmi-ds67r-k6xxz is created Reconcile first reports the resource id is 35012 and then 35017.
So it seems show how in the later call to Reconcile it fetched an out of date version of the RouteTable vpc-mc-i540621-dmi-ds67r-k6xxz.
Please review the log messages from 2021-07-29T16:58:22.982Z to 2021-07-29T16:58:23.780Z to confirm I am not misreading things.
Been reading through the code and at the moment I think the problem is with the caching. In the log I uploaded in my last post this is the problematic revision sequence 34683 // provider-aws log message 35015 // controller-runtime.manager.events message 35012 // provider-aws log message <!-- problem 35017 // controller-runtime.manager.events message
The provider-aws messages use data from a managed resource object filled by a call to client.Get which uses the cache. The controller-runtime.manager.events message messages use a managed resource object that has been updated by the response to client.Update. Looking at the code for client.Update I see it doesn't update the cache.
I'll try and sort out when the cache actually does get updated to see why this problem doesn't happen all the time for us.
Think I see where the concurrency problem comes from.
There are 5 categories of goroutines involved in processing resource events. (1) Reflector::Run Observes resources in API server and pushes deltas it detects into DeltaFifo. Updates the cache approximately 800ms There are one of these for each managed resource type.
(2) sharedIndexInformer::Run Pops updates from DeltaFifo and pushes notifications to the channel processorListener:addCh There are one of these for each managed resource type.
(3) processorListener:pop Picks updates to addCh and pushes them to channel processorListener:nextCh, possibly buffering them in an expanding circular buffer first. There are one of these for each managed resource type.
(4) processListener:run Picks up items from processorListener:nextCh and for each generates a ReconcileRequest which it adds to the rateLimitingType which is the Controller's Queue. A ReconcileRequest just contains a namespace qualified name. There are one of these for each managed resource type.
(5) Controller's workers Processes ReconcileRequest as they become available in Controller's Queue. Fetches resource items from the cache updated by goroutine (1). Blocks until there are updates from goroutine (1) that hasn't been picked up by (2). There is just one of these.
So here is what is happening
I think ideally controller-runtime would utilize a write-through cache. At the moment I suspect that would be major surgery. Not sure, but perhaps the Reconciler could detect this situation and requeue the request.
@hasheddan @negz Think this may be what happened with the reconciles
Read through the code yesterday and maybe adding a right through cache isn't as big of a job as I thought. Still pretty big, but not as big. :)
negz asked me to write up some of the options that have been floated for fixing this problem.
Add crd _Type_Reservation, e.g. RouteTableReservation. A reservation can be used to track whether process of creating the external resource has started or fulfilled. So the happy-path would look like, for example,
In the case where a stale RouteTable was fetched from the cache the attempt to create RouteTableReservation would fail. Course the challenge with this would be dealing with orphaned, unfulfilled, RouteTableReservation in the event of a panic or other major failure.
Btw I think the 4 comments just before this one do a good job of summarizing what is happening.
Another bandaid option
Instead of having ResolveReferences update the managed object in kubernetes have it return all of the external ids of the references if they are available. If they are not then requeue as is done today.
Then just set them on the managed object when setting the external-name.
I'm going to self assign this in the (possibly optimistic) hope that I'll get some time to look into this next week. I'm also moving it to the crossplane-runtime repo, since that's where the offending code lives.
Currently I'm trying to reproduce this issue. If I can, I’m thinking of leveraging the optimistic concurrency control we get on writes to the API server to try to address the issue. For the handful of RouteTable-like resources we have I’m hoping we could change the flow to be:
crossplane.io/external-name-pending: $timestamp
to the relevant resource.crossplane.io/external-name
annotation like we do today.This won’t prevent us from leaking external resources with non-deterministic IDs, because that’s pretty much impossible to prevent AFAICT. It’s always possible that we’ll error out while persisting the external-name annotation and need to requeue. However by writing an annotation before we try create we can guarantee that we have the latest version of the (e.g.) RouteTable because the API server will fail the write if we don’t. We can also avoid transparently creating the RouteTable twice and at least warn the user that we potentially leaked one.
@dee0sap do you have any tips on how I might reproduce this? I'm using the below XR and Composition that seem roughly the same as yours, but I'm yet to find a leaked route table.
https://gist.github.com/negz/e1f2e74f18802d15440214a1a1abc981
@negz Yeah, your XR and composition seem the same ( at least at a glance )
Maybe it is a question of a scale? In our case this code is being run by
That said, at least in the case of the set of logs I sent you last, and that have spent the most time staring at, I saw that the route table had to be requeued at least a few times because the referenced resources ( vpc, natgateway ) weren't available yet. Not sure that is relevant or how you could force it to happen if it is.
Are you trying to reproduce the situation 'for real' or are you using mocks within a unit test?
I believe I've reproduced this (for real) using the script above, applying leaktest.yaml
against an AWS account. I didn't keep count, but I created and destroyed ~1,000 instances of the LeakTest
over the course of ~10 hours. When I checked in this morning I notice two (of ~1,000) VPCs were not able to be deleted due to extra RouteTables. I need to confirm that these were created by Crossplane, but I expect they were.
Cool! :)
Btw I in our case the extra route tables
So I can spot suspects quickly just be looking for nameless, non-main, route tables. I then check for no tags and check cloud trails to see if the deployment user created it. If I see that then I know I am looking at a leak.
That makes sense. I unfortunately don't have CloudTrail enabled so I can't use that to confirm, but these routes are missing tags. Missing tags makes sense because Crossplane actually adds tags after creation in an update pass for this particular resource. When you mention routes being nameless, are you referring to the "Name" tag?
See the screenshot. Think that is the same as the name tag but I am not sure.
Hah. I looked all over in the console and somehow missed that column that was right in front of me. I just confirmed the name is indeed just the value of the Name
tag.
I'm pretty confident what we're seeing is actually eventual consistency in the AWS API. I do think the crossplane-runtime issue we identified is real, but I haven't actually observed it in practice. I noticed the following log lines while testing a potential fix (https://github.com/crossplane/crossplane-runtime/pull/279):
2021-08-14T08:37:33.972Z DEBUG provider-aws Successfully requested creation of external resource {"controller": "managed/routetable.ec2.aws.crossplane.io", "request": "/test-1-s6hcj", "uid": "ff9f2de0-866e-41d1-956a-6719578f15fb", "version": "221056", "external-name": ""}
2021-08-14T08:37:34.751Z DEBUG provider-aws Successfully requested creation of external resource {"controller": "managed/routetable.ec2.aws.crossplane.io", "request": "/test-1-s6hcj", "uid": "ff9f2de0-866e-41d1-956a-6719578f15fb", "version": "221492", "external-name": "rtb-0673d041d56da694f"}
What's interesting about this is that the value of "external-name"
is actually set before the Create
call, and not updated after. So the second log line indicates that we already had our external name annotation set to rtb-0673d041d56da694f
, but when we called the RouteTable
Observe
method it reported that that route table rtb-0673d041d56da694f
did not exist. I suspected this could be due to the DescribeRouteTablesRequest
AWS call sometimes lagging a little behind, and it appears that's the case. You can see in the below Terraform PR that they demonstrated the issue and built in some retries in order to wait for the route table to exist:
https://github.com/hashicorp/terraform-provider-aws/pull/1791#issue-144126468
I suspect the fix here will be to add a reasonable amount of retries with exponential backoff around the DescribeRouteTablesRequest
call in the RouteTable
controller.
@negz I am sure on our side we are hitting the issue that https://github.com/crossplane/crossplane-runtime/pull/279 fixes. So I wouldn't say it is hypothetical.
From the file vpc-mc-i540621-dmi-ds67r-k6xxz-events-with-times-order-fixed.csv, which I supplied some time ago, we have the sequence below. Note the resource versions and external-name values. When that second create is happening external-name isn't set. And this makes sense because when external-name was set originally that produced version 35015 but when the second create is happening Reconcile is working with 35012.
2021-07-29T16:58:22.982Z DEBUG provider-aws Reconciling {""controller"": ""managed/routetable.ec2.aws.crossplane.io"", ""request"": ""/vpc-mc-i540621-dmi-ds67r-k6xxz""}"
2021-07-29T16:58:23.475Z DEBUG provider-aws Successfully requested creation of external resource {""controller"": ""managed/routetable.ec2.aws.crossplane.io"", ""request"": ""/vpc-mc-i540621-dmi-ds67r-k6xxz"", ""uid"": ""db33bd1c-151e-499e-a3e6-7bbda6065d2a"", ""version"": ""34683"", ""external-name"": """"}"
2021-07-29T16:58:23.475Z DEBUG controller-runtime.manager.events Normal {""object"": {""kind"":""RouteTable"",""name"":""vpc-mc-i540621-dmi-ds67r-k6xxz"",""uid"":""db33bd1c-151e-499e-a3e6-7bbda6065d2a"",""apiVersion"":""ec2.aws.crossplane.io/v1beta1"",""resourceVersion"":""35015""}, ""reason"": ""CreatedExternalResource"", ""message"": ""Successfully requested creation of external resource""}"
2021-07-29T16:58:23.574Z DEBUG provider-aws Reconciling {""controller"": ""managed/routetable.ec2.aws.crossplane.io"", ""request"": ""/vpc-mc-i540621-dmi-ds67r-k6xxz""}"
2021-07-29T16:58:23.780Z DEBUG provider-aws Successfully requested creation of external resource {""controller"": ""managed/routetable.ec2.aws.crossplane.io"", ""request"": ""/vpc-mc-i540621-dmi-ds67r-k6xxz"", ""uid"": ""db33bd1c-151e-499e-a3e6-7bbda6065d2a"", ""version"": ""35012"", ""external-name"": """"}"
2021-07-29T16:58:23.780Z DEBUG controller-runtime.manager.events Normal {""object"": {""kind"":""RouteTable"",""name"":""vpc-mc-i540621-dmi-ds67r-k6xxz"",""uid"":""db33bd1c-151e-499e-a3e6-7bbda6065d2a"",""apiVersion"":""ec2.aws.crossplane.io/v1beta1"",""resourceVersion"":""35017""}, ""reason"": ""CreatedExternalResource"", ""message"": ""Successfully requested creation of external resource""}"
@dee0sap Understood - thanks for pointing that out. I haven't been able to reproduce that particular variant yet but it does make sense that it's possible.
Adding a note here that the same eventually consistent AWS API issue appears to be affecting InternetGateways too - they just happen to not block deletion of the VPC. I noticed a bunch were building up during my testing, and confirmed that the Terraform folks have observed this too per https://github.com/hashicorp/terraform-provider-aws/blob/915f5f5dc3073de0f25ed216dca421bf76e114b9/aws/resource_aws_internet_gateway.go#L365.
Just dropping this link @chlunde found for posterity. It confirms the EC2 API is intended to be eventually consistent.
What happened?
After deleting a composite and waiting, with kubectl wait, for all of the referenced api objects to be deleted, managed resources still existed in AWS. This is a problem for us because this wait logic is a gate used to delay deletion of our k8s cluster. If the managed resources in AWS aren't cleaned up by the time we are done waiting they will most likely end up orphaned. ( We are seeing this often )
How can we reproduce it?
Here is the clean up logic we are performing before allowing the cluster to be deleted
We expect that once we have run through the above steps all managed resources have been deleted from AWS. However this doesn't seem to be the case.
In the attached zip are example
Also in the zip is events-for-vpc-14449-d66-dmi.xlsx which shows the events around one of the leaked VPC.
At line 81 you can see the first event from my team's clean up code. By line 31, all of the resources related to the VPC have been cleaned up from k8s. There is nothing left for our clean up code to delete or wait for. Our last attempt to wait for a resources to be cleaned up is about 9 seconds after the last event for the VPC in CloudTrails. Given all of the deletion attempts that failed with Client.DependencyViolation I am wondering if Crossplane has given up. ( And why are all those failures there in the first place? ) Anyhoo once our cleanup code has nothing left to wait for cluster Gardener is free to begin with the cluster deletion. You can see at line 28, about 45 seconds after my team's clean up code finished, where it begins taking destructive action. It starts with simply going through the api server and deleting all CRD that it can find. Since Gardener is just blindly deleting things if the AWS resources haven't been cleaned up at this point it is highly likely they are going to be orphaned. 14449-d66-dmi-leak.zip
What environment did it happen in?
crossplane:v1.2.2 aws provider provider-aws:v0.18.1 kubernetes 1.19.10