Closed — IxDay closed this 1 month ago
Thanks for the fix @IxDay 🙏 @erhancagirici reminded me that we had used `context.WithoutCancel()` because propagating `ctx` would cause early cancelation issues. Have you observed such issues? I'm not sure whether they were easily observable, by the way.
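To make the trade-off being discussed concrete, here is a minimal, hypothetical Go sketch (not the provider's actual code; `backgroundWork` and the timings are illustrative) contrasting a propagated reconcile context, which can cancel in-flight work early, with `context.WithoutCancel`, which detaches the work from cancellation:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backgroundWork simulates an operation that may outlive the reconcile that started it.
func backgroundWork(ctx context.Context, name string) {
	select {
	case <-ctx.Done():
		fmt.Println(name, "canceled early:", ctx.Err())
	case <-time.After(100 * time.Millisecond):
		fmt.Println(name, "finished")
	}
}

func main() {
	// The reconcile context ends before the background work completes.
	reconcileCtx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	go backgroundWork(reconcileCtx, "propagated ctx")                  // canceled early
	go backgroundWork(context.WithoutCancel(reconcileCtx), "detached") // runs to completion

	time.Sleep(200 * time.Millisecond)
}
```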
Yes, I was editing my message to ask if there was a reason for choosing this; it was not obvious from the code/history/comments. Do you have any idea how those early cancellation events would materialize? For the moment we are monitoring our claims through their "Readiness" status, which was flaky when memory was reaching the limit (the underlying provider was not able to issue HTTP calls due to garbage collection pressure). However, since the release of the patch, we have not observed any de-sync yet. To give you an idea of the timeframe: we rolled the change out on Monday on ~150 claims (each composed of ~2 Crossplane GCP provider objects), then on ~170 more yesterday. So far, we have not noticed any unexpected behavior. But once again, we might not be tracking the right thing here, so if you have more details, I would be happy to investigate.
We are starting to roll this out across our entire infrastructure. We still haven't noticed any issues. I am bumping this thread in order to move this forward.
Regarding my previous message, do you have any insights to share that would help me reproduce the earlier bug which motivated the introduction of `WithoutCancel`?
@IxDay, I've been working on this issue from time to time. The fact that you haven't had any issues is great news. I've scheduled a meeting with a team member next week; they have more experience with the problem. Having a memory leak greatly disturbs me. Now that I've addressed some of my urgent tasks, this issue will be at the top of my priority list.
Thank you @ulucinar for providing background information off-channel. We hypothesized that the implementation was ported from the Azure provider, which had context cancelation issues when the context was propagated. It is likely that the GCP provider would never have had any issues in the first place, even if the context was propagated.
We will run some simple tests on our side. If all goes well, we will merge.
Let me know if you need anything. We are really looking forward to this PR being merged.
/test-examples="examples/container/v1beta2/cluster.yaml"
/test-examples="examples/cloudplatform/v1beta1/serviceaccount.yaml"
/test-examples="examples/storage/v1beta2/bucket.yaml"
I will take a look, but I will need a few days first, since I have other priorities at the moment.
This patch fixes the propagation of context cancellation through the call stack. It prevents channel and goroutine leaks from the Terraform provider.
Description of your changes
In order to fix this bug, we tracked the leak down to the underlying Terraform provider. We managed to isolate this function (provider code) using pprof. By adding it to our deployment, we noticed the creation of 2 channels and 2 goroutines per resource every time reconciliation kicked in. All the never-closing goroutines had the same stack trace:
As we can see, the goroutine is waiting for the `Done` channel of the parent "process" to close. However, we can see in the controller bootstrapping that we override the parent context with a `WithoutCancel` context. The implementation source code shows that the channel is then `nil`, and a `nil` channel never closes, blocking the goroutine forever, as showcased in this playground demo.

Fixes #538
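For illustration, here is a minimal sketch of the failure mode (assuming Go 1.21+; this is our own reduction, not the linked playground demo): `context.WithoutCancel` returns a context whose `Done()` channel is `nil`, and receiving from a `nil` channel blocks forever, so any goroutine waiting on it leaks.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

func main() {
	ctx := context.WithoutCancel(context.Background())
	fmt.Println("Done() is nil:", ctx.Done() == nil) // prints true

	go func() {
		<-ctx.Done() // receiving from a nil channel blocks forever
		fmt.Println("never reached")
	}()

	time.Sleep(50 * time.Millisecond)
	// The waiting goroutine is still alive and will never terminate: a leak
	// that accumulates when a new one is started on every reconciliation.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```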
I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR if necessary.

How has this code been tested