Closed — IxDay closed this 1 month ago
Thanks for the fix @IxDay 🙏 @erhancagirici reminded me that we had used `context.WithoutCancel()` because propagating `ctx` would cause early cancelation issues. Have you observed such issues? I'm not sure whether they were easily observable, by the way.
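To make the trade-off being discussed concrete, here is a minimal, hypothetical Go sketch (not the provider's actual code; `backgroundWork` and the timings are illustrative) contrasting a propagated reconcile context, which can cancel in-flight work early, with `context.WithoutCancel`, which detaches the work from cancellation:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// backgroundWork simulates an operation that may outlive the reconcile that started it.
func backgroundWork(ctx context.Context, name string) {
	select {
	case <-ctx.Done():
		fmt.Println(name, "canceled early:", ctx.Err())
	case <-time.After(100 * time.Millisecond):
		fmt.Println(name, "finished")
	}
}

func main() {
	// The reconcile context ends before the background work completes.
	reconcileCtx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()

	go backgroundWork(reconcileCtx, "propagated ctx")                  // canceled early
	go backgroundWork(context.WithoutCancel(reconcileCtx), "detached") // runs to completion

	time.Sleep(200 * time.Millisecond)
}
```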
Yes, I was editing my message to ask if there was a reason for choosing this; it was not obvious from the code/history/comments. Do you have any idea how those early cancellation events would materialize? For the moment we are monitoring our claims through their "Readiness" status, which was flaky when memory was reaching the limit (the underlying provider was not able to issue HTTP calls due to garbage collection pressure). However, since the release of the patch, we have not observed any de-sync yet. To give you an idea of the timeframe: we rolled the change out on Monday on ~150 claims (each composed of ~2 Crossplane GCP provider objects), then on ~170 more yesterday. So far, we have not noticed any unexpected behavior. But once again, we might not be tracking the right thing here, so if you have more details, I would be happy to investigate.
We are starting to roll this out across our entire infrastructure. We still haven't noticed any issues. I am bumping this thread in order to move this forward.
Regarding my previous message, do you have any insights to share that would help me reproduce the earlier bug which motivated the introduction of `WithoutCancel`?
@IxDay, I've been working on this issue from time to time. The fact that you haven't had any issues is great news. I've scheduled a meeting with a team member next week; they have more experience with the problem. Having a memory leak greatly disturbs me. Now that I've addressed some of my urgent tasks, this issue will be at the top of my priority list.
Thank you @ulucinar for providing background information off-channel. We hypothesized that the implementation was ported from the Azure provider, which had context cancelation issues when the context was propagated. It is likely that the GCP provider would never have had any issues in the first place, even if the context was propagated.
We will run some simple tests on our side. If all goes well, we will merge.
Let me know if you need anything. We are really looking forward to this PR being merged.
/test-examples="examples/container/v1beta2/cluster.yaml"
/test-examples="examples/cloudplatform/v1beta1/serviceaccount.yaml"
/test-examples="examples/storage/v1beta2/bucket.yaml"
I will take a look, but I will need a few days first, since I have other priorities at the moment.
This patch fixes the propagation of context cancellation through the call stack. It prevents channel and goroutine leaks from the Terraform provider.
Description of your changes
In order to fix this bug, we tracked the leak down to the underlying Terraform provider. We managed to isolate this function (provider code) using pprof. By adding it to our deployment, we noticed the creation of 2 channels and 2 goroutines per resource every time reconciliation kicked in. All the never-closing goroutines had the same stack trace:
As we can see, the goroutine is waiting for the `Done` channel of the parent "process" to close. However, we can see in the controller bootstrapping that we override the parent context with a `WithoutCancel` context. The implementation source code shows that the channel is then `nil`, and a `nil` channel never closes, blocking the goroutine forever, as showcased in this playground demo.

Fixes #538
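For illustration, here is a minimal sketch of the failure mode (assuming Go 1.21+; this is our own reduction, not the linked playground demo): `context.WithoutCancel` returns a context whose `Done()` channel is `nil`, and receiving from a `nil` channel blocks forever, so any goroutine waiting on it leaks.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

func main() {
	ctx := context.WithoutCancel(context.Background())
	fmt.Println("Done() is nil:", ctx.Done() == nil) // prints true

	go func() {
		<-ctx.Done() // receiving from a nil channel blocks forever
		fmt.Println("never reached")
	}()

	time.Sleep(50 * time.Millisecond)
	// The waiting goroutine is still alive and will never terminate: a leak
	// that accumulates when a new one is started on every reconciliation.
	fmt.Println("goroutines:", runtime.NumGoroutine())
}
```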
I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR if necessary.

How has this code been tested