crossplane / upjet

A code generation framework and runtime for Crossplane providers
Apache License 2.0
320 stars 90 forks source link

Recover from panics in async external clients #428

Closed mergenci closed 3 months ago

mergenci commented 3 months ago

Description of your changes

This PR implements a panic recovery mechanism for async external clients. Controller runtime can be configured to recover from panics, which Crossplane runtime adopted in https://github.com/crossplane/crossplane-runtime/pull/493. Unfortunately, that recovery mechanism doesn't apply to the Goroutines started by async external clients (an example). Therefore, panics in async code would crash pods as reported in https://github.com/crossplane-contrib/provider-upjet-aws/issues/1242.

With this PR, panic is logged and reconciliation continues, without the pod crashing.

2024-08-22T12:20:22+03:00   DEBUG   provider-aws    Async create ended. {"trackerUID": "b1a58d9b-2236-4740-9a90-c169b5d2e538", "resourceName": "test-async-panic-handler-securitygroupingressrule", "gvk": "ec2.aws.upbound.io/v1beta1, Kind=SecurityGroupIngressRule", "error": "async create failed: panic: runtime error: index out of range [0] with length 0 [recovered]"}

There is one behavioral difference compared to the sync external client panic handling: Upon panicking during resource creation, async external client continues reconciling the resource and panicking each time, whereas sync external client stops reconciliation because of the existence of external-create-failed annotation on the resource. The async behavior is different, because Crossplane runtime is not async aware. Async clients let Crossplane runtime set external-create-succeeded annotation in order to be able to do the creation asynchronously. Having external-create-succeeded annotation prevents reconciliation from stopping even in the existence of external-create-failed annotation.

I have:

How has this code been tested

Thanks to @haarchri's discovery, I triggered a panic using the manifest below. SecurityGroupIngressRule is a Terraform Plugin Framework resource, whereas SecurityGroup and VPC are Plugin SDKv2 resources.

apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroupIngressRule
metadata:
  name: test-async-panic-handler-securitygroupingressrule
spec:
  forProvider:
    # Omitting `cidrIpv4` panics the provider.
    # cidrIpv4: 10.0.0.0/8
    fromPort: 8080
    ipProtocol: tcp
    region: us-west-1
    securityGroupIdRef:
      name: test-async-panic-handler-securitygroup
    tags:
      Name: test-async-panic-handler-securitygroupingressrule
    toPort: 8080

---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: SecurityGroup
metadata:
  name: test-async-panic-handler-securitygroup
spec:
  forProvider:
    region: us-west-1
    tags:
      Name: test-async-panic-handler-securitygroup
    vpcIdRef:
      name: test-async-panic-handler-vpc

---
apiVersion: ec2.aws.upbound.io/v1beta1
kind: VPC
metadata:
  name: test-async-panic-handler-vpc
spec:
  forProvider:
    cidrBlock: 172.16.0.0/16
    region: us-west-1
    tags:
      Name: test-async-panic-handler-vpc

Terraform provider panics with message:

E0822 12:19:21.286186   84654 runtime.go:79] Observed a panic: runtime.boundsError{x:0, y:0, signed:true, code:0x0} (runtime error: index out of range [0] with length 0)