aws-controllers-k8s / community

AWS Controllers for Kubernetes (ACK) is a project enabling you to manage AWS services from Kubernetes
https://aws-controllers-k8s.github.io/community/
Apache License 2.0
2.36k stars 248 forks source link

Critical error in FlowLog code #1931

Open mattzech opened 8 months ago

mattzech commented 8 months ago

Describe the bug Any time a FlowLog. resource is created, the controller crashes with the following error message:

panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 1860 [running]:
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile.func1()
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:119 +0x1e5
panic({0x2061360?, 0xc0007ac1f8?})
        /usr/local/go/src/runtime/panic.go:914 +0x21f
github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(*resourceManager).sdkCreate(0xc001625ce0, {0x26a14a0, 0xc000ffa330}, 0xc001620120)
        /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/sdk.go:251 +0x594
github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(*resourceManager).Create(0xc000ffa300?, {0x26a14a0?, 0xc000ffa330?}, {0x26a99d8?, 0xc001620120})
        /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/manager.go:127 +0x4b
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).createResource(0xc001124d00, {0x26a14a0, 0xc000ffa330}, {0x26a9970, 0xc001625ce0}, {0x26a99d8, 0xc001620120})
        /go/pkg/mod/github.com/aws-controllers-k8s/runtime@v0.27.1/pkg/runtime/reconciler.go:417 +0x33f
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).Sync(0xc001124d00, {0x26a14a0, 0xc000ffa330}, {0x26a9970, 0xc001625ce0}, {0x26a99d8, 0xc001620120})
        /go/pkg/mod/github.com/aws-controllers-k8s/runtime@v0.27.1/pkg/runtime/reconciler.go:275 +0x645
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).reconcile(0xc001124d00, {0x26a14a0, 0xc000ffa330}, {0x26a9970, 0xc001625ce0}, {0x26a99d8, 0xc001620120})
        /go/pkg/mod/github.com/aws-controllers-k8s/runtime@v0.27.1/pkg/runtime/reconciler.go:215 +0x1cd
github.com/aws-controllers-k8s/runtime/pkg/runtime.(*resourceReconciler).Reconcile(0xc001124d00, {0x26a14a0, 0xc000a20db0}, {{{0xc000594210?, 0x0?}, {0xc000594200?, 0x4105a5?}}})
        /go/pkg/mod/github.com/aws-controllers-k8s/runtime@v0.27.1/pkg/runtime/reconciler.go:186 +0x75b
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile(0x26a14a0?, {0x26a14a0?, 0xc000a20db0?}, {{{0xc000594210?, 0x1ca8ee0?}, {0xc000594200?, 0x0?}}})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:122 +0xb7
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler(0xc0011f8820, {0x26a14d8, 0xc000910280}, {0x1f297c0?, 0xc0000a2440?})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:323 +0x345
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem(0xc0011f8820, {0x26a14d8, 0xc000910280})
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:274 +0x1c9
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2()
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:235 +0x79
created by sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2 in goroutine 1216
        /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:231 +0x565

I can propose a change in the ec2-controller repository pointing out the problem

Steps to reproduce Create a FlowLog with basic spec

apiVersion: ec2.services.k8s.aws/v1alpha1
kind: FlowLog
metadata:
  name: flowlog-test
  namespace: test
spec:
  logDestination: "arn:aws:s3:::vpc-flowlogs-test"
  logDestinationType: s3
  resourceID: vpc-03b59daf7264a79c7
  resourceType: VPC
  trafficType: ALL

Expected outcome A FlowLog created successfully

Environment Openshift 4.13.12

a-hilaly commented 8 months ago

I don't know why is the ec2 API returning an empty FlowLogs array.. Do you see any errors returned by the create step? Also could you confirm whether the resource is getting created or not?

cc-ing @nnbu since he worked on this feature.

mattzech commented 8 months ago

Thanks for looking into this @a-hilaly @nnbu

  1. No other errors in the create step. I will include all the logs below.
    
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<<< rm.sdkCreate       {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<< r.createResource    {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   >>> r.ensureConditions  {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   >>>> rm.IsSynced        {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<<< rm.IsSynced        {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   <<< r.ensureConditions  {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
    2023-10-26T17:04:10.829Z        DEBUG   ackrt   << r.Sync       {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1, "error": "resource not found"}
    2023-10-26T17:04:10.829Z        INFO    Observed a panic in reconciler: runtime error: index out of range [0] with length 0     {"controller": "flowlog", "controllerGroup": "ec2.services.k8s.aws", "controllerKind": "FlowLog", "FlowLog": {"name":"vpc-flowlog-test","namespace":"netgen-test"}, "namespace": "netgen-test", "name": "vpc-flowlog-test", "reconcileID": "46854c76-b0fd-4eb0-b7ab-52b4451034c6"}
    panic: runtime error: index out of range [0] with length 0 [recovered]
        panic: runtime error: index out of range [0] with length 0

goroutine 1821 [running]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile.func1() /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:119 +0x1e5 panic({0x2061360?, 0xc000907fc8?}) /usr/local/go/src/runtime/panic.go:914 +0x21f github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(resourceManager).sdkCreate(0xc0002a14a0, {0x26a14a0, 0xc000fff200}, 0xc0014a84c8) /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/sdk.go:251 +0x594


2. Confirming that resource **does not** get created in AWS
Thank you
a-hilaly commented 8 months ago

Could you change the log level to debug @mattzech ? I believe the API might be returning an error.

a-hilaly commented 8 months ago

But still very weird that we reach L251 with 0 elements in the FlowLogs array

nnbu commented 8 months ago

I will check further. But if there is not error at L212 , then resource should have been created.

Could you also try executing corresponding CLI command to create flowlog and see if you get the flowlog in the response aws ec2 create-flow-logs \ --resource-type VPC \ --resource-ids vpc-00112233344556677 \ --traffic-type ALL \ --log-destination-type s3 \ --log-destination arn:aws:s3:::flow-log-bucket/my-custom-flow-logs

(Of course, edit this with the parameters that you provided)

(Taken from https://docs.aws.amazon.com/cli/latest/reference/ec2/create-flow-logs.html )

nnbu commented 8 months ago

I am assuming VPC and logDestination exist

mattzech commented 8 months ago

Hi @nnbu yes it is very strange that is not hitting the error from L212. I ran the command and it produced the desired output

{
    "ClientToken": "123456789NYzyP0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [
        "fl-0001112223334445555"
    ],
    "Unsuccessful": []
}
a-hilaly commented 8 months ago

Is the issue reproducible @mattzech ? i'm wondering whether we hit an API hiccup. Also looks like the response has a Unsuccessful field... which might mean that the API doesn't return an error but a 200 + Unsuccessful when things go wrong... @nnbu Should we check for Unsuccessful size after the error check in L212? and return an error if any..

mattzech commented 8 months ago

Yes the error is reproducible. Looking at the logs I sent earlier, Debug mode seems to be on already. I tested the command with a nonexistent bucket to see how the Unsuccessful field populates:

{
    "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [],
    "Unsuccessful": [
        {
            "Error": {
                "Code": "400",
                "Message": "LogDestination: bad-bucket-test does not exist"
            },
            "ResourceId": "vpc-000112233445566"
        }
    ]
}
mattzech commented 8 months ago

If the error reporting could print the Unsuccessful[0].Error.Message that could give some more insight on what is happening. I will run some tests to see if it could be a permissions issue

a-hilaly commented 8 months ago

What's the HTTP code sent with this response? I assume 200?

{
    "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [],
    "Unsuccessful": [
        {
            "Error": {
                "Code": "400",
                "Message": "LogDestination: bad-bucket-test does not exist"
            },
            "ResourceId": "vpc-000112233445566"
        }
    ]
}
a-hilaly commented 8 months ago

This makes more sense now. We should rely on Unsuccessful field to determine whether a resource was created successfully or not. cc @nnbu

mattzech commented 8 months ago

What's the HTTP code sent with this response? I assume 200?

{
    "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
    "FlowLogIds": [],
    "Unsuccessful": [
        {
            "Error": {
                "Code": "400",
                "Message": "LogDestination: bad-bucket-test does not exist"
            },
            "ResourceId": "vpc-000112233445566"
        }
    ]
}

This one returned 200. I tried with incorrect permissions and that request returned an error with a 254 Response Code

mattzech commented 8 months ago
resp, err = rm.sdkapi.CreateFlowLogsWithContext(ctx, input)
    rm.metrics.RecordAPICall("CREATE", "CreateFlowLogs", err)
    if err != nil {
        return nil, err
    }
        if resp.Unsuccessful[0] != nil {
                 return nil, <error message from response>
        }

I think checking both err and Unsuccessful makes sense

nnbu commented 8 months ago

If Unsuccessful then we go to L245 . So, it is worth checking if Unsuccessful condition happened, and if so, don't go to L251

nnbu commented 8 months ago

@mattzech Is your LogDestination non existent?

nnbu commented 8 months ago
resp, err = rm.sdkapi.CreateFlowLogsWithContext(ctx, input)
  rm.metrics.RecordAPICall("CREATE", "CreateFlowLogs", err)
  if err != nil {
      return nil, err
  }
        if resp.Unsuccessful[0] != nil {
                 return nil, <error message from response>
        }

I think checking both err and Unsuccessful makes sense

I dont think, returning early is correct approach here. We should just skip L251 in case Unsuccessful is non nil.

mattzech commented 8 months ago

@mattzech Is your LogDestination non existent?

It exists, I was able to successfully execute the create-flow-logs command using the same information from the yaml above. I was just testing the command with an incorrect bucket to see how it returned.

nnbu commented 8 months ago

ok. In that case, even after avoiding this crash, your problem would not be solved. Because the resource is not getting created anyway.

mattzech commented 8 months ago

If the error message showed the message inside the Unsuccessful field, I would at least know more about the problem. As it stands, I have no idea what is causing this

mattzech commented 8 months ago

Additionally, the panic is causing the controller pod to crash and restart every time it gets to this point. So avoiding the index out of bounds error would be beneficial to allow the reconciler to continue

a-hilaly commented 8 months ago

I dont think, returning early is correct approach here. We should just skip L251 in case Unsuccessful is non nil.

@nnbu doesn't this mean that resource creation encountered an issue and that we should retry or set a terminal condition?

a-hilaly commented 8 months ago

Or, maybe, we could return a re-queue error to force recreation after few seconds?

ack-bot commented 2 months ago

Issues go stale after 180d of inactivity. Mark the issue as fresh with /remove-lifecycle stale. Stale issues rot after an additional 60d of inactivity and eventually close. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle stale

ack-bot commented 1 week ago

Stale issues rot after 60d of inactivity. Mark the issue as fresh with /remove-lifecycle rotten. Rotten issues close after an additional 60d of inactivity. If this issue is safe to close now please do so with /close. Provide feedback via https://github.com/aws-controllers-k8s/community. /lifecycle rotten