Open mattzech opened 8 months ago
I don't know why is the ec2 API returning an empty FlowLogs array.. Do you see any errors returned by the create step? Also could you confirm whether the resource is getting created or not?
cc-ing @nnbu since he worked on this feature.
Thanks for looking into this @a-hilaly @nnbu
2023-10-26T17:04:10.829Z DEBUG ackrt <<<< rm.sdkCreate {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z DEBUG ackrt <<< r.createResource {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z DEBUG ackrt >>> r.ensureConditions {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z DEBUG ackrt >>>> rm.IsSynced {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z DEBUG ackrt <<<< rm.IsSynced {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z DEBUG ackrt <<< r.ensureConditions {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1}
2023-10-26T17:04:10.829Z DEBUG ackrt << r.Sync {"account": "*********", "role": "arn:aws:iam::*********:role/ack-controllers", "region": "eu-west-1", "kind": "FlowLog", "namespace": "netgen-test", "name": "vpc-flowlog-test", "is_adopted": false, "generation": 1, "error": "resource not found"}
2023-10-26T17:04:10.829Z INFO Observed a panic in reconciler: runtime error: index out of range [0] with length 0 {"controller": "flowlog", "controllerGroup": "ec2.services.k8s.aws", "controllerKind": "FlowLog", "FlowLog": {"name":"vpc-flowlog-test","namespace":"netgen-test"}, "namespace": "netgen-test", "name": "vpc-flowlog-test", "reconcileID": "46854c76-b0fd-4eb0-b7ab-52b4451034c6"}
panic: runtime error: index out of range [0] with length 0 [recovered]
panic: runtime error: index out of range [0] with length 0
goroutine 1821 [running]: sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).Reconcile.func1() /go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.5/pkg/internal/controller/controller.go:119 +0x1e5 panic({0x2061360?, 0xc000907fc8?}) /usr/local/go/src/runtime/panic.go:914 +0x21f github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log.(resourceManager).sdkCreate(0xc0002a14a0, {0x26a14a0, 0xc000fff200}, 0xc0014a84c8) /github.com/aws-controllers-k8s/ec2-controller/pkg/resource/flow_log/sdk.go:251 +0x594
2. Confirming that resource **does not** get created in AWS
Thank you
Could you change the log level to debug @mattzech ? I believe the API might be returning an error.
But still very weird that we reach L251 with 0 elements in the FlowLogs array
I will check further. But if there is not error at L212 , then resource should have been created.
Could you also try executing corresponding CLI command to create flowlog and see if you get the flowlog in the response
aws ec2 create-flow-logs \ --resource-type VPC \ --resource-ids vpc-00112233344556677 \ --traffic-type ALL \ --log-destination-type s3 \ --log-destination arn:aws:s3:::flow-log-bucket/my-custom-flow-logs
(Of course, edit this with the parameters that you provided)
(Taken from https://docs.aws.amazon.com/cli/latest/reference/ec2/create-flow-logs.html )
I am assuming VPC and logDestination exist
Hi @nnbu yes it is very strange that is not hitting the error from L212. I ran the command and it produced the desired output
{
"ClientToken": "123456789NYzyP0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
"FlowLogIds": [
"fl-0001112223334445555"
],
"Unsuccessful": []
}
Is the issue reproducible @mattzech ? i'm wondering whether we hit an API hiccup. Also looks like the response has a Unsuccessful
field... which might mean that the API doesn't return an error but a 200 + Unsuccessful
when things go wrong...
@nnbu Should we check for Unsuccessful
size after the error check in L212? and return an error if any..
Yes the error is reproducible. Looking at the logs I sent earlier, Debug mode seems to be on already. I tested the command with a nonexistent bucket to see how the Unsuccessful
field populates:
{
"ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
"FlowLogIds": [],
"Unsuccessful": [
{
"Error": {
"Code": "400",
"Message": "LogDestination: bad-bucket-test does not exist"
},
"ResourceId": "vpc-000112233445566"
}
]
}
If the error reporting could print the Unsuccessful[0].Error.Message
that could give some more insight on what is happening. I will run some tests to see if it could be a permissions issue
What's the HTTP code sent with this response? I assume 200?
{
"ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=",
"FlowLogIds": [],
"Unsuccessful": [
{
"Error": {
"Code": "400",
"Message": "LogDestination: bad-bucket-test does not exist"
},
"ResourceId": "vpc-000112233445566"
}
]
}
This makes more sense now. We should rely on Unsuccessful
field to determine whether a resource was created successfully or not. cc @nnbu
What's the HTTP code sent with this response? I assume 200?
{ "ClientToken": "123456789P0I4Zd9kSRul7eqDlBXPkCGsCFZw=", "FlowLogIds": [], "Unsuccessful": [ { "Error": { "Code": "400", "Message": "LogDestination: bad-bucket-test does not exist" }, "ResourceId": "vpc-000112233445566" } ] }
This one returned 200. I tried with incorrect permissions and that request returned an error with a 254 Response Code
resp, err = rm.sdkapi.CreateFlowLogsWithContext(ctx, input)
rm.metrics.RecordAPICall("CREATE", "CreateFlowLogs", err)
if err != nil {
return nil, err
}
if resp.Unsuccessful[0] != nil {
return nil, <error message from response>
}
I think checking both err
and Unsuccessful
makes sense
@mattzech Is your LogDestination
non existent?
resp, err = rm.sdkapi.CreateFlowLogsWithContext(ctx, input) rm.metrics.RecordAPICall("CREATE", "CreateFlowLogs", err) if err != nil { return nil, err } if resp.Unsuccessful[0] != nil { return nil, <error message from response> }
I think checking both
err
andUnsuccessful
makes sense
I dont think, returning early is correct approach here. We should just skip L251 in case Unsuccessful
is non nil.
@mattzech Is your
LogDestination
non existent?
It exists, I was able to successfully execute the create-flow-logs
command using the same information from the yaml above. I was just testing the command with an incorrect bucket to see how it returned.
ok. In that case, even after avoiding this crash, your problem would not be solved. Because the resource is not getting created anyway.
If the error message showed the message inside the Unsuccessful
field, I would at least know more about the problem. As it stands, I have no idea what is causing this
Additionally, the panic is causing the controller pod to crash and restart every time it gets to this point. So avoiding the index out of bounds error would be beneficial to allow the reconciler to continue
I dont think, returning early is correct approach here. We should just skip L251 in case Unsuccessful is non nil.
@nnbu doesn't this mean that resource creation encountered an issue and that we should retry or set a terminal condition?
Or, maybe, we could return a re-queue error to force recreation after few seconds?
Issues go stale after 180d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale
.
Stale issues rot after an additional 60d of inactivity and eventually close.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle stale
Stale issues rot after 60d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten
.
Rotten issues close after an additional 60d of inactivity.
If this issue is safe to close now please do so with /close
.
Provide feedback via https://github.com/aws-controllers-k8s/community.
/lifecycle rotten
Describe the bug Any time a FlowLog. resource is created, the controller crashes with the following error message:
I can propose a change in the ec2-controller repository pointing out the problem
Steps to reproduce Create a FlowLog with basic spec
Expected outcome A FlowLog created successfully
Environment Openshift 4.13.12