aws / amazon-vpc-resource-controller-k8s

Controller for managing Trunk & Branch Network Interfaces on EKS Cluster using Security Group For Pod feature and IPv4 Addresses for Windows Node.
Apache License 2.0

Unpaginated calls to DescribeNetworkInterfaces getting blocked by EC2 #188

Open ejholmes opened 1 year ago

ejholmes commented 1 year ago

Describe the Bug:

We've been in the process of rolling out security groups for pods across various EKS clusters in multiple AWS accounts that we own. Recently, we've attempted to do the same in one of our "larger" AWS accounts, and noticed that pods would hang indefinitely waiting for an ENI to be attached by vpc-resource-controller.

We recently raised the ENI limits on this account to a very large value and were told that unpaginated calls to DescribeNetworkInterfaces would be blocked as a result. This appears to be affecting vpc-resource-controller, as the following CloudTrail event shows:

{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "<REDACTED>:amazon-vpc-resource-controller-k8s",
        "arn": "arn:aws:sts::<REDACTED>:assumed-role/<REDACTED>/amazon-vpc-resource-controller-k8s",
        "accountId": "<REDACTED>",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "<REDACTED>",
                "arn": "arn:aws:iam::<REDACTED>:role/<REDACTED>-control-20210716225641340500000009",
                "accountId": "<REDACTED>",
                "userName": "<REDACTED>-control-20210716225641340500000009"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2023-03-14T21:23:53Z",
                "mfaAuthenticated": "false"
            }
        },
        "invokedBy": "eks.amazonaws.com"
    },
    "eventTime": "2023-03-14T21:35:11Z",
    "eventSource": "ec2.amazonaws.com",
    "eventName": "DescribeNetworkInterfaces",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "eks.amazonaws.com",
    "userAgent": "eks.amazonaws.com",
    "errorCode": "Client.OperationNotPermitted",
    "errorMessage": "This operation is not permitted.",
    "requestParameters": {
        "networkInterfaceIdSet": {},
        "filterSet": {
            "items": [
                {
                    "name": "tag:vpcresources.k8s.aws/trunk-eni-id",
                    "valueSet": {
                        "items": [
                            {
                                "value": "eni-019dd4ee9f4aa06c8"
                            }
                        ]
                    }
                }
            ]
        }
    },
    "responseElements": null,
    "requestID": "8fd9723c-c4fa-4dd2-82be-47d8f328bf8e",
    "eventID": "fd8c44a0-b04c-430d-9e77-8c478b0f2de3",
    "readOnly": true,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "<REDACTED>",
    "eventCategory": "Management"
}
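
For anyone triaging similar events, here is a minimal Python sketch that pulls the relevant fields out of a CloudTrail record like the one above. The function name `summarize_denied_call` is hypothetical, and the `looks_unpaginated` check rests on an assumption that a paginated request would record a `maxResults` key under `requestParameters`; verify that against your own CloudTrail data before relying on it.

```python
def summarize_denied_call(event: dict) -> dict:
    """Summarize an EC2 CloudTrail record (hypothetical helper, not part of the controller)."""
    params = event.get("requestParameters") or {}

    # Flatten filterSet into {filter name: [values]}.
    filters = {}
    for item in (params.get("filterSet") or {}).get("items", []):
        values = [v["value"] for v in (item.get("valueSet") or {}).get("items", [])]
        filters[item["name"]] = values

    return {
        "event_name": event.get("eventName"),
        "error_code": event.get("errorCode"),
        "denied": event.get("errorCode") == "Client.OperationNotPermitted",
        "filters": filters,
        # Assumption: a paginated request would include "maxResults" here.
        "looks_unpaginated": "maxResults" not in params,
    }


# Trimmed copy of the event from this issue.
event = {
    "eventName": "DescribeNetworkInterfaces",
    "errorCode": "Client.OperationNotPermitted",
    "requestParameters": {
        "networkInterfaceIdSet": {},
        "filterSet": {
            "items": [
                {
                    "name": "tag:vpcresources.k8s.aws/trunk-eni-id",
                    "valueSet": {"items": [{"value": "eni-019dd4ee9f4aa06c8"}]},
                }
            ]
        },
    },
}

summary = summarize_denied_call(event)
print(summary)
```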

For AWS engineers, our case # is 12229078111

Observed Behavior:

Calls to DescribeNetworkInterfaces are blocked by EC2 with a Client.OperationNotPermitted error code.

Expected Behavior:

SGP should work even if unpaginated calls are blocked.

How to reproduce it (as minimally and precisely as possible):

Increase the ENI limits in an AWS account high enough that EC2 blocks unpaginated calls to DescribeNetworkInterfaces.

Additional Context:

Environment:

haouc commented 1 year ago

@ejholmes thanks for reporting the issue. As you mentioned, calls to EC2 shouldn't be blocked by EC2 just because they are unpaginated. I have found the request ID in our system. We need to check with the EC2 team for the root cause. Thanks.

ejholmes commented 1 year ago

Just to update this issue for anybody else who runs into this: the problem was that EC2 was blocking us on ec2:DescribeNetworkInterfaces. We discovered that Glue was leaking ENIs and leaving them orphaned. We cleaned those up with some effort and lowered our ENI limit (we were told we needed to be under 20k to be unblocked).

Once that was done, and EC2 unblocked us from ec2:DescribeNetworkInterfaces, security groups for pods was functional in the account again.

haouc commented 10 months ago

@ejholmes sorry for the late response. Glad to hear it works now. We are exploring whether we can re-architect the workflow that manages interfaces to enable paginated API calls. The reason we can't enable it now is that a paginated call doesn't guarantee the number of response pages, which means the number of API calls we make is uncertain and depends on how many pages EC2 returns. The risk is that this eats into the user's account API limits and throttles the account if the pagination loop is long enough. I will keep this issue open for tracking until we have an approach to share. Thanks.
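
One way to bound the page-count concern described above is to give each lookup an explicit page budget. The following is a rough Python sketch of that idea, not the controller's actual code (the controller is written in Go). `describe_branch_enis`, `PageBudgetExceeded`, and `make_fake_ec2` are hypothetical names. The `Filters`, `MaxResults`, and `NextToken` parameters mirror the EC2 DescribeNetworkInterfaces API, and the fake client exists only so the sketch runs without AWS credentials.

```python
class PageBudgetExceeded(Exception):
    """Raised when a lookup would need more pages than the caller allows."""


def describe_branch_enis(describe_page, trunk_eni_id, page_size=100, max_pages=10):
    """Collect ENIs tagged with the given trunk ENI, using at most max_pages API calls.

    describe_page is any callable with the DescribeNetworkInterfaces shape
    (for example, boto3's ec2_client.describe_network_interfaces).
    """
    filters = [{"Name": "tag:vpcresources.k8s.aws/trunk-eni-id", "Values": [trunk_eni_id]}]
    interfaces = []
    token = None
    for _ in range(max_pages):
        kwargs = {"Filters": filters, "MaxResults": page_size}
        if token:
            kwargs["NextToken"] = token
        resp = describe_page(**kwargs)
        interfaces.extend(resp.get("NetworkInterfaces", []))
        token = resp.get("NextToken")
        if not token:
            return interfaces  # last page reached within budget
    raise PageBudgetExceeded(f"more than {max_pages} pages for {trunk_eni_id}")


def make_fake_ec2(total):
    """Fake paginated endpoint for illustration; the token is just a list offset."""
    enis = [{"NetworkInterfaceId": f"eni-{i:04d}"} for i in range(total)]
    calls = []

    def describe_page(Filters, MaxResults, NextToken=None):
        start = int(NextToken or 0)
        end = start + MaxResults
        calls.append(start)
        resp = {"NetworkInterfaces": enis[start:end]}
        if end < len(enis):
            resp["NextToken"] = str(end)
        return resp

    return describe_page, calls


fake, calls = make_fake_ec2(250)
result = describe_branch_enis(fake, "eni-019dd4ee9f4aa06c8", page_size=100, max_pages=5)
print(len(result), len(calls))
```

With a budget like this, the caller fails fast instead of issuing an unbounded number of requests, at the cost of having to decide what to do when the budget is exceeded (retry later, raise the cap, or surface an error).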