arkime / aws-aio

Apache License 2.0

Error when adding a capture VPC: VPC Endpoint Service does not support Availability Zone #147

Closed spwx closed 7 months ago

spwx commented 8 months ago

Good day,

Have a fresh install of this great project. Cluster creation completed without error. However, unable to add a capture VPC, due to the following error:

Resource handler returned message: "The VPC endpoint service com.amazonaws.vpce.us-east-1.vpce-svc-01f0fa7502b9b4481 does not support the availability zone of the subnet: subnet-08a4d799ec3c08a58.

subnet-08a4d799ec3c08a58 is in the us-east-1f availability zone.

CLI output below, and error log output attached.

Thanks!

Manage Arkime Log

manage_arkime_failure.log

Manage Arkime CLI Output

> ./manage_arkime.py --region us-east-1 vpc-add --cluster-name MyCluster --vpc-id vpc-0bd158d740efc0fe7
2024-01-09 14:18:19 - Debug-level logs save to file: /local/home/spw/projects/aws-aio/manage_arkime/manage_arkime.log
2024-01-09 14:18:19 - Using AWS Credential Profile: None
2024-01-09 14:18:19 - Using AWS Region: us-east-1
2024-01-09 14:18:20 - Deploying shared mirroring components via CDK...
2024-01-09 14:18:20 - Executing command: deploy MyCluster-vpc-0bd158d740efc0fe7-Mirror
2024-01-09 14:18:20 - NOTE: This operation can take a while.  You can 'tail -f' the logfile to track the status.
2024-01-09 14:21:54 - Deployment failed
Traceback (most recent call last):
  File "/local/home/spw/projects/aws-aio/./manage_arkime.py", line 350, in <module>
    main()
  File "/local/home/spw/projects/aws-aio/./manage_arkime.py", line 346, in main
    cli()
  File "/local/home/spw/projects/aws-aio/.venv/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/spw/projects/aws-aio/.venv/lib/python3.11/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/local/home/spw/projects/aws-aio/.venv/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/spw/projects/aws-aio/.venv/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/spw/projects/aws-aio/.venv/lib/python3.11/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/spw/projects/aws-aio/.venv/lib/python3.11/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/local/home/spw/projects/aws-aio/./manage_arkime.py", line 189, in vpc_add
    cmd_vpc_add(profile, region, cluster_name, vpc_id, force_vni, just_print_cfn)
  File "/local/home/spw/projects/aws-aio/manage_arkime/commands/vpc_add.py", line 112, in cmd_vpc_add
    cdk_client.deploy(stacks_to_deploy, context=vpc_add_context)
  File "/local/home/spw/projects/aws-aio/manage_arkime/cdk_interactions/cdk_client.py", line 79, in deploy
    exceptions.raise_deploy_exceptions(exit_code, stdout)
  File "/local/home/spw/projects/aws-aio/manage_arkime/cdk_interactions/cdk_exceptions.py", line 88, in raise_deploy_exceptions
    raise CdkDeployFailedUnknown()
cdk_interactions.cdk_exceptions.CdkDeployFailedUnknown: The CDK Deploy operation failed for unknown reasons, please check the logs and stdout.

chelma commented 8 months ago

Thanks for submitting - will investigate.

chelma commented 8 months ago

Taking a look, I'm wondering if the issue here is that the VPC is in a different account/region than the Arkime Cluster. AWS error messages can be cryptic and I could see the error message being thrown in that scenario. It seems unlikely that the problem is with the particular AZ.

chelma commented 8 months ago

Tested cross-region and cross-account adds; we handle them more gracefully and emit a different error message. See: https://github.com/arkime/aws-aio/blob/main/manage_arkime/commands/vpc_add.py#L83

chelma commented 8 months ago

Taking a closer look at the log file, it appears that the VPC Endpoint Service created for the Arkime Cluster was incompatible with all three subnets in the user's VPC, not just a single one of them, and that the deployment attempted to create a VPC Endpoint in each of them multiple times.

chelma commented 8 months ago

OK - so one guess is something wacky with permissions to add Gateway LB VPC Endpoints to the GWLB Service [1]. This seems pretty unlikely as the error message isn't permission related.

My other guess is that we must ensure that our Gateway Load Balancer has targets to receive traffic (Capture Nodes) in every AZ that's sending traffic to it [2]. In this scenario, the user has traffic sources in AZs where we don't have Capture Nodes; the GWLB somehow knows this and prevents creating VPC Endpoints in those AZs. However... we have cross-zone balancing turned on, which is supposed to resolve issues like this [3]... I'll try to test this scenario tomorrow, as it seems more likely than the first.

[1] https://github.com/arkime/aws-aio/blob/main/cdk-lib/capture-stacks/capture-nodes-stack.ts#L222
[2] https://docs.aws.amazon.com/elasticloadbalancing/latest/gateway/target-groups.html#registered-targets
[3] https://github.com/arkime/aws-aio/blob/main/cdk-lib/capture-stacks/capture-nodes-stack.ts#L51
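One way to check the AZ theory directly is to compare the AZs the endpoint service reports against the AZs of the subnets being added. The following is only a minimal sketch, not code from this repo: the function name is hypothetical, and it assumes a boto3 EC2 client for the same account and region as the subnets, using the `describe_vpc_endpoint_services` and `describe_subnets` calls.

```python
from typing import Dict, Iterable


def subnets_outside_service_azs(
    ec2_client, service_name: str, subnet_ids: Iterable[str]
) -> Dict[str, str]:
    """Return {subnet_id: az_name} for subnets whose AZ the service does not list.

    `ec2_client` is assumed to be a boto3 EC2 client, e.g.
    boto3.client("ec2", region_name="us-east-1").
    """
    # Collect every AZ the endpoint service says it supports.
    services = ec2_client.describe_vpc_endpoint_services(ServiceNames=[service_name])
    supported_azs = set()
    for detail in services.get("ServiceDetails", []):
        supported_azs.update(detail.get("AvailabilityZones", []))

    # Look up the AZ of each subnet and keep the ones the service does not cover.
    subnets = ec2_client.describe_subnets(SubnetIds=list(subnet_ids))
    return {
        subnet["SubnetId"]: subnet["AvailabilityZone"]
        for subnet in subnets.get("Subnets", [])
        if subnet["AvailabilityZone"] not in supported_azs
    }
```

A non-empty result would point at an AZ mismatch rather than a permissions problem. AZ names are mapped per account, so the comparison is only meaningful when both lookups run under the same account.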

chelma commented 8 months ago

User confirmed via Slack that they have subnets in 6 AZs, and their logs seem to indicate that three of them are failing. By default, our Capture VPC only has subnets in 2 AZs [1]. It could be that CloudFormation is only trying 3 AZs and they just happen to be three that we don't have Capture Subnets in, and if it tried all 6 of them at once, 4 would fail. Alternatively, something more interesting could be happening here.

An easy way to test is to up the number of AZs and hosts in our Demo Stacks to replicate this scenario. Will pursue.

[1] https://github.com/arkime/aws-aio/blob/main/manage_arkime/core/capacity_planning.py#L19

chelma commented 8 months ago

Just thinking aloud, but we may need to move to a larger fleet of smaller Capture Nodes to have presence in each AZ. This appears to be exactly what Cross-Zone Balancing [1] is designed for, where as long as the hosts in the entire target group have enough aggregate capacity to handle the load the GWLB will spread traffic from a "hot" AZ across them in a reasonable way and prevent that zone's Capture Node from going under.

The tradeoff is increased latency and a cross-zone dependency, both of which seem like a reasonable sacrifice in this scenario.

[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/network/target-group-cross-zone.html

awick commented 8 months ago

Just need to make sure the cross-zone balancing LBing is sticky, having packets from the same session end up in different AZs would be not good ™️

chelma commented 8 months ago

> Just need to make sure the cross-zone balancing LBing is sticky, having packets from the same session end up in different AZs would be not good ™️

Thanks for the note. We already have cross-zone balancing enabled for our Capture VPC, so if things currently work then we should be good to go? It would be a good idea to confirm that they do work.

chelma commented 8 months ago

Looks like stickiness is easy to set [1], but we're not currently setting it [2] (oops).

[1] https://docs.aws.amazon.com/elasticloadbalancing/latest/APIReference/API_TargetGroupAttribute.html
[2] https://github.com/arkime/aws-aio/blob/main/cdk-lib/capture-stacks/capture-nodes-stack.ts#L49

awick commented 8 months ago

oops, I swear I checked this before, but maybe that was somewhere else or an old version. :(

chelma commented 8 months ago

It looks like there are a few options for looking up a region's AZs so that we can ensure the Capture VPC is present in all of them [1] [2]. We'll need to query this before coming up with our capacity plan.

Also - should be obvious but this new way of provisioning the Capture VPC will result in Cfn deployment failures if run against existing stacks, so we'll need to bump the semver for the package.

[1] https://docs.aws.amazon.com/systems-manager/latest/userguide/parameter-store-public-parameters-global-infrastructure.html
[2] https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2/client/describe_availability_zones.html
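For illustration, the boto3 route [2] might look roughly like the sketch below. The function name is hypothetical and this is not the code that ended up in the fix; it assumes a boto3 EC2 client and uses the `state` and `zone-type` filters of `describe_availability_zones` to skip Local Zones and Wavelength Zones.

```python
from typing import List


def list_available_az_names(ec2_client) -> List[str]:
    """Return the sorted names of the standard AZs currently available in the client's region.

    `ec2_client` is assumed to be a boto3 EC2 client, e.g.
    boto3.client("ec2", region_name="us-east-1").
    """
    response = ec2_client.describe_availability_zones(
        Filters=[
            {"Name": "state", "Values": ["available"]},
            # Exclude Local Zones and Wavelength Zones.
            {"Name": "zone-type", "Values": ["availability-zone"]},
        ]
    )
    return sorted(zone["ZoneName"] for zone in response.get("AvailabilityZones", []))
```

The resulting list could then feed the capacity plan, for example to decide how many Capture Subnets to provision.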

chelma commented 7 months ago

> oops, I swear I checked this before, but maybe that was somewhere else or an old version. :(

Actually, it turns out 5-tuple stickiness is the default for GWLBs; the setting is configured on the Target Group rather than the LB itself, and the docs were a bit confusing on that front. So it has been working all along, AFAIK.
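For anyone who wants to verify what a deployed Target Group is actually using, a sketch along these lines could read the stickiness attributes back. The function name is hypothetical; it assumes a boto3 `elbv2` client and the `stickiness.*` attribute keys listed in the TargetGroupAttribute API reference.

```python
from typing import Dict


def get_stickiness_attributes(elbv2_client, target_group_arn: str) -> Dict[str, str]:
    """Return the stickiness-related attributes of a Target Group as a dict.

    `elbv2_client` is assumed to be a boto3 ELBv2 client, e.g.
    boto3.client("elbv2", region_name="us-east-1").
    """
    response = elbv2_client.describe_target_group_attributes(
        TargetGroupArn=target_group_arn
    )
    # Attributes come back as a list of {"Key": ..., "Value": ...} pairs.
    return {
        attr["Key"]: attr["Value"]
        for attr in response.get("Attributes", [])
        if attr["Key"].startswith("stickiness.")
    }
```

Since the attributes live on the Target Group, the ARN to pass is the Target Group's, not the load balancer's.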

chelma commented 7 months ago

OK - think I got this one figured out, and have posted a fix - https://github.com/arkime/aws-aio/pull/152

chelma commented 7 months ago

Code merged. Reaching out to original user to confirm resolution of issue.

chelma commented 7 months ago

Still waiting for user to retry. Closing this issue and will open a new one if further problems are encountered.