aws / amazon-genomics-cli

https://aws.github.io/amazon-genomics-cli/
Apache License 2.0
147 stars 83 forks source link

agc context deploy --context myContext fails due to A load balancer cannot be attached to multiple subnets in the same Availability Zone #500

Open golharam opened 2 years ago

golharam commented 2 years ago

Describe the Bug

Following the instructions to get started, when I run

agc context deploy --context myContext

it fails with the error in CloudFormation:

AWS::ElasticLoadBalancingV2::LoadBalancer | toil/Engine/Resource/LB (toilEngineLB8817BA29) A load balancer cannot be attached to multiple subnets in the same Availability Zone (Service: AmazonElasticLoadBalancing)

Steps to Reproduce

agc account activate --vpc vpc-xyz --subnets subnet-abc,subnet-def
agc configure email myemail@mydomain.com
cd examples/demo-cwl-project
agc context deploy --context myContext

where subnet-abc is in us-east-1d, and subnet-def is in us-east-1e

Relevant Logs

Expected Behavior

Expected behavior is that stack would have gotten created without error.

Actual Behavior

Stacks fails to get created.

Additional Context

Operating System: AGC Version: 1.5.0 Was AGC setup with a custom bucket: no Was AGC setup with a custom VPC: yes

markjschreiber commented 2 years ago

@adamnovak can you take a look at this. Might be an issue with how the infrastructure for Toil is being deployed?

@golharam are you able to deploy different engine types using the same account? For example are you able to deploy one of the contexts from https://github.com/aws/amazon-genomics-cli/tree/main/examples/gatk-best-practices-project-miniwdl. If you are then we know this issue is limited to Toil rather than being a general issue.

adamnovak commented 2 years ago

Here's what AGC builds for Cromwell as the engine (which presumably works with multiple --subnets?):

https://github.com/aws/amazon-genomics-cli/blob/3a5a9b52adf03e44452f63e167ea694e5fa251d3/packages/cdk/lib/stacks/engines/cromwell-engine-construct.ts#L60-L80

And here's what it builds for Toil as the engine:

https://github.com/aws/amazon-genomics-cli/blob/57996a1d566db885fa110f1cee2d5821365b1f49/packages/cdk/lib/stacks/engines/toil-engine-construct.ts#L58-L67

With Cromwell, AGC creates an AWS Lambda serverless function to translate WES to what Cromwell can understand, and the ApiProxy is pointed at the Lambda, which has been told the load balance'rs internal hostname and can make calls against it (this.engine.loadBalancer.loadBalancerDnsName).

With Toil, the service behind the load balancer already speaks WES, so we just point the ApiProxy at the load balancer.

The ApiProxy supports being pointed at either a Lambda or directly at a load balancer, so this should work.

It looks like behind the scenes, it uses a LambdaIntegration from an AWS library when pointed at a Lambda, but makes both a VpcLink and an HttpIntegration when pointing directly at a load balancer.

Unfortunately, I don't have enough experience with either the CDK libraries or AWS's VPC connection, load balancer, and API proxy concepts for all this to suggest to me a reason why this would not work. Since the load balancer is created as part of the engine construction, which I think is really the same between Toil and Cromwell engines except for the containers, maybe the differences in front of the load blancer are red herrings? It's not clear to me whether attaching different things in front of the load balancer can, in the CDK, change how the load balancer itself is to be constructed, and result in it trying to attach to multiple subnets when it would not otherwise be.

I think the real problem here might be that --subnets subnet-abc,subnet-def is asking for something impossible, or at least unsupported. Maybe you can only use multiple subnets for AGC if the subnets are all in different availability zones, and it turns out that in this case they happen to be in the same availability zone?

@markjschreiber Is AGC meant to allow you to use multiple subnets in the same availability zone?

adamnovak commented 2 years ago

The load balancers for Toil and Cromwell are both made by calling renderServiceWithTaskDefinition, passing along the subnets and the engine container. And the resulting engine is what provides the load balancer. So if there's a problem with how that is built, it is going to be in or behind that function.

It looks like you added the subnets feature in https://github.com/aws/amazon-genomics-cli/pull/436 @markjschreiber. So you might be better able to figure this out than me.

markjschreiber commented 2 years ago

There is really no value in using more than one subnet from the same AZ as it will not give any resilience in the case of an AZ failure and it won't give AWS Batch access to an increased pool of EC2s instances to choose from if the subnets are in the same AZ. @golharam , does the deployment succeed if you provide subnet-ids from unique AZs?

golharam commented 2 years ago

@golharam are you able to deploy different engine types using the same account? For example are you able to deploy one of the contexts from https://github.com/aws/amazon-genomics-cli/tree/main/examples/gatk-best-practices-project-miniwdl. If you are then we know this issue is limited to Toil rather than being a general issue.

I have not tried the other engine types. The cloudformation error I'm getting is:

A load balancer cannot be attached to multiple subnets in the same Availability Zone (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: InvalidConfigurationRequest; Request ID: 7c1d1c22-197d-4c3c-b386-333f71b596e1; Proxy: null)

So I don't think its specific to Toil, rather its the AWS resource (AWS::ElasticLoadBalancingV2::LoadBalancer) creation. The subnets I'm specifying are in us-east-1d and us-east-1e which (to my understanding) are two different availability zones. These are the same subnets I'm using for ElasticBeanStalk for deployed apps that are also load balanced.

I can still try another engine type, if you think its worth trying.

golharam commented 2 years ago

There is really no value in using more than one subnet from the same AZ as it will not give any resilience in the case of an AZ failure and it won't give AWS Batch access to an increased pool of EC2s instances to choose from if the subnets are in the same AZ. @golharam , does the deployment succeed if you provide subnet-ids from unique AZs?

Due to the way our AWS networks are set up, I only have access to us-east-1d and us-east-1e. I tried to deploy to a single subnet (that exists in either us-east-1d or us-east-1e), but got an error (I can't recall exactly) that I needed two subnets for the load balancer.

markjschreiber commented 2 years ago

us-east-1d and us-east-1e are indeed different availability zones. It's confusing that the ELB is not seeing this unless it is somehow trying to use the same subnet twice. Could you double check that these two subnet IDs do indeed point to different AZs?

I know for certain that you can deploy a miniwdl engine to specific subnets in different AZs, if you can try that this would give us more diagnostic information.

golharam commented 2 years ago

The subnets do in fact point to different AZs. I'm using those same subnets in other AWS applications that use Elastic LoadBalancer.

I tried deploying the wdl context, but that fails for the same reason:

A load balancer cannot be attached to multiple subnets in the same Availability Zone (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: InvalidConfigurationRequest; Request ID: ...; Proxy: null)

So, basically, I'm stuck and can't deploy AGC at all...

markjschreiber commented 2 years ago

Can you check if the subnet names listed in Systems Manager -> Parameter Store -> /agc/_common/InfraSubnets are the values you believe they should be and that all of these values are in a unique AZ? Can you also check that the value stored for the /agc/_common/NumInfraSubnets parameter matches the number of items in /agc/_common/InfraSubnets?

If they are not then they will need re-setting via agc account deactivate.

If they are then this might indicate a bug with AWS CDK which is used by AGC to generate the cloudformation templates that deploy the infrastructure.

golharam commented 2 years ago

They are both correct. Happy to share details privately, if that will help.

subnet 1 - us-east-1d subnet 2 - us-east-1e

/agc/_common/NumInfraSubnets == 2

markjschreiber commented 2 years ago

Just to confirm that the listed subnets conform to the pattern of a subnet id such as subnet-07675blablablabla etc

golharam commented 2 years ago

The subnets we have are of the patten 'subnet-07ablaba'....8 digits.

markjschreiber commented 2 years ago

My strong suspicion is that this might be a bug coming from CDK. In our AGC patch release (1.5.1) we did update to a more recent version of CDK. Possibly that has resolved it, but if not I would recommend filing an issue with the AWS CDK project https://github.com/aws/aws-cdk

bmorrissirromb commented 9 months ago

https://github.com/aws/aws-cdk/issues/7131#issuecomment-1179396752

This seems to be the CDK behavior that is causing the problem. The NetworkLoadBalancedFargateService pulls ALL available subnets, which is a problem if you have multiple subnets in an AZ.

golharam commented 9 months ago

We do have multiple subnets, two of which we want to use for this. Not the rest.

bmorrissirromb commented 9 months ago

My current workaround is (bleh) manually removing the unwanted subnets from the CFT and redeploying manually.

golharam commented 9 months ago

The the subnets from what template? The VPC and subnets are created by our enterprise it group.

It's been at least a year since I tried using this toolkit but iirc, I specified the subnets I want to use in the config file I use with this toolkit.

bmorrissirromb commented 9 months ago

From the CFT generated by agc context deploy. I download it from the CloudFormation console, update it, and then create another stack with the updated CFT.