Open golharam opened 2 years ago
@adamnovak can you take a look at this. Might be an issue with how the infrastructure for Toil is being deployed?
@golharam are you able to deploy different engine types using the same account? For example are you able to deploy one of the contexts from https://github.com/aws/amazon-genomics-cli/tree/main/examples/gatk-best-practices-project-miniwdl
. If you are then we know this issue is limited to Toil rather than being a general issue.
Here's what AGC builds for Cromwell as the engine (which presumably works with multiple --subnets
?):
And here's what it builds for Toil as the engine:
With Cromwell, AGC creates an AWS Lambda serverless function to translate WES to what Cromwell can understand, and the ApiProxy
is pointed at the Lambda, which has been told the load balance'rs internal hostname and can make calls against it (this.engine.loadBalancer.loadBalancerDnsName
).
With Toil, the service behind the load balancer already speaks WES, so we just point the ApiProxy
at the load balancer.
The ApiProxy
supports being pointed at either a Lambda or directly at a load balancer, so this should work.
It looks like behind the scenes, it uses a LambdaIntegration from an AWS library when pointed at a Lambda, but makes both a VpcLink
and an HttpIntegration
when pointing directly at a load balancer.
Unfortunately, I don't have enough experience with either the CDK libraries or AWS's VPC connection, load balancer, and API proxy concepts for all this to suggest to me a reason why this would not work. Since the load balancer is created as part of the engine construction, which I think is really the same between Toil and Cromwell engines except for the containers, maybe the differences in front of the load blancer are red herrings? It's not clear to me whether attaching different things in front of the load balancer can, in the CDK, change how the load balancer itself is to be constructed, and result in it trying to attach to multiple subnets when it would not otherwise be.
I think the real problem here might be that --subnets subnet-abc,subnet-def
is asking for something impossible, or at least unsupported. Maybe you can only use multiple subnets for AGC if the subnets are all in different availability zones, and it turns out that in this case they happen to be in the same availability zone?
@markjschreiber Is AGC meant to allow you to use multiple subnets in the same availability zone?
The load balancers for Toil and Cromwell are both made by calling renderServiceWithTaskDefinition
, passing along the subnets and the engine container. And the resulting engine
is what provides the load balancer. So if there's a problem with how that is built, it is going to be in or behind that function.
It looks like you added the subnets feature in https://github.com/aws/amazon-genomics-cli/pull/436 @markjschreiber. So you might be better able to figure this out than me.
There is really no value in using more than one subnet from the same AZ as it will not give any resilience in the case of an AZ failure and it won't give AWS Batch access to an increased pool of EC2s instances to choose from if the subnets are in the same AZ. @golharam , does the deployment succeed if you provide subnet-ids from unique AZs?
@golharam are you able to deploy different engine types using the same account? For example are you able to deploy one of the contexts from
https://github.com/aws/amazon-genomics-cli/tree/main/examples/gatk-best-practices-project-miniwdl
. If you are then we know this issue is limited to Toil rather than being a general issue.
I have not tried the other engine types. The cloudformation error I'm getting is:
A load balancer cannot be attached to multiple subnets in the same Availability Zone (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: InvalidConfigurationRequest; Request ID: 7c1d1c22-197d-4c3c-b386-333f71b596e1; Proxy: null)
So I don't think its specific to Toil, rather its the AWS resource (AWS::ElasticLoadBalancingV2::LoadBalancer) creation. The subnets I'm specifying are in us-east-1d and us-east-1e which (to my understanding) are two different availability zones. These are the same subnets I'm using for ElasticBeanStalk for deployed apps that are also load balanced.
I can still try another engine type, if you think its worth trying.
There is really no value in using more than one subnet from the same AZ as it will not give any resilience in the case of an AZ failure and it won't give AWS Batch access to an increased pool of EC2s instances to choose from if the subnets are in the same AZ. @golharam , does the deployment succeed if you provide subnet-ids from unique AZs?
Due to the way our AWS networks are set up, I only have access to us-east-1d and us-east-1e. I tried to deploy to a single subnet (that exists in either us-east-1d or us-east-1e), but got an error (I can't recall exactly) that I needed two subnets for the load balancer.
us-east-1d and us-east-1e are indeed different availability zones. It's confusing that the ELB is not seeing this unless it is somehow trying to use the same subnet twice. Could you double check that these two subnet IDs do indeed point to different AZs?
I know for certain that you can deploy a miniwdl engine to specific subnets in different AZs, if you can try that this would give us more diagnostic information.
The subnets do in fact point to different AZs. I'm using those same subnets in other AWS applications that use Elastic LoadBalancer.
I tried deploying the wdl context, but that fails for the same reason:
A load balancer cannot be attached to multiple subnets in the same Availability Zone (Service: AmazonElasticLoadBalancing; Status Code: 400; Error Code: InvalidConfigurationRequest; Request ID: ...; Proxy: null)
So, basically, I'm stuck and can't deploy AGC at all...
Can you check if the subnet names listed in Systems Manager -> Parameter Store -> /agc/_common/InfraSubnets
are the values you believe they should be and that all of these values are in a unique AZ? Can you also check that the value stored for the /agc/_common/NumInfraSubnets
parameter matches the number of items in /agc/_common/InfraSubnets
?
If they are not then they will need re-setting via agc account deactivate
.
If they are then this might indicate a bug with AWS CDK which is used by AGC to generate the cloudformation templates that deploy the infrastructure.
They are both correct. Happy to share details privately, if that will help.
subnet 1 - us-east-1d subnet 2 - us-east-1e
/agc/_common/NumInfraSubnets == 2
Just to confirm that the listed subnets conform to the pattern of a subnet id such as subnet-07675blablablabla
etc
The subnets we have are of the patten 'subnet-07ablaba'....8 digits.
My strong suspicion is that this might be a bug coming from CDK. In our AGC patch release (1.5.1) we did update to a more recent version of CDK. Possibly that has resolved it, but if not I would recommend filing an issue with the AWS CDK project https://github.com/aws/aws-cdk
https://github.com/aws/aws-cdk/issues/7131#issuecomment-1179396752
This seems to be the CDK behavior that is causing the problem. The NetworkLoadBalancedFargateService pulls ALL available subnets, which is a problem if you have multiple subnets in an AZ.
We do have multiple subnets, two of which we want to use for this. Not the rest.
My current workaround is (bleh) manually removing the unwanted subnets from the CFT and redeploying manually.
The the subnets from what template? The VPC and subnets are created by our enterprise it group.
It's been at least a year since I tried using this toolkit but iirc, I specified the subnets I want to use in the config file I use with this toolkit.
From the CFT generated by agc context deploy
. I download it from the CloudFormation console, update it, and then create another stack with the updated CFT.
Describe the Bug
Following the instructions to get started, when I run
it fails with the error in CloudFormation:
Steps to Reproduce
where subnet-abc is in us-east-1d, and subnet-def is in us-east-1e
Relevant Logs
Expected Behavior
Expected behavior is that stack would have gotten created without error.
Actual Behavior
Stacks fails to get created.
Additional Context
Operating System: AGC Version: 1.5.0 Was AGC setup with a custom bucket: no Was AGC setup with a custom VPC: yes