Netflix / metaflow-tools

:rocket: Deployment tools/scripts for Metaflow!
http://www.metaflow.org
Apache License 2.0

AWS Batch Fargate integration #23

Closed. queueburt closed this 3 years ago

queueburt commented 3 years ago

User-facing changes: A new "BatchType" parameter, set to 'ec2' by default. Users can optionally set it to 'fargate', which changes the type of the AWS Batch Compute Environment that's created to 'FARGATE'.

Considerations: Fundamentally, this just overlays a conditional on the "Type" parameter of the AWS Batch Compute Environment's "ComputeResources" grouping to switch it between EC2 and Fargate. Additionally, in order to enable Fargate, a similar mechanism needs to depopulate the values for MinvCpus, DesiredvCpus, InstanceTypes, and InstanceRole. All of those parameters will be ignored if 'fargate' is selected, even if set by the user. Lastly, in order to facilitate logging, the logs:CreateLogStream and logs:PutLogEvents permissions were added to the ECS Task role. These weren't required previously, as the instance itself logged on behalf of the container.
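To make the conditional concrete, here is a minimal Python sketch of the same logic, not the template itself. The function name `build_compute_resources` and its arguments are hypothetical; the dictionary keys mirror the CloudFormation "ComputeResources" property names. In the actual template this would presumably be expressed with a CloudFormation condition and `AWS::NoValue` rather than Python.

```python
from typing import Any, Dict, List, Optional


def build_compute_resources(
    batch_type: str,
    max_vcpus: int,
    subnets: List[str],
    security_group_ids: List[str],
    min_vcpus: int = 0,
    desired_vcpus: int = 0,
    instance_types: Optional[List[str]] = None,
    instance_role: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the template's BatchType conditional."""
    if batch_type not in ("ec2", "fargate"):
        raise ValueError("batch_type must be 'ec2' or 'fargate'")

    # Properties that are set for both compute environment types.
    resources: Dict[str, Any] = {
        "Type": batch_type.upper(),  # 'EC2' or 'FARGATE'
        "MaxvCpus": max_vcpus,
        "Subnets": list(subnets),
        "SecurityGroupIds": list(security_group_ids),
    }

    # EC2-only properties. For 'fargate' they are dropped entirely,
    # even if the caller supplied values for them.
    if batch_type == "ec2":
        resources["MinvCpus"] = min_vcpus
        resources["DesiredvCpus"] = desired_vcpus
        resources["InstanceTypes"] = list(instance_types or [])
        if instance_role is not None:
            resources["InstanceRole"] = instance_role

    return resources
```

Calling it with `batch_type="fargate"` yields only the Type, MaxvCpus, Subnets, and SecurityGroupIds keys, which matches the "ignored even if set by the user" behavior described above.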

savingoyal commented 3 years ago

@queueburt It seems that we don't explicitly allow the DescribeComputeEnvironments call for AWS Batch in this template (at least I couldn't find it). However, Metaflow is able to issue the call and get a response against this PR as well as the tip of master.

queueburt commented 3 years ago

This is odd. Let me dig in and see if I can figure out the root cause. What's the context in which this call succeeds? Is it a local laptop user, or does it also happen in the context of Batch and its assumed execution roles?

queueburt commented 3 years ago

@savingoyal After some investigation, the likely culprit is that the Describe call always happens client-side (correct me if I'm wrong, I only did a surface-level dig through the code) prior to either submitting the jobs directly to Batch or generating them for Step Functions. Effectively, this would mean that as long as your local user has the batch:DescribeComputeEnvironments permission, the call will pass. I tested this against the restricted user we create with this template and got the expected denial:

2021-02-16 12:06:36.870 [6/start/20 (pid 40503)] An error occurred (AccessDeniedException) when calling the DescribeComputeEnvironments operation: User: arn:aws:sts::xxxxxxxxxxxxx:assumed-role/SC-263972292686-pp-bmcfyqhhtlgtq-MetaflowUserRole-1OWG4UPWZ81EW/botocore-session-1613505996 is not authorized to perform: batch:DescribeComputeEnvironments on resource: *

What's our best fix for this? Given Batch's current requirements, it seems safe to assume that batch:DescribeComputeEnvironments should be allowed for our restricted user.
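One possible shape for that fix, sketched in Python rather than in the template: an IAM policy statement that allows the denied action. The function names and the `Sid` value are hypothetical; the action name and the `*` resource are taken from the error message above. How this statement would be attached to the restricted user role in the template is left open here.

```python
import json
from typing import Any, Dict, Iterable


def describe_compute_environments_statement() -> Dict[str, Any]:
    """Hypothetical statement allowing the call that was denied above."""
    return {
        "Sid": "AllowDescribeComputeEnvironments",  # illustrative name
        "Effect": "Allow",
        "Action": ["batch:DescribeComputeEnvironments"],
        # The denial in the log was reported against resource "*".
        "Resource": "*",
    }


def policy_document(statements: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Wrap statements in a standard IAM policy document."""
    return {"Version": "2012-10-17", "Statement": list(statements)}


if __name__ == "__main__":
    doc = policy_document([describe_compute_environments_statement()])
    print(json.dumps(doc, indent=2))
```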