Netflix / metaflow-tools

:rocket: Deployment tools/scripts for Metaflow!
http://www.metaflow.org
Apache License 2.0
52 stars 47 forks source link

Params for existing vpc #21

Closed emocibob closed 3 years ago

emocibob commented 3 years ago

Add optional CF parameters which allow reusing an existing VPC instead of creating a new one.

AWS resources will be put it an existing VPC if all parameters named "Existing*" are given.

Use cases:

queueburt commented 3 years ago

Everything looks solid at a glance. I'm planning to test this with a couple of scenarios tomorrow, will update with more info then.

queueburt commented 3 years ago

I'm fine with the technical functionality here for its core use-case. I've tested it and it seems to behave as expected. I'm just a little questionable on the user-experience. I made some comments on specific code lines, but the short version is that the conditionals are locked down to very specific specific VPC configurations (2 public subnets and at least 1 private subnet, as far as I can tell), and the empty fields seem to mess with Service Catalog.

I realize that CFN isn't the cleanest language to write conditionals in, but is there a tidier way to do this? Perhaps a boolean for "Create a new subnet", and then reference comma-separated lists for both public and private subnets in whatever resources might need them?

emocibob commented 3 years ago

Here's a little more info. First, our main motivation for the PR.

Instead of creating a new VPC, add an option to reuse an existing VPC. Reusing a VPC has multiple advantages:

In our specific case, using our own VPC lets us use a static IP in AWS Batch jobs. We can whitelist the static IP and access protected resources.

About the implementation, referencing resources outside a template can be done in at least two ways:

a) cross-stack references (depends on an existing stack which exports some values) b) provide resources via parameters

Even though b requires more code, it's more flexible since there are no explicit dependencies.

the conditionals are locked down to very specific specific VPC configurations (2 public subnets and at least 1 private subnet, as far as I can tell), and the empty fields seem to mess with Service Catalog.

We can add a "Create a new subnet" boolean parameter, as you suggested, and reference comma-separated lists of subnets (e.g. a list of public and a list of private subnets). But then we again need to "hardcode" if a resource will be put into public or into private subnets. Another solution would be to add an optional comma-separated list of subnets for each resource that's put into a subnet (i.e. a subnets parameter for ComputeEnvironment, a subnets parameter for NLB, etc.). This approach requires more code but it's much more flexible.

Regarding the empty fields, we can use some non-empty placeholder like "NONE" or "false".

queueburt commented 3 years ago

I'd ask @savingoyal to weigh in on how he envisions the user-experience for this to feel, but I don't necessarily have a strong opinion on the "how" of it, as long as whatever gets rolled into the PR supports at least a handful of the most common use-cases likely to be encountered by people reusing VPC's for Metaflow. That said, I'm well aware of the complexity that comes from trying to predict even a fraction of the possible VPC configurations and account for them, and I certainly don't want to ask anyone to reinvent the wheel.

To simplify my ask a bit, I'm really just looking to see if we can 1/ make private subnets optional and not required in the conditional, and (less importantly) 2/ make the public subnet parameters a list similar to private. To be clear, I'm not asking for public and private subnets to be the same list, I'm asking about the benefit of them both being individual lists as opposed to separate parameters for public subnets and a list for private. My reasons for the asks are two-fold:

1/ Please feel free to correct me if I'm wrong on this, but unless I'm missing something after testing and reading the conditionals, users trying to attach new Metaflow infrastructure to existing VPC's created by current versions of this template would be unable to do so as long as private subnets are a requirement. With the intent here being the option to reuse existing VPC's, I'd prefer not to introduce functionality into the template that's incompatible with VPC's created using the template. This feels relatively low-touch, I only see this parameter referenced in the Batch Compute environment and nowhere else.

2/ People using their own VPC's are likely to have more than (2) public subnets. We chose to keep this template relatively low-complexity and only allocated the resources necessary to provide basic resiliency. In practicality, we'd expect to see people using larger self-managed VPC's using higher numbers of matching public/private subnets, and it'd be great to accommodate those use-cases. This is certainly a bigger code change, though, so I'd understand if that's deemed to be a task for a separate PR.

As far as actioning these things, I'm open to whatever makes the most sense. I'm working on a PR for some new functionality, I'm happy to take a stab at incorporating one or both of these myself, but my SLA will be at least a week. Alternatively, if the consensus is to bring this PR in as-is and make adjustments incrementally (if necessary), I'm okay with that too, or anything in between.

About the implementation, referencing resources outside a template can be done in at least two ways:

a) cross-stack references (depends on an existing stack which exports some values) b) provide resources via parameters

Regarding this point, as I said, I don't have a super strong opinion, but it is worth noting that we've avoided cross-stack referencing to-date simply because it either 1/ increases the complexity to the end-user or 2/ increases the number of places we'd need to maintain templates in order to abstract the complexity. If we can simplify the ask and keep it in the same template as parameters, that'd make me the happiest.

emocibob commented 3 years ago

@queueburt Here's some more info.

users trying to attach new Metaflow infrastructure to existing VPC's created by current versions of this template would be unable to do so as long as private subnets are a requirement.

Correct, see below for a different approach.

1/ make private subnets optional and not required in the conditional 2/ make the public subnet parameters a list similar to private we'd expect to see people using larger self-managed VPC's using higher numbers of matching public/private subnets, and it'd be great to accommodate those use-cases

Instead of the approach in this PR, how about adding a parameter (comma-delimited list of subnets) for each resource that requires subnets? In other words, the user can enter a list of subnets (it doesn't matter if they're private or public) for any resource that has a subnet property. Each resource would have its own parameter. I count six such resources (NLB, Fargate, Lambda, RDS DB, SageMaker instance, Batch compute environment). A new VPC is created if the user doesn't provide one or more comma-delimited list of subnets.

Six new parameters might not be the prettiest solution, but it provides an opportunity to the user to create a new VPC, reuse an existing VPC or combine the two. If that sounds good, I can push a commit to this PR with the new approach.

Here's a sample for the approach I outlined above:

Parameters:
  ComputeEnvironmentSubnets:
    Type: CommaDelimitedList
    Default: 'NONE'
    Description: 'Optional, one or more subnets in which to put AWS::Batch::ComputeEnvironment'
  # ...
Conditions:
  CreateOwnVpc: !Or
    - !Equals [ !Ref ComputeEnvironmentSubnets, 'NONE' ]
    - !Equals [ !Ref NLBSubnets, 'NONE' ]
    # ... more parameters
Resources:
  ComputeEnvironment:
    Type: AWS::Batch::ComputeEnvironment
    Properties:
      ComputeResources:
        Subnets: !If
          - !Equals [ !Ref ComputeEnvironmentSubnets, 'NONE' ]  # Check if a list of subnets for the compute environment was provided
          - [ !Ref Subnet1, !Ref Subnet2 ]  # Same subnets used in the template in the master branch
          - !Ref ComputeEnvironmentSubnets  # List of subnets given by user
queueburt commented 3 years ago

I like the flexibility of that proposal. I agree that the addition of an extra parameter per resource isn't the prettiest, but it feels like an efficient way to ensure the template is as flexible as possible within the confines of Cloudformation. Assuming it's cool with Savin and team, I'm happy.

savingoyal commented 3 years ago

@emocibob The current philosophy with the cloud formation template is to provide data scientists a simple way to get started with Metaflow, while fully recognizing that this template might not cover many advanced deployment scenarios. We have so far purposefully kept the parameters limited so as to make it easy for folks who might not be well-versed with AWS-foo to have a straightforward onboarding experience as well as ensure that our testing burden going forward is kept low (the lower the number of toggles, the less we have to test every time we make an update to the template). I wonder if there is a way to maintain two templates and/or maybe add some documentation in admin-docs.metaflow.org around what changes users can go about making to the primary template to re-use their existing VPCs. Let me know your thoughts.

emocibob commented 3 years ago

Hi @savingoyal

The goal of keeping the template relatively simple makes sense. My proposed changes would certainly add to the complexity with new parameters and conditionals. New optional params would also increase the number of input combinations to test.

I wonder if there is a way to maintain two templates

From a maintenance perspective, an additional template would mean extra testing and (manual?) syncing between templates. Tobias Grosse-Puppendahl and I went over the possible solutions and agreed that we'll fork the repo and make our changes there.

maybe add some documentation in admin-docs.metaflow.org around what changes users can go about making to the primary template to re-use their existing VPCs

Once we have a relatively stable and flexible template, we can contribute to the docs with our insights.

I suggest you reject my PR. Thanks for the feedback along the way :)

savingoyal commented 3 years ago

That will be great! Let us know when you have something to share for the documentation. I will archive this PR meanwhile with the note that this is a functional PR for anyone who wants to reuse existing VPCs for configuring their Metaflow deployment. Thank you for your efforts!