cortexlabs / cortex

Production infrastructure for machine learning at scale
https://cortexlabs.com/
Apache License 2.0

Can we specify both private and public subnets for cortex deployment? #2090

Closed wilson-zimp closed 3 years ago

wilson-zimp commented 3 years ago

Hello,

I have deployed Cortex into an existing VPC in AWS. The worker nodes are deployed into private subnets, but I want the AWS load balancers to be in public subnets and accessible only via whitelisted IP CIDRs. We follow the NIST architecture for our VPCs.

How can I specify both public and private subnets and also ensure worker nodes are in private subnets while Load Balancers are in public subnets?

Below is the error I see when I specify subnet_visibility: private and provide only private subnets in the subnets array:

    subnets:
      - availability_zone: us-west-2a
        subnet_id: subnet-xxxx
      - availability_zone: us-west-2b
        subnet_id: subnet-yyyy

Error syncing load balancer: failed to ensure load balancer: could not find any suitable subnets for creating the ELB

cluster.yaml file settings are below

    subnet_visibility: private
    api_load_balancer_scheme: internet-facing
    api_load_balancer_cidr_white_list: [99.99.99.99/32]
    operator_load_balancer_scheme: internet-facing
    operator_load_balancer_cidr_white_list: [99.99.99.99/32]
    subnets:
      - availability_zone: us-west-2a
        subnet_id: subnet-xxxx
      - availability_zone: us-west-2b
        subnet_id: subnet-yyyy

Thanks!

deliahu commented 3 years ago

@wilson-zimp Yes, I believe it should be possible in theory, although it is not currently supported by Cortex (for some reason I recall thinking that this was not possible when we originally implemented the feature, so either I was mistaken or eksctl has since enabled it).

That said, it is certainly possible (and common practice) to do this when allowing Cortex to create the VPC; setting subnet_visibility: private in your cluster configuration will achieve your desired configuration (where the nodes will be in private subnets with no external IPs, and the load balancer will be in the public subnets). Is there a reason that you cannot create the VPC during cluster creation time? Generally we recommend this approach (in combination with VPC Peering when necessary, although in your case, since the load balancer will be public, VPC Peering is not necessary).

wilson-zimp commented 3 years ago

@deliahu - we already have a VPC with NAT gateways, networking, and security groups defined as we want. Hence we wanted to use the existing VPC.

deliahu commented 3 years ago

@wilson-zimp Does the VPC that you deploy Cortex into have other applications running in it? Or do you create an empty VPC with the configuration you want, and then run only Cortex in that VPC?

Also, do we (or if not, could we) expose a way to configure the VPC that Cortex creates to meet your needs? What specifically do you configure in your VPC that means that you prefer not to (or cannot) allow Cortex to create its own VPC? The reason I ask is that we are not sure about whether we should support deploying into existing VPCs in the long run; it will make it harder to add additional features on our roadmap (such as more automated cluster upgrades) if we can't assume a "blank slate" VPC that is owned by Cortex. Also, deploying into an existing VPC requires a fair amount of manual configuration (as I'm sure you know) that can be hard to get right and hard to debug. And since Cortex is running alongside other apps, unexpected and unpredictable conflicts could arise. Since environments would not be consistent across different users, this could be hard for us to debug and support.

Sorry for all of the questions, it's because this is an active topic that the team is discussing internally. I'd also be happy to jump on a call if you think that'd be easier (if so, feel free to email me at david@cortex.dev to find a time).

deliahu commented 3 years ago

@wilson-zimp I wanted to follow up on my previous comment; let me know if you have any thoughts. I just want to get a clearer picture of your requirements before we decide whether to expand our support for deploying into an existing VPC.

wilson-zimp commented 3 years ago

Hi @deliahu, sorry for the delayed response. Our use case is this: we have an existing application running in an EKS cluster, and to extend its functionality we want to deploy an ML model served through Cortex. The model behind Cortex is an internal-only application, so it should be reachable only by the services that are already running in the existing EKS cluster.

So we wanted to deploy Cortex within the existing cluster, so that the existing applications/services can reach Cortex without traffic leaving the current EKS cluster. We also wanted to avoid setting up any extra infrastructure, to keep costs down.

Please let me know if this answers your questions; otherwise I can jump on a call and explain in detail.

deliahu commented 3 years ago

Thanks for your response! Yes, I think it might be best to jump on a quick call, since there are a few more questions I'd like to ask about your setup (e.g. would it work to set up VPC peering to your Cortex cluster's VPC?). Feel free to email me at david@cortex.dev and we can find a time!

deliahu commented 3 years ago

I'll go ahead and mark this issue as resolved.

After discussing, the best approach is to allow Cortex to create the VPC, and to configure the API load balancer to be public (api_load_balancer_scheme: internet-facing) and the instance subnets to be private (subnet_visibility: private). If necessary in the future, an IP whitelist can be used to restrict access to the API load balancer (api_load_balancer_cidr_white_list), or the API load balancer can be made private (api_load_balancer_scheme: internal) and VPC peering can be used to connect to the cluster's VPC. Here is our VPC Peering guide for AWS, and it may be possible to configure VPC peering across cloud providers.
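
For reference, a minimal cluster.yaml sketch of the setup described above. It uses only the keys mentioned in this thread and assumes that omitting the subnets list lets Cortex create the VPC itself; the CIDR is the placeholder from the original question, and any other required cluster settings are left out.

```yaml
# cluster.yaml (sketch, not a complete configuration)
# No `subnets` list here: assumption is that Cortex then creates its own VPC.
subnet_visibility: private                 # worker nodes in private subnets, no external IPs
api_load_balancer_scheme: internet-facing  # API load balancer in the public subnets

# Optional: restrict access to the API load balancer (placeholder CIDR)
api_load_balancer_cidr_white_list: [99.99.99.99/32]

# Alternative: make the API load balancer private and connect via VPC peering
# api_load_balancer_scheme: internal
```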