awslabs / benchmark-ai

Anubis (formerly known as Benchmark AI) measures the goodness of machine learning workloads
Apache License 2.0

Make anubis endpoint reachable from Native AWS #1013

Closed: surajkota closed this issue 3 years ago

surajkota commented 4 years ago

Currently the endpoint is only accessible from inside the Amazon firewall. Customers/teams are building their infrastructure on Native AWS and would like to submit jobs from native AWS services.

First step: find a way to make the endpoint reachable from EC2.

surajkota commented 4 years ago

Here is what we have tried so far:

We created an EC2 instance in the same VPC as the EKS cluster, in the subnet named "test-vpc-public-*", and attached the anubis_bff_external_access and k8s-elb-a7cc399d8c48b11e9953f1286abd5a26 security groups. The instance got a public IP, so after allowing SSH access we were able to log in, but the endpoint was not reachable.

We also tried adding the subnet's CIDR block to the inbound rules of both security groups (as we discussed, AWS effectively applies the most permissive of all the security group rules), but the endpoint was still not reachable.
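Because security group rules are additive, an inbound connection is admitted if any rule in any attached group matches its source address. The sketch below models that union check with Python's standard `ipaddress` module. One possible explanation for the failure above, offered as an assumption rather than something confirmed in this thread: when an instance resolves an internet-facing ELB's DNS name, its traffic arrives from the instance's public IP, not its private subnet address, so a private subnet CIDR never matches. The function name and all addresses are hypothetical.

```python
import ipaddress
from typing import Iterable


def is_source_allowed(source_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True if source_ip falls inside at least one allowed CIDR.

    Models additive security group semantics: the effective inbound policy
    is the union of every rule, so a single matching rule is enough.
    Addresses of a different IP version never match (ipaddress returns False).
    """
    addr = ipaddress.ip_address(source_ip)
    return any(
        addr in ipaddress.ip_network(cidr, strict=False) for cidr in allowed_cidrs
    )


if __name__ == "__main__":
    # Hypothetical values: a private subnet range and a public (TEST-NET-3) address.
    subnet_rules = ["10.0.0.0/16"]
    print(is_source_allowed("10.0.1.25", subnet_rules))     # private address matches
    print(is_source_allowed("203.0.113.10", subnet_rules))  # public address does not
```

If the assumption holds, the fix would be to allow the instance's public IP (or its NAT gateway's Elastic IP) rather than the subnet range.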

gavinmbell commented 4 years ago

Did you add the CIDR block to the anubis-setup call? I believe EKS is where the CIDR information is used: EKS applies it to the bff "service" so that it is accessible. If you added it outside of anubis-setup, you may not get the desired effect, so try adding it to the anubis-setup command. We can try to dig up an example command-line invocation.

perdasilva commented 4 years ago

Hey guys,

Sorry about the delay. To help you understand how the security group IP range restriction is set and configured, here's the PR that fixed the security issues: #989

It updates the baidriver to query the CIDR blocks for the supplied prefix list ID and inject them into the infrastructure config map. This means nothing changed on the anubis-setup side, i.e. it still takes a prefix list ID as a parameter.
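The lookup step described above can be sketched with the boto3 EC2 client's `get_managed_prefix_list_entries` call, which returns `Entries` (each with a `Cidr`) and pages through results via `NextToken`. This is only an illustration: the thread does not say which API call PR #989 actually uses, and the function name `prefix_list_cidrs` is hypothetical. The client is passed in so the sketch runs without AWS credentials.

```python
from typing import Any, Dict, List


def prefix_list_cidrs(ec2_client: Any, prefix_list_id: str) -> List[str]:
    """Collect every CIDR in a managed prefix list (hypothetical helper).

    ec2_client is expected to behave like boto3.client("ec2"); the call
    is paginated, so keep requesting pages until NextToken is absent.
    """
    cidrs: List[str] = []
    request: Dict[str, str] = {"PrefixListId": prefix_list_id}
    while True:
        page = ec2_client.get_managed_prefix_list_entries(**request)
        cidrs.extend(entry["Cidr"] for entry in page.get("Entries", []))
        token = page.get("NextToken")
        if not token:
            return cidrs
        request["NextToken"] = token
```

With boto3 installed and credentials configured, `prefix_list_cidrs(boto3.client("ec2"), "pl-0123456789abcdef0")` would return the list of CIDR strings (the prefix list ID here is made up).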

The values are then applied to the bff and prometheus service resources as service.beta.kubernetes.io/load-balancer-source-ranges annotations.
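A minimal sketch of what such an annotated Service could look like, expressed as a Python dict that can be dumped to JSON and applied with `kubectl apply -f`. The annotation key comes from the comment above and takes a comma-separated list of CIDRs; the service name, port, and selector are hypothetical, not taken from the Anubis manifests.

```python
import json
from typing import Any, Dict, List

SOURCE_RANGES_ANNOTATION = "service.beta.kubernetes.io/load-balancer-source-ranges"


def bff_service_manifest(name: str, port: int, source_ranges: List[str]) -> Dict[str, Any]:
    """Build a LoadBalancer Service restricted to the given source CIDRs.

    name, port, and the app selector are illustrative placeholders.
    """
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {
            "name": name,
            # The cloud provider turns this into ingress rules on the ELB's security group.
            "annotations": {SOURCE_RANGES_ANNOTATION: ",".join(source_ranges)},
        },
        "spec": {
            "type": "LoadBalancer",
            "selector": {"app": name},
            "ports": [{"port": port, "targetPort": port, "protocol": "TCP"}],
        },
    }


if __name__ == "__main__":
    print(json.dumps(bff_service_manifest("bff", 80, ["10.0.0.0/16", "192.0.2.0/24"]), indent=2))
```

Because the security group is derived from this annotation, changes made to it by hand in the console may be overwritten when Kubernetes reconciles the Service.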

Under the hood, Kubernetes spawns the AWS resources to make the service externally available (ELB, etc.). It creates a security group with the service ports, restricts inbound traffic to the IP range set in the service.beta.kubernetes.io/load-balancer-source-ranges annotation, and attaches it to the ELB.

It might be worthwhile to investigate the service.beta.kubernetes.io/aws-load-balancer-internal annotation. Then you might get away with not setting any source ranges, although you would need to use the internal DNS name.
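To make the internal-ELB idea concrete, here is a hedged sketch that takes an existing LoadBalancer Service manifest (as a dict) and marks it internal, optionally dropping the source-range restriction. The value `"true"` follows the Kubernetes documentation for the in-tree AWS provider; older releases documented `"0.0.0.0/0"` instead, so check what the cluster's version expects. The helper name is hypothetical.

```python
import copy
from typing import Any, Dict

INTERNAL_ANNOTATION = "service.beta.kubernetes.io/aws-load-balancer-internal"
SOURCE_RANGES_ANNOTATION = "service.beta.kubernetes.io/load-balancer-source-ranges"


def make_internal(service: Dict[str, Any], keep_source_ranges: bool = False) -> Dict[str, Any]:
    """Return a copy of a Service manifest that requests an internal ELB.

    An internal ELB only gets private addresses, so it is reachable from
    inside the VPC (or connected networks), not from the internet.
    The input manifest is left unmodified.
    """
    internal = copy.deepcopy(service)
    annotations = internal.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[INTERNAL_ANNOTATION] = "true"
    if not keep_source_ranges:
        # With no source ranges set, access is bounded by network reachability instead.
        annotations.pop(SOURCE_RANGES_ANNOTATION, None)
    return internal
```

Keeping the source ranges remains an option if defense in depth is wanted inside the VPC.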

@Chancebair they want to make the bff internally available (reachable by EC2 instances in their account). What would be the standard practice here? Create a new VPC and allow inter-VPC communication somehow?

gavinmbell commented 4 years ago

One option for letting outside traffic connect to the BFF would be to put the BFF behind API Gateway.

perdasilva commented 4 years ago

I still feel like we need to understand the use case a bit better.

Will the customers/teams have their own deployment of Anubis? Or will you be managing the deployment for them?

In this case, I think the only safe/sensible thing to do is to deploy the bff behind API Gateway - and manage the security at that point.

If Anubis will be deployed in the same account as the callers, we could consider internal ELBs and VPC-to-VPC or subnet-to-subnet communication, i.e. keep the networking internal to the account and avoid having the bff serve calls from the wild.

YangFei1990 commented 4 years ago

As an example use case: I have code in AWS CodeCommit and I want to use AWS CodeBuild to build and test it. In that case, is it possible for AWS CodeBuild to connect to the Anubis endpoint? There might be two blockers:

  1. We need to set up the Anubis environment so that the Anubis API is runnable.
  2. The Anubis endpoint can only be reached inside the corporate network.

arjkesh commented 4 years ago

Hey folks, I wanted to re-open this discussion. We are trying to submit TOML files from an EC2 instance or a CodeBuild job. Is there a workaround where we can whitelist a CIDR block from a VPC we create, and then launch instances from that VPC to run Anubis jobs? Other workaround ideas are welcome.

ryansteakley commented 3 years ago

PR: https://github.com/awslabs/benchmark-ai/pull/1031 (Anubis reachable from NAWS)