concourse / prod

bosh/terraform config for our deployments
Remove NATing from BOSH networks #35

Open cirocosta opened 4 years ago

cirocosta commented 4 years ago

Hey,

We've recently been receiving complaints that resources like docker-image and registry-image are failing with "429 Too Many Requests".

While we did introduce retries at the resource-type level for registry-image (see https://github.com/concourse/registry-image-resource/pull/69), those using docker-image (or trying to reach dockerhub directly) still suffer from the limit being placed on our IP.

My hypothesis is that removing the NAT machine in the bosh network, which makes every request from any of our 40+ machines go out from a single IP, would get rid of the request-limit problems we're currently facing (besides removing one hop and a single point of failure).

Last week, I naively tried just removing the routes that we have set at the network level

https://github.com/concourse/prod/blob/92cf1772c3e15ff48543caa4c82c9e602d12016a/iaas/bosh.tf#L135-L153

but that didn't work, because the machines we create in the bosh network are not assigned ephemeral external IPs, which GCP requires for internet access:

"The instance must have an external IP address. An external IP can be assigned to an instance when it is created or after it has been created."

(from https://cloud.google.com/vpc/docs/vpc#internet_access_reqs)

https://github.com/concourse/prod/blob/92cf1772c3e15ff48543caa4c82c9e602d12016a/bosh/cloud_config.yml#L29-L36

Given that we're on GCP, we can overcome that by using the ephemeral_external_ip property - see https://bosh.io/docs/google-cpi/#networks.
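
A minimal sketch of what that could look like in the cloud config, assuming a manual network with a single subnet. The network name, range, AZ, and GCP network/subnetwork names below are placeholders, not the values from bosh/cloud_config.yml; only ephemeral_external_ip is the property in question.

```yaml
# Hypothetical excerpt: every name and range here is a placeholder.
networks:
- name: concourse                   # placeholder BOSH network name
  type: manual
  subnets:
  - range: 10.0.0.0/24              # placeholder range
    gateway: 10.0.0.1
    azs: [z1]                       # placeholder AZ
    cloud_properties:
      network_name: bosh            # placeholder GCP network
      subnetwork_name: bosh-subnet  # placeholder GCP subnetwork
      # Google CPI property: assign each VM its own ephemeral external IP,
      # so outbound traffic no longer depends on the NAT instance.
      ephemeral_external_ip: true
```

With this in place, the routes pointing at the NAT instance would presumably still need to be removed on the terraform side for traffic to actually leave through each VM's own IP.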

Should we do that? I think so. Unless we require those machines to be completely unreachable (not really the case for us), I think we should just drop the NAT.

Thanks!

xtreme-sameer-vohra commented 4 years ago

We could use firewall rules & tags to ensure only outbound requests are allowed from the workers.

However, we don't have any way of enforcing that those remain in place. For example, someone could remove those rules or inadvertently change the tags, network name, etc., and we wouldn't know about it.

cirocosta commented 4 years ago

"However, we don't have any way of enforcing that those remain in place"

yeah, while I do agree that that's true and that it's easy to misconfigure, I think our move to "protect the endpoints as if you were already compromised" is just inevitable, and this can be a motivator for getting better at it (w/ e.g. issues like https://github.com/concourse/concourse/issues/2415 and not exposing endpoints w/out auth in general) 🤔

(my point being that by forcing ourselves to rely less on a "perimeter of protection", we can be even more motivated to get our infra better protected against any scenario)