cloudfoundry / cf-deployment

The canonical open source deployment manifest for Cloud Foundry
Apache License 2.0

TLS for everything #906

Open jmprice opened 3 years ago

jmprice commented 3 years ago

What is this issue about?

There has been a lot of excellent progress in securing all CF traffic with TLS and as far as I can tell there are only a few things that are still unencrypted.
Is there a timeline or any plans for these last few things?

1) routing-api - still using both TLS and non-TLS in the cf-deployment. The http endpoint is what is registered in the router. Is there a reason for still enabling both?

2) metrics-discovery-registrar-windows - not using nats-tls hostname, falling back to 4222. We have pull request in for this one already (https://github.com/cloudfoundry/metrics-discovery-release/pull/6)

3) route_registrar - not using nats-tls

4) gorouter - not using nats-tls

What version of cf-deployment are you using?

[cf-deployment v13.19.0]

Tag your pair, your PM, and/or team!

@amhuber

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/174846740

The labels on this github issue will be updated when the story is started.

46bit commented 3 years ago

As well as the things mentioned above, my team believes that Silk (apps.internal) VXLAN traffic between cells is unencrypted.

heycait commented 3 years ago

Hi all, metric egress team here. We own the metrics-discovery-release. The issue with Windows was unintentional on our part. We are aware of it and will prioritize it soon; see https://github.com/cloudfoundry/metrics-discovery-release/pull/6. It should make it into CF-D fairly quickly after we cut a release.

davewalter commented 3 years ago

I reached out to the Networking folks in Slack for assistance with the connections related to routing and nats.

mcwumbly commented 3 years ago

@davewalter thanks for the ping. Per this discussion on cf-dev, I suggest we keep this issue open as the "canonical home" for tracking this issue going forward.

That said, I suggest we scope this issue down to TLS for all platform components, so that it can exclude this bit:

my team believes that Silk (apps.internal) VXLAN traffic between cells is unencrypted.

While it is true that the platform does not encrypt the VXLAN traffic between cells today, my take is that this is a slightly different concern, and those who are deploying apps to the platform can encrypt the traffic between their apps. I don't dispute it'd be nice if this were a built-in feature of the platform, but I do think this is a reasonable place to divide this issue.

Perhaps we can rename this one and @46bit can open another one for that feature?

For the rest of the items above, can they all technically be addressed by changes to cf-deployment? Or will they need changes in any of the respective BOSH releases, like https://github.com/cloudfoundry/routing-release? (I can probably answer some of that myself, but haven't yet taken the time to dig in).

One other note: there's this open issue on nats-release right now, which I believe may in part be caused by the fact that we do have TLS turned on between the Diego Cell route-emitters and NATS: https://github.com/cloudfoundry/nats-release/issues/25

So, we might discover that flipping all these things on has some side effects that are only seen with large deployments, and which do not rear their heads in CI.

amhuber commented 3 years ago

For the different items reported originally:

1) The routing-api can be configured to TLS only but cf-deployment is currently configuring it to listen on both HTTP and HTTPS endpoints (https://github.com/cloudfoundry/cf-deployment/blob/master/cf-deployment.yml#L1039). We haven't been able to find anything that still connects directly to the HTTP endpoint, and having both endpoints enabled appears to force the routing-api to register its HTTP interface with gorouter instead of the HTTPS one.

2) We've already submitted a pull request for this as listed.

3) The route_registrar already supports TLS connections to nats but it's not enabled in cf-deployment (https://github.com/cloudfoundry/routing-release/blob/develop/jobs/route_registrar/templates/registrar_settings.json.erb#L106).

4) I haven't found code in gorouter to support TLS connections to nats so I think this one actually requires some code changes to fix.

For what it's worth, regarding your mention of the nats issue, I think having nats currently split between TLS and non-TLS makes issues like that more likely. Right now cf-deployment is using 2 different nats daemons per VM, replicating all messages between the TLS and non-TLS endpoints, with clients split between both. It would be far more stable (and less resource intensive) to complete the move to nats-tls only and get rid of the duplicate daemons listening on the non-TLS endpoints and the requisite replication between them.
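
To tell the two kinds of nats daemon apart from a client's point of view, one option is to look at the `INFO` line a NATS server sends when a connection opens, which carries a `tls_required` flag in its JSON payload. The following is a minimal sketch of that check, not code from any CF release: the function name and both sample lines are made up for illustration, and it only parses a line you have already captured rather than opening a connection.

```python
import json


def nats_info_requires_tls(info_line: str) -> bool:
    """Return True if a NATS INFO protocol line advertises tls_required."""
    # An INFO line looks like: INFO {json options}
    verb, _, payload = info_line.strip().partition(" ")
    if verb.upper() != "INFO":
        raise ValueError("not a NATS INFO line")
    options = json.loads(payload)
    # A missing flag is treated as "TLS not required".
    return bool(options.get("tls_required", False))


# Hypothetical INFO lines, standing in for a non-TLS and a TLS daemon.
plain_info = 'INFO {"server_id":"a","port":4222,"tls_required":false}'
secure_info = 'INFO {"server_id":"b","tls_required":true}'

for line in (plain_info, secure_info):
    print(nats_info_requires_tls(line))
```

Any client that still connects to a daemon reporting `tls_required: false` is one of the stragglers that would have to move before the non-TLS daemons could be removed.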

ameowlia commented 3 years ago

Hi all,

I am an engineer on the CF for VMs networking team.

The work for "tls everywhere" is currently not a priority for our team since there is a workaround (ipsec). However, we would be happy to support it if any community members wanted to PR it in.

To the best of my knowledge, the following paths related to routing, cf-networking, and silk are not encrypted:

I don't know what work there is (if any) for other components. Additionally, the following routes to components are not encrypted by default:

There may be more unencrypted routes to components; I only checked on a pretty minimal test env. You can check by looking at the routes on the router and seeing which are tls: false.
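
As a rough sketch of that check: assuming the router's route table has been dumped to JSON as a map from route URI to a list of backend entries, each with an `address` and a `tls` flag, the plaintext entries can be filtered out as below. The exact shape of the dump, the function name, and the sample data are assumptions for illustration, so adjust the field names to whatever your router actually returns.

```python
from typing import Dict, List, Tuple


def plaintext_routes(route_table: Dict[str, List[dict]]) -> List[Tuple[str, str]]:
    """List (uri, address) pairs whose backend entry is not marked tls: true."""
    found = []
    for uri, endpoints in sorted(route_table.items()):
        for endpoint in endpoints:
            # A missing "tls" key is treated the same as tls: false.
            if not endpoint.get("tls", False):
                found.append((uri, endpoint.get("address", "?")))
    return found


# Hypothetical route table with one plaintext and one TLS backend.
sample_routes = {
    "api.example.com/routing": [{"address": "10.0.0.5:3000", "tls": False}],
    "uaa.example.com": [{"address": "10.0.0.6:8443", "tls": True}],
}

print(plaintext_routes(sample_routes))
# prints [('api.example.com/routing', '10.0.0.5:3000')]
```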

amhuber commented 3 years ago

Setting aside the technical details for a moment, I had hoped at least there was consensus that it's not OK to have unencrypted sensitive traffic on any network. IPSec is a workaround, but it's painful and expensive because we end up double encrypting most traffic just to encrypt the few stragglers that aren't using TLS yet. This was previously communicated to the community as a high priority from a security perspective, so I'm hoping we can get back to a place where teams are working on closing the few remaining gaps that exist. It seems odd for so many people to have put effort into moving huge parts of the platform to mTLS just to leave the whole platform vulnerable to simple attacks. For example, right now anyone with network access to a CF foundation can trivially own nats, as the password is in clear text on the network, and then cause whatever routing chaos they want.

jmprice commented 3 years ago

Additionally for those of us who are also running Windows cells as part of our foundations, IPsec encryption is not supported for Windows containers (https://github.com/Microsoft/hcsshim/issues/244) so we are stuck on Windows 2012R2 which is not only an aging OS but is no longer being actively maintained or supported as part of CF. At some point, and I fear sooner than later, we are going to reach an impasse where we must upgrade to Win2019 and can no longer use IPsec.

voelzmo commented 3 years ago

Regarding the topic of using IPSec as a workaround, I'd like to point out that this might be true for products and commercial distributions, but I'm not aware of an open-source solution solving this issue. So from a vendor perspective, it might be true that there is a workaround available, but if we're talking about open source, this does not seem to be the case.

@ameowlia: can you help us understand how we can get this work prioritized in the open source team? It might be helpful to get an understanding what the conflicting priorities are and when it would be possible to prioritize getting TLS everywhere in.

ameowlia commented 3 years ago

Hi all,

I hear your pain.

@jmprice, regarding Windows, I believe the only part that was not encrypted is route emitter to nats. As of diego-release v2.41.0, that is now encrypted. Additionally, I know they worked on adding a sidecar to perform TLS from gorouter to the app container (though this might only be available on newer versions of Windows). Thus, if ipsec is used everywhere for Linux, then the entire foundation should be encrypted. Please correct me if I am wrong here; I am not very involved in the Windows world.

@voelzmo, my mistake, you are correct. I forgot that SAP decided to sunset OS ipsec.

I am tagging some people who make these platform wide decisions on the vmware side: @dieucao, @zrob, @emalm, @dsabeti, @mkocher

✨ of course we always welcome PRs from the community

ameowlia commented 3 years ago

Hi @plowin,

Regarding your question at https://github.com/cloudfoundry/routing-release/issues/185#issuecomment-709776891: this issue contains more information and is the best place to ask questions.

voelzmo commented 3 years ago

Hi there, it's been a while. I'm still looking to understand this

can you help us understand how we can get this work prioritized in the open source team? It might be helpful to get an understanding what the conflicting priorities are and when it would be possible to prioritize getting TLS everywhere in.

especially now that we've come to a common understanding that the initial assumption that there exists a workaround doesn't hold true.

I appreciate the "PRs welcome" message, however, given that we don't have people with context on the codebase for the involved projects, this will most likely remain a theoretical option, sorry.

voelzmo commented 3 years ago

Just another ping after 3 more weeks have passed. Can we talk about this, either in this ticket or in a direct meeting?

ameowlia commented 3 years ago

Hi @voelzmo ,

I appreciate your persistence 😅

The people you really want to reach out to about this are @dsabeti & @dsboulder.

@dsabeti and @dsboulder, I think doing this work would be a huge win for security and ease-of-use (getting off of ipsec). For the non-TLS connections that we know about I suspect it will take a pair a month of work to complete (as long as they aren't distracted by other things that come up).

amhuber commented 3 years ago

Just as an FYI, I tried setting routing_api.enabled_api_endpoints to just "mtls" in a test environment and the routing-api did start up and was no longer listening on the HTTP port (3000) as expected, but the deployment was not functional. The DNS healthcheck is configured to listen on the HTTP port so it fails (https://github.com/cloudfoundry/routing-release/blob/develop/jobs/routing-api/templates/dns_health_check.erb#L3) and the routing-api still registered the api.system_domain/routing route using port 3000 with TLS disabled.

Is there any plan to resolve these issues so the platform is fully encrypted on the wire by default?
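
For anyone wanting to repeat that experiment, here is a minimal sketch that generates a BOSH ops-file entry setting `routing_api.enabled_api_endpoints` to `mtls`. The property name, the `routing-api` job name, and the `mtls` value come from the comment above; the helper name and the instance group passed in (`api` here) are assumptions, so check which instance group hosts the job in your manifest. The output is JSON, which YAML parsers generally accept.

```python
import json


def routing_api_mtls_only_ops(instance_group: str, job: str = "routing-api") -> list:
    """Build a one-entry ops-file that restricts routing-api to its mTLS endpoint."""
    path = (
        f"/instance_groups/name={instance_group}"
        f"/jobs/name={job}"
        "/properties/routing_api/enabled_api_endpoints"
    )
    # "replace" is the standard ops-file operation for setting a value.
    return [{"type": "replace", "path": path, "value": "mtls"}]


# "api" is an assumed instance group name, used only for illustration.
ops = routing_api_mtls_only_ops("api")
print(json.dumps(ops, indent=2))
```

As described above, applying this alone leaves the deployment non-functional, because the DNS healthcheck and the registered route still point at the HTTP port.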

ameowlia commented 3 years ago

Hi @amhuber , that sounds like a bug with existing functionality. Can you write up a github issue for routing-release for this issue?

voelzmo commented 3 years ago

Hi @jenspinney, @dsboulder, @dsabeti,

This issue is about half a year old now. Several members of the community have shown interest in this getting fixed, some have invested a fair amount of work into digging in the details, trying out existing things (and reporting back where they didn't work) and even providing PRs for enabling TLS in some places.

We haven't seen any communication on

Do we have a common understanding that this is an important, valuable, and necessary thing to do? Can you help me understand where we have a different perspective on this issue? I'm happy spending some time talking about this in a meeting, if you prefer this – but we should still keep this issue updated in order to be transparent for everyone in the community.

PS: Kudos to @ameowlia for being the person visibly invested in this, thanks for staying engaged and dedicating some of your time for this!

ameowlia commented 3 years ago

Hi @voelzmo,

Thank you for your persistence. It's my feeling that prioritizing big things like this has been put on hold until the new CFF governance stuff has been worked out, which hopefully will wrap up in the next few months.

When the CFF changes happen I plan to be involved in the networking technical group (which I assume will include routing as well). Hopefully you will join me 😄 ? Then we can work to prioritize and make the changes needed for this.

I would still love for @jenspinney, @dsabeti, and @dsboulder to comment with their perspectives.

46bit commented 3 years ago

NATS--Gorouter is now encrypted as of https://github.com/cloudfoundry/cf-deployment/pull/925.

Routing API can be switched to HTTPS if someone interested carries on from my commits on https://github.com/cloudfoundry/routing-release/issues/193.

46bit commented 2 years ago

I've unassigned myself as my work on this is now complete. Sadly I'm only paid to address items affecting my team. 😢 Silk traffic and one routing-api endpoint are still plaintext.

amhuber commented 1 year ago

The gorouter -> routing-api traffic is now using mTLS after https://github.com/cloudfoundry/cf-deployment/pull/1014 and https://github.com/cloudfoundry/routing-release/pull/300.