envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.94k stars 4.8k forks source link

when `network.rbac` filter is used in combination with `tcp_proxy`, connection to the upstream is established in any case, even if RBAC decision is `Deny` #9023

Open yskopets opened 4 years ago

yskopets commented 4 years ago

Title: when network.rbac filter is used in combination with tcp_proxy, connection to the upstream is established in any case, even if RBAC decision is Deny

Description:

Repro steps:

mattklein123 commented 4 years ago

The issue is that in the network RBAC filter the check is being done in onData() and it should be done in onNewConnection(). This is so broken that I can't imagine anyone is actually using this. At least I hope not. :scream:

yskopets commented 4 years ago

@mattklein123 I could take this issue

kyessenov commented 4 years ago

/cc @yangminzhu

rshriram commented 4 years ago

The check needs to be done in both. Not just on new connection, as we have codecs that can decode the payloads

PiotrSikora commented 4 years ago

@yangminzhu don't we need this fixed for Istio 1.5?

yangminzhu commented 4 years ago

The issue is that in the network RBAC filter the check is being done in onData() and it should be done in onNewConnection(). This is so broken that I can't imagine anyone is actually using this. At least I hope not. scream

A bit more background on this, we had a more detailed discussion in a private email right after this issue was filed. Note that the check is actually added to onData() intentionally because the x509 certificate is not available in onNewConnection(). (we followed the same pattern in other filter). In other words, we cannot fix just by moving the check to onNewConnection() because it will then always reject a connection if x509 is used in the API. The real fix is to change the behavior of the onNewConnection() and make its x509 certificate available in the callback, this requires some re-design of the Envoy core framework (not in the RBAC filter).

The impact of the issue is limited to server-first protocol which is probably not a very common use case here. I think we had some rough consensus that the severity is relative low and it's okay to fix this in public. @mattklein123 @htuch please let me know if you have other thoughts, thank you.

@yangminzhu don't we need this fixed for Istio 1.5?

@PiotrSikora I think it's unlikely to get this fixed for Istio 1.5 due to the complexity of the fix and the short time (about 3 weeks) of the 1.5 release. @yskopets please let me know if you are still working on this. I can take this if you're not going to work on this any soon. I will only have time to look at this after 1.5 release and will need some guidance from the Envoy maintainers because I'm not familiar with the framework related to the onNewConnection() in Envoy. (if someone with more experience/time could take this, please feel free to go ahead :) )

mattklein123 commented 4 years ago

The impact of the issue is limited to server-first protocol which is probably not a very common use case here. I think we had some rough consensus that the severity is relative low and it's okay to fix this in public. @mattklein123 @htuch please let me know if you have other thoughts, thank you.

Yes I think we should just fix this in OSS. I agree the fix is not simple and will take some time to get right. I'm happy to work with whoever is going to fix this to provide design help.

dilyevsky commented 3 years ago

@mattklein123 maybe naive question but can tcpProxy's upstream connection selection (initializeUpstreamConnection) be initiated lazily on first onData instead of onNewConnection? That way RBAC and other preceding filters can send StopIteration from onNewConnection and do required processed within onConnectionEvent or even onData.

mattklein123 commented 3 years ago

@mattklein123 maybe naive question but can tcpProxy's upstream connection selection (initializeUpstreamConnection) be initiated lazily on first onData instead of onNewConnection?

Unfortunately not, as this will break server send first protocols (yes they exist, e.g., MySQL). This is the way the code used to work way back in the day before we figured out the MySQL, etc. issue. The only possible fix here is to change how onNewConncetion() works to wait until after the TLS handshake completes.

kyessenov commented 1 year ago

This is also unfortunately affecting internal listener routing as well. Ideally, we want to derive some x509-based attributes before establishing an internal connection (e.g. compute the principal, prior to re-using a pooled HTTP internal connection).

kyessenov commented 1 year ago

@mattklein123 Sorry to resurrect this old thread, but I think there's a fairly simple fix: we can just wait to read data by the chain until Connected is raised. There's a visible side effect that whatever work was done in onNewConnection will be done later, with extra latency I suppose. Should I just make this fix and deal with possible downstream effects? Or should we just add another callback onConnected that happens before onData but after onNewConnection?

mattklein123 commented 1 year ago

@mattklein123 Sorry to resurrect this old thread, but I think there's a fairly simple fix: we can just wait to read data by the chain until Connected is raised. There's a visible side effect that whatever work was done in onNewConnection will be done later, with extra latency I suppose. Should I just make this fix and deal with possible downstream effects? Or should we just add another callback onConnected that happens before onData but after onNewConnection?

Sorry I've lost the thread. Can you summarize the issue again and how the fix works and doesn't break MySQL?

kyessenov commented 1 year ago

@mattklein123

The issue is that onNewConnection is called before the TLS handshake completes. What I'm proposing is that we wait for TLS handshake to complete instead of calling it immediately when the TLS socket is created. This can be done by waiting for the Connected connection event, which TLS socket correctly sends.

I think this should work fine for server-first protocols:

mattklein123 commented 1 year ago

At face value it sounds like that should work, though I'm not sure why we wouldn't have considered this before so maybe there is some edge case we are not thinking about? But we can definitely try it.

kyessenov commented 1 year ago

Ok, then I will open a PR after I fix the test breakages (looks like mocks need to be updated). I don't understand why this wouldn't work, so if anyone remembers any subtle issues, let me know :)

kyessenov commented 1 year ago

I found a pathological event sequence because tcp_proxy disables connection socket reads before establishing an upstream connection. That means the server SSL handshake does not complete (since it needs to read) before the upstream is selected, which requires onNewConnection. @mattklein123 Do you happen to remember why tcp_proxy does this?

I think what we can do is to enable reads before getting the server transport socket Connected event. So it would be a bit more complicated:

There's probably some issue I'm missing with the above, maybe we over-read the handshake so we'd need some extra read buffer...

mattklein123 commented 1 year ago

@mattklein123 Do you happen to remember why tcp_proxy does this?

Sorry I have no idea. :( The only thing I can quickly think of it has to do with flow control and not wanting to keep reading before the upstream is ready. Perhaps we have to check whether it's TLS so that the handshake processor will deal with bad data? And for raw TCP do what we do today? cc @alyssawilk @ggreenway

PiotrSikora commented 1 year ago

There is some additional discussion / edge case in https://github.com/envoyproxy/envoy/issues/2800.

kyessenov commented 1 year ago

@PiotrSikora thanks, early data seems like something that requires a read buffer if we want to delay upstream connection until the handshake completes. Will need to look into details of the SSL socket.

ggreenway commented 1 year ago

There are a number of subtleties I remember, although I don't recall all the details:

alyssawilk commented 1 year ago

coming back to this, any reason we didn't just have a config option / runtime guard to establishUpstreamConnection when the data reached the TCP proxy session?

kyessenov commented 1 year ago

@alyssawilk I think we could add such an option to tcp_proxy. But today, the downstream TLS handshake does not start/finish until upstream is selected (tcp_proxy has to call read for the transport socket which it doesn't until after upstream is selected). Issues (3) and (4) in @ggreenway's post are present AFAICT. Waiting on data has to account for server-first protocols BTW, so it's really about timing the connection establishment.

alyssawilk commented 1 year ago

(3) "we don't want to cancel a partially complete TLS handshake if the upstream fails" why? Until we have upstream selection retry we're going to abort if there's no upstream. But in any case, delaying until we have data means we'd wait until the downstream handshake is complete so I think that satisfies Greg's (3) (4) "We may not want to establish an upstream connection after just receiving a TLS Client Hello" I think my suggestion would improve this as we wouldn't establish a connection until we went through the full filter chain.

I think neither 3 or 4 are problematic but that probably means I'm misunderstanding something =P

kyessenov commented 1 year ago

@alyssawilk How would this suggestion to establish upstream in onData work for server-first protocols? We can't wait for downstream data to establish an upstream connection.

To add more context: this is indeed a non-issue for request-response protocols. For request-response protocols, the issue I'm seeing is that we can't use TLS attributes in picking upstream, which complicates some "tunneling" cases with the internal listener upstream, for listeners with multiple local identities.

alyssawilk commented 1 year ago

the original concern was establishing upstream connection for RBAC block. RBAC happens onData, so for that use case I don't see a way you could block on RBAC processing onData and handle upstream-writes-first connections at all. That's why I was suggesting it be configurable - folks who know downstream sends first and wait to not stablish on RBAC block can configure that option (though yeah would have be to be a permanent not runtime knob)

kyessenov commented 1 year ago

@alyssawilk I think the fact that RBAC happens in onData was a workaround for the issue that TLS handshake data is not available in onNewConnection. It seems best to apply it as early as possible at first, so that malicious clients don't consume inordinate amount of Envoy time and resources (as @ggreenway suggested, creating upstream with TLS is much more expensive than just opening a client connection).

LukeLaScala commented 1 year ago

Is there a workaround for this currently? Would the same issue be present in external auth?

alyssawilk commented 1 year ago

AFIK no :-(