envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Support delayed deny in the RBAC network filter #33771

Closed: yangminzhu closed this issue 4 months ago

yangminzhu commented 5 months ago

Title: Support delayed deny in the RBAC network filter

Description: When the RBAC policy evaluation result is DENY, the RBAC network filter closes the TCP connection immediately. This does not work well for clients that simply retry with a new connection at a very high rate, which can overload the Envoy proxy and drive up CPU usage.

We propose extending the RBAC network filter to wait a small, configurable amount of time (for example, 500ms) before closing the TCP connection. For such clients, this limits the rate at which they can retry with new connections.

This is especially useful for the RBAC network filter because, unlike the RBAC HTTP filter, it can only close the TCP connection and has no good way to propagate the (non-retryable) permission-denied error back to the client.

This is similar to the connection_limit filter, which already implements the same kind of delayed rejection.

The following is a proposed API change to the RBAC network filter:

message RBAC {

  // The duration to wait before closing a connection whose policy evaluation result is DENY,
  // instead of closing it immediately. This prevents a malicious client from retrying as fast
  // as possible, which gives Envoy better DoS protection. If this field is not set, the
  // connection is closed immediately.
  google.protobuf.Duration delay_deny = 8;
}
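
For illustration, here is a minimal sketch of how the proposed field could be set on the RBAC network filter in a listener's filter chain. The surrounding fields follow the existing envoy.extensions.filters.network.rbac.v3.RBAC API; the stat_prefix, policy name, and SPIFFE identity are placeholders, and delay_deny itself is the proposed addition, not a field that exists today:

filters:
- name: envoy.filters.network.rbac
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.network.rbac.v3.RBAC
    stat_prefix: rbac_tcp
    rules:
      action: ALLOW
      policies:
        "allow-frontend":
          permissions:
          - any: true
          principals:
          - authenticated:
              principal_name:
                exact: "spiffe://cluster.local/ns/default/sa/frontend"
    # Proposed field: hold a denied connection for 500ms before closing it
    # instead of closing it immediately.
    delay_deny: 0.5s

Connections that do not match the ALLOW policy are denied; with the proposed field they would be closed after the configured delay rather than right away.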


yangminzhu commented 5 months ago

cc @lambdai

lambdai commented 5 months ago

Also want to add that L4 RBAC likely works with a TLS connection, which is CPU intensive (~1ms of CPU time per handshake). The delayed deny, as a starting point for back-pressure propagation, could significantly reduce CPU usage both on Envoy and on Envoy's downstream.

kyessenov commented 5 months ago

Wouldn't it be better to apply pressure earlier, e.g. by not reading bytes and starting TLS handshakes when there's a flood of connections? A delayed deny would mean Envoy has to maintain the memory structures for the connection when we'd want to shed them quickly.

yangminzhu commented 5 months ago

> Wouldn't it be better to apply pressure earlier, e.g. by not reading bytes and starting TLS handshakes when there's a flood of connections? A delayed deny would mean Envoy has to maintain the memory structures for the connection when we'd want to shed them quickly.

@kyessenov I think we will need both. The delayed deny in RBAC is specific to connections that are closed due to a permission error, and it will be more effective at reducing CPU usage on Envoy in some situations. For example, some gRPC clients retry in a busy loop when the connection is closed by RBAC, which creates a significant number of new connections on Envoy (e.g. 400 per second per client) and drives up CPU usage (we are not really worried about memory, as it does not appear to be an issue in either case).

A delayed deny naturally limits how fast the client can retry, which significantly reduces CPU usage since far fewer new connections are created and closed at the same time.
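
As a rough back-of-the-envelope illustration using the numbers above (and assuming the client waits for the server to close the connection before retrying): with an immediate close, a busy-looping client can open ~400 new connections per second; with a 500ms delay before the close, each retry cycle takes at least 0.5s, so the same client is capped at roughly 1 / 0.5s = 2 new connections per second, a ~200x reduction in connection (and TLS handshake) churn.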

kyessenov commented 5 months ago

SG, although this principle of delayed close should probably be applied uniformly: TLS handshake failures, protocol errors, etc. all fall into the same error domain. It might be better to handle it at the listener or HCM level.

CC @yanavlasov

wufanqqfsc commented 5 months ago

"When the RBAC policy evaluation result is DENY. The RBAC network filter will close the TCP connection immediately. This doesn't handle very well for some clients that just retry with a new connection at a very high rate, and that could overload the Envoy proxy to high CPU usage."

1. What is the real reason for the Envoy proxy's high CPU usage here? Just handling the DENY for some clients and closing the TCP connection? As you said, some clients don't handle this well and just retry with a new connection, so can we consider this a client issue?

2. From Envoy's point of view, maybe some protection policy or connection limit should be applied to the same client, rather than only adding a connection close delay.

lambdai commented 5 months ago

@wufanqqfsc The TLS handshake is CPU intensive, for both client and server. Imagine you have a service behind Envoy with a huge fan-in.

It is a client issue, but you don't necessarily have full control over the clients.

wufanqqfsc commented 5 months ago

> @wufanqqfsc The TLS handshake is CPU intensive, for both client and server. Imagine you have a service behind Envoy with a huge fan-in.
>
> It is a client issue, but you don't necessarily have full control over the clients.

Yes, so I mean maybe some connection limit policy can be applied to the same client, to keep that client from retrying connections that Envoy keeps closing. For example, we could limit the connection frequency for the same client if RBAC has failed within some time window. Anyway, adding a delayed deny may help, but it also costs memory to keep the connection context.

yangminzhu commented 5 months ago

> Yes, so I mean maybe some connection limit policy can be applied to the same client, to keep that client from retrying connections that Envoy keeps closing. For example, we could limit the connection frequency for the same client if RBAC has failed within some time window. Anyway, adding a delayed deny may help, but it also costs memory to keep the connection context.

Yeah, we can definitely enable different protections at multiple layers and places depending on the actual situation; it's not a one-size-fits-all solution. The memory cost should be very small, and the benefit (compared to not having the delayed deny) is well worth it.

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 4 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.