Closed yangminzhu closed 4 months ago
cc @lambdai
Also want to add that L4 RBAC is likely to be working with a TLS connection, which is CPU intensive (1ms CPU time). Using the delayed deny as a starting point for back pressure propagation could greatly reduce CPU usage on both Envoy and Envoy's downstream.
Wouldn't it be better to apply pressure earlier, e.g. by not reading bytes and starting TLS handshakes when there's a flood of connections? A delayed deny would mean Envoy has to maintain the memory structures for the connection when we'd want to shed them quickly.
@kyessenov I think we will need both. The delayed deny in RBAC is specific to connections that are closed due to a permission error, and it will be more effective at reducing CPU usage on Envoy in some situations. For example, some gRPC clients retry in a busy for-loop when the connection is closed by RBAC. This creates a significant number of new connections (e.g. 400 per second per client) on Envoy and drives up CPU usage (we are not really worried about memory, as it doesn't look to be an issue in either case).
A delayed deny will naturally reduce how fast the client retries, and so significantly reduce CPU usage, since far fewer connections are being created and closed at the same time.
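To make the effect on retry rate concrete, here is a rough back-of-the-envelope sketch. It assumes a client that retries strictly sequentially, and the 2.5 ms per-attempt cost is an assumption picked only so the no-delay case lines up with the "400 per second per client" figure above. `retries_per_second` is a hypothetical helper, not anything in Envoy.

```python
def retries_per_second(connect_cost_s: float, deny_delay_s: float = 0.0) -> float:
    """Upper bound on the reconnect rate of one client that retries sequentially.

    Each attempt costs connect_cost_s (TCP connect, TLS handshake, deny) plus
    however long the server holds the denied connection open before closing it.
    """
    return 1.0 / (connect_cost_s + deny_delay_s)


# Assumed per-attempt cost: 2.5 ms, chosen to match roughly 400 attempts/s.
immediate = retries_per_second(0.0025)
# Same client, but the server waits 500 ms before closing a denied connection.
delayed = retries_per_second(0.0025, deny_delay_s=0.5)

print(f"immediate close: {immediate:.0f} attempts/s")
print(f"500 ms delayed close: {delayed:.2f} attempts/s")
```

Under these assumptions the delay dominates the per-attempt cost, so each such client is capped at about two attempts per second regardless of how cheap the handshake is for it.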
SG, although this principle of delayed close should probably be applied uniformly: TLS handshake failures, protocol errors, etc. all fall in the same error domain. It might be better to handle it at the listener or the HCM level.
CC @yanavlasov
"When the RBAC policy evaluation result is DENY, the RBAC network filter will close the TCP connection immediately. This doesn't work well with clients that just retry with a new connection at a very high rate, which could overload the Envoy proxy with high CPU usage."
1. What's the real reason for the high CPU usage on the Envoy proxy? Is it just from handling the DENY for some clients and closing the TCP connection? As you said, some clients don't handle this well and just retry with a new connection. So can we consider this a client issue?
2. From Envoy's point of view, maybe some protection policy or connection limit should be applied per client, rather than only adding a connection close delay.
@wufanqqfsc The TLS handshake is CPU intensive, for both client and server. Imagine you have a service behind Envoy with huge fan-in.
It is a client issue, but you don't necessarily have full control of the clients.
Yes, so I mean maybe some connection limit policy could be applied per client, to stop the same client from repeatedly reconnecting and being closed by Envoy. For example, we could limit the connection frequency for a client if its RBAC checks have failed within some time window. Anyway, adding a delayed deny may help, but it also costs memory to keep the connection context.
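As an illustration of the per-client idea, here is a minimal sketch of a sliding-window limiter keyed by client address. This is not an existing Envoy feature; `DenyRateLimiter`, `max_denies`, and `window_s` are hypothetical names, and timestamps are passed in explicitly to keep the example deterministic.

```python
from collections import deque


class DenyRateLimiter:
    """Hypothetical limiter: refuse new connections from a client that was
    denied max_denies times or more within the last window_s seconds."""

    def __init__(self, max_denies: int, window_s: float):
        self.max_denies = max_denies
        self.window_s = window_s
        self._denies = {}  # client address -> deque of deny timestamps

    def record_deny(self, client: str, now: float) -> None:
        """Remember that this client was just denied by policy."""
        self._denies.setdefault(client, deque()).append(now)

    def allow(self, client: str, now: float) -> bool:
        """Return True if a new connection from this client may proceed."""
        stamps = self._denies.get(client)
        if not stamps:
            return True
        # Drop deny events that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        if not stamps:
            del self._denies[client]  # keep memory bounded for quiet clients
            return True
        return len(stamps) < self.max_denies


limiter = DenyRateLimiter(max_denies=3, window_s=10.0)
for t in (0.0, 1.0, 2.0):
    limiter.record_deny("10.0.0.7", now=t)

blocked_now = not limiter.allow("10.0.0.7", now=3.0)
other_client_ok = limiter.allow("10.0.0.8", now=3.0)
allowed_later = limiter.allow("10.0.0.7", now=10.5)
```

Note that this still keeps per-client state, so it trades the memory of held connections for the memory of the deny history.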
Yeah, we can definitely enable different protections at multiple layers and places depending on the actual situation; it's not a one-size-fits-all solution. The memory cost should be very small, and the benefit (compared to not having the delayed deny) is well worth it.
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.
Title: Support delayed deny in the RBAC network filter
Description: When the RBAC policy evaluation result is DENY, the RBAC network filter closes the TCP connection immediately. This doesn't work well with clients that just retry with a new connection at a very high rate, which could overload the Envoy proxy with high CPU usage.
We propose extending the RBAC network filter to wait a small, configurable amount of time (for example, 500ms) before closing the TCP connection. For those clients, this will limit the rate at which they retry with a new connection.
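The behavior can be illustrated outside Envoy with a small asyncio sketch. This is not the filter's implementation: `is_allowed` is a hypothetical stand-in for policy evaluation that denies every peer, and `measure_close_latency` is a helper added only to observe how long a denied client waits before it sees the close.

```python
import asyncio

DENY_DELAY_S = 0.5  # assumed default, taken from the 500ms example above


def is_allowed(peer) -> bool:
    """Hypothetical stand-in for RBAC policy evaluation; denies every peer."""
    return False


async def handle(reader, writer, delay_s=DENY_DELAY_S):
    peer = writer.get_extra_info("peername")
    if is_allowed(peer):
        writer.write(b"ok\n")
        await writer.drain()
    else:
        # Hold the denied connection open briefly instead of closing at once,
        # so a client that reconnects in a tight loop is paced by the delay.
        await asyncio.sleep(delay_s)
    writer.close()
    await writer.wait_closed()


async def measure_close_latency(delay_s: float) -> float:
    """Connect once to a local server and time how long until it closes."""
    server = await asyncio.start_server(
        lambda r, w: handle(r, w, delay_s), "127.0.0.1", 0
    )
    port = server.sockets[0].getsockname()[1]
    loop = asyncio.get_running_loop()
    start = loop.time()
    reader, writer = await asyncio.open_connection("127.0.0.1", port)
    await reader.read()  # returns once the server closes the connection
    elapsed = loop.time() - start
    writer.close()
    await writer.wait_closed()
    server.close()
    return elapsed


if __name__ == "__main__":
    print(f"closed after {asyncio.run(measure_close_latency(0.1)):.3f}s")
```

While the handler sleeps, the server only holds the socket and a timer for that connection, which is the memory cost discussed in the thread.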
This is especially useful for the RBAC network filter, because unlike the RBAC HTTP filter, it can only close the TCP connection and doesn't have a good way to propagate back the permission denial error (that is not retryable) to the client.
This is the same feature as in connection_limit, which also implements delayed rejection.
The following is a proposed API change to the RBAC network filter: