envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.96k stars 4.8k forks source link

`failure_mode_allow=true` is not fully supported for the grpc authz server. #34705

Closed konstantin-baidin-y42 closed 1 month ago

konstantin-baidin-y42 commented 4 months ago

Title: failure_mode_allow=true is not fully supported for the grpc authz server.

Description: There is inconsistency in how the failure_mode_allow=true option is applied to external authorization servers depending on the protocol: http and grpc.

In case of http authz server, any 5xx statuses lead to accepting the incoming request (fail-open).

In case of grpc authz server, statuses like 14 (Unavailable) lead to rejecting the incoming request. If there is any grpc proxy between envoy and authz server, then this proxy can return 14 (Unavailable) when authz server is unavailable. It leads to the fail-close behaviour even when failure_mode_allow is enabled. As a result, we have random rejections while with failure_mode_allow we want the opposite - don't reject requests because of random failures.

Expected behaviour is to use failue_mode_allow when receiveng grpc statuses like 2 (Unknown), 13 (Internal), 14 (Unavailable).

The following PR https://github.com/envoyproxy/envoy/pull/4199 was intended to handle these cases when failure_mode_allow=true and authz server is unhealthy, but it was solved for HTTP authz server only.

Repro steps:

  1. Configure envoy to have failure_mode_allow=true
  2. Make the authz server to return grpc status 14 (Unavailable)
  3. Perform http request to a service in the cluster
  4. Envoy returns 403 (fail-close), while it was expected to perform fail-open.
alyssawilk commented 4 months ago

cc @esmet @tyxia @ggreenway

ggreenway commented 4 months ago

Seems likely that these error codes are getting misinterpreted as a denied response, or hitting some alternate error path from the normal one.

konstantin-baidin-y42 commented 4 months ago

Hello, I tested it manually with logging level tracing and found that responses with error status codes are processed in the GrpcClientImpl::onSuccess callback instead of the GrpcClientImpl::onFailure. So I added a check of the status code and similar actions as in the onFailure callback. I made a PR with a draft of the fix. Can you please take a look? https://github.com/envoyproxy/envoy/pull/34951

konstantin-baidin-y42 commented 2 months ago

Since my fix was reverted in https://github.com/envoyproxy/envoy/pull/35637, can you please re-open this issue again? @ggreenway Or should I create a duplicate one?

konstantin-baidin-y42 commented 2 months ago

Also, do I understand right that this time I should create the same PR, just with the runtime guard to be false by default as I did initially?

ggreenway commented 2 months ago

You'd need to make it with a config setting to control the behavior, which defaults to the old behavior. Runtime flags are meant to be temporary.

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 month ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.