envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

custom dynamic forward proxy injected in istio proxy fails to update DNS cache on IP change #34721

Closed ha62791 closed 2 months ago

ha62791 commented 3 months ago

I have applied a custom EnvoyFilter in Kubernetes with Istio, as described below.

Kubernetes version: 1.26
Istio version: 1.17.2

The purpose is to decrypt HTTPS traffic, do some extra handling such as injecting an extra HTTP header, and then send it on to the target destination again via TLS. The expectation is that the filter always forwards the request to the latest IP of the destination host when the upstream DNS server starts returning a new IP. It is fine for the filter to have its own DNS cache, as long as the async DNS lookup process running behind it keeps that cache updated with the latest IP.

However, the actual result is that the dynamic forward proxy keeps forwarding requests to the old IP until its DNS cache entry expires, even though it runs its own async DNS lookup process behind the scenes to keep the cache up to date.

The setting is like below:

testing-yml.zip

EnvoyFilter (see attached envoy-filter-add-header-test.yml): the filter listens on localhost:5443. Any TLS traffic matching *.xxxx.yyyy.com is decrypted and has a header injected by my custom Lua code. The request is then routed to "my_custom_dynamic_forward_proxy", which re-encrypts the traffic and sends it to the target destination. The filter configuration also outputs a second log so I can check which destination IP the dynamic forward proxy is sending to.
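For context, a cluster of this kind typically looks like the following minimal sketch. The cluster name matches the one mentioned above; everything else (DNS cache name, lookup family, TLS context) is an assumption, since the attached YAML is not reproduced here:

```yaml
clusters:
- name: my_custom_dynamic_forward_proxy
  # CLUSTER_PROVIDED is required for the dynamic forward proxy cluster type
  lb_policy: CLUSTER_PROVIDED
  cluster_type:
    name: envoy.clusters.dynamic_forward_proxy
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.clusters.dynamic_forward_proxy.v3.ClusterConfig
      dns_cache_config:
        name: dynamic_forward_proxy_cache_config
        dns_lookup_family: V4_ONLY
  # Re-encrypt before sending to the external destination
  transport_socket:
    name: envoy.transport_sockets.tls
    typed_config:
      "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
```

The HTTP connection manager must also carry the envoy.filters.http.dynamic_forward_proxy filter referencing the same dns_cache_config name, so the filter and cluster share one DNS cache.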

To test this filter, I have applied a ServiceEntry (see attached my-test-service-entry.yml). It routes any traffic to *.xxxx.yyyy.com to localhost:5443 for processing by the custom filter.

Below is what I have found about the filter's behaviour:

The destination endpoint test.xxxx.yyyy.com has two IPs: one active (14.22.5.6) and one standby (10.2.3.4). The upstream DNS server that CoreDNS points to fails over to the standby IP for test.xxxx.yyyy.com when the active one is unhealthy.

When I run a curl in the application container to e.g. test.xxxx.yyyy.com, I can see from the CoreDNS log that there is a repeating DNS lookup for test.xxxx.yyyy.com from the pod, running every 5 or 30 seconds. If I do nothing after the first curl, the repeating DNS lookup stops after 5 minutes, which matches the default value of host_ttl in the DNS config of the dynamic forward proxy. If I do some extra curls to the same endpoint within the first 5 minutes (e.g. keep curling for 2 minutes), the repeating DNS lookup does not stop at exactly 5 minutes but at 7 minutes, i.e. 5 minutes after the last curl.
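The timing observed above lines up with the dynamic forward proxy's DNS cache settings: dns_refresh_rate drives the repeating background lookup, and host_ttl (default 5 minutes) controls how long a host with no traffic stays in the cache before being evicted. A sketch of how these are tuned, with illustrative values rather than values from the attached config:

```yaml
dns_cache_config:
  name: dynamic_forward_proxy_cache_config
  dns_lookup_family: V4_ONLY
  dns_refresh_rate: 5s   # interval of the background re-resolution seen in the CoreDNS log
  host_ttl: 300s         # a host with no traffic for this long is evicted (default 5m)
```

Note that host_ttl governs eviction of idle hosts, not how quickly in-use connections pick up a changed IP.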

My assumption is that the forward proxy has its own DNS cache, and the repeating DNS lookup keeps its cache table up to date with the latest IP of test.xxxx.yyyy.com.

For the failover test, I restarted the application pod so that the DNS cache of the dynamic forward proxy in the istio-proxy container would be cleared. Then I started to curl test.xxxx.yyyy.com from the app container, repeating every 1 second. Every curl outputs two logs from the istio-proxy container: log1 is from the Istio proxy default logging and shows the request from the app container being routed to localhost:5443 of the istio-proxy container; log2 is from the custom EnvoyFilter configuration and shows the request being forwarded to the external destination IP by the dynamic forward proxy.

During the repeating curls, within the first 5 minutes I stopped the active side of test.xxxx.yyyy.com, so the upstream DNS server started returning the standby IP to CoreDNS, which in turn returned it to the application container and the istio-proxy.

Around 30 seconds after the failover, I start to see "downstream_local_address" in log1 switch to the standby IP (10.2.3.4), which indicates CoreDNS is already returning the new IP to DNS queries from the app pod. However, "upstream_host" in log2 shows the dynamic forward proxy still forwarding requests to the old IP (14.22.5.6), and the standby side of test.xxxx.yyyy.com still receives no inbound requests. (See log1 and log2 below.)

Failover never succeeds as long as the repeating curls continue. It only works after I stop the repeating curls and wait 5 minutes after the last curl, at which point the repeating 5/30-second DNS lookups from the filter also stop; if I then curl again, the logs show the filter forwarding to the standby IP.

From this I conclude that the async DNS lookup process running behind the proxy does not keep the cache record up to date. I also tried setting host_ttl to 1s, but failover still fails: it seems to need 5-6 seconds of idle time for the cache entry to expire, and it is unreasonable to require the application to stop sending requests just so the cache can expire and failover can happen.

log1: 2024-06-03T04:03:46.700082870Z { "x_forwarded_for": null, "request_header_bytes": 0, "path": null, "duration": 20, "connection_termination_details": null, "xff_srcport": null, "upstream_service_time": null, "x_spiffe_header": null, "response_flags": "-", "bytes_received": 743, "user_agent": null, "upstream_transport_failure_reason": null, "requested_server_name": "test.xxxx.yyyy.com", "downstream_remote_address": "10.42.4.164:42488", "method": null, "request_id": null, "response_header_bytes": 0, "protocol": null, "response_code_details": null, "upstream_cluster": "outbound|443||*.xxxx.yyyy.com", "bytes_sent": 4426, "start_time": "2024-06-03T04:03:46.531Z", "upstream_local_address": "127.0.0.1:45362", "authority": null, "downstream_local_address": "10.2.3.4:443", "attempt-count": null, "upstream_host": "127.0.0.1:5443", "route_name": null, "response_code": 0 }

log2: 2024-06-03T04:03:46.699763214Z { "x_forwarded_for": null, "request_id": "ce45dc56-34da-4d86-98ac-a9f1e0cff0e9", "requested_server_name": "test.xxxx.yyyy.com", "path": "/", "method": "GET", "response_flags": "-", "response_code_details": "via_upstream", "user_agent": "curl/7.61.1", "connection_termination_details": null, "downstream_local_address": "127.0.0.1:5443", "authority": "test.xxxx.yyyy.com", "upstream_transport_failure_reason": null, "start_time": "2024-06-03T04:03:46.543Z", "upstream_host": "14.22.5.6:443", "downstream_remote_address": "127.0.0.1:45362", "upstream_local_address": "10.42.4.164:46588", "upstream_service_time": "6", "duration": 6, "upstream_cluster": "my_custom_dynamic_forward_proxy", "protocol": "HTTP/1.1", "response_code": 503, "bytes_received": 0, "bytes_sent": 19, "route_name": null }

alyssawilk commented 3 months ago

If the overall concern is "Envoy keeps sending to old addresses on DNS change", that's true: we don't drain old connections, so new connections go to the new address while existing connections keep routing to the old address. You can configure max streams per connection or max connection lifetime if you want to force drains. Currently there's no option to drain on re-resolve, and it would likely perform quite poorly: many endpoints publish multiple addresses, so you'd end up thrashing back and forth and paying the cold-start cost repeatedly.
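For reference, the two knobs mentioned map to Envoy's upstream HttpProtocolOptions, set per cluster. A sketch with illustrative values (the durations and counts are assumptions, not recommendations):

```yaml
typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    "@type": type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    common_http_protocol_options:
      max_connection_duration: 30s      # drain and re-establish upstream connections after 30s
      max_requests_per_connection: 100  # cap streams per connection, forcing periodic reconnects
    explicit_http_config:
      http_protocol_options: {}         # upstream speaks HTTP/1.1 in this sketch
```

Either limit forces connections to be recycled periodically, so new connections get re-established against whatever the DNS cache currently holds, bounding how long traffic can stick to a stale IP.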

ha62791 commented 3 months ago

@alyssawilk

May I know which configuration options control max streams per connection and max connection lifetime?

Are there other alternatives, or any option to use the downstream_local_address as the target IP to forward to after the custom decryption and extra handling?

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 2 months ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.