dnslookupfamily returns ipv6 addresses for external clusters (oidc)

zetaab commented 5 days ago

Description:


I compiled a new version from the latest master, and our OIDC is now broken.

[2024-11-20 07:57:48.870][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:391] dns resolution for cognito-idp.eu-central-1.amazonaws.com started
[2024-11-20 07:57:48.876][1][debug][dns] [source/extensions/network/dns_resolver/cares/dns_impl.cc:308] dns resolution for cognito-idp.eu-central-1.amazonaws.com completed with status 0
[2024-11-20 07:57:48.876][1][debug][upstream] [source/common/upstream/upstream_impl.cc:484] transport socket match, socket default selected for host with address [2a05:d014:32e:701:9334:4719:42de:9263]:443
[2024-11-20 07:57:48.876][1][debug][upstream] [source/common/upstream/upstream_impl.cc:484] transport socket match, socket default selected for host with address [2a05:d014:32e:700:f4dc:9de:938f:1329]:443
[2024-11-20 07:57:48.876][1][debug][upstream] [source/common/upstream/upstream_impl.cc:484] transport socket match, socket default selected for host with address [2a05:d014:32e:702:b316:2916:8253:ddff]:443
[2024-11-20 07:57:48.876][1][debug][upstream] [source/extensions/clusters/strict_dns/strict_dns_cluster.cc:201] DNS refresh rate reset for cognito-idp.eu-central-1.amazonaws.com, refresh rate 30000 ms

As the log shows, our OIDC now tries to use IPv6. However, we do not have any IPv6 connectivity in our cluster.

Example interfaces:

/home/curl_user $ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
3: eth0@if49: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 8910 qdisc noqueue state UP qlen 1000
    link/ether 6e:6c:a3:39:7c:a5 brd ff:ff:ff:ff:ff:ff
    inet 100.125.159.107/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::6c6c:a3ff:fe39:7ca5/64 scope link 
       valid_lft forever preferred_lft forever

Repro steps:


Compile the latest main.
Deploy it to an IPv4-only cluster and use an OIDC provider that has an IPv6 DNS record.
OIDC will be broken because the JWKS cannot be fetched.


https://github.com/envoyproxy/gateway/pull/4740 is perhaps the PR that is breaking this


zhaohuabing commented 5 days ago

This was introduced by https://github.com/envoyproxy/gateway/pull/4740

AUTO prioritizes IPv6 over IPv4. EG should respect the IPFamily configuration in the EnvoyProxy. A reasonable DNS lookup strategy would probably be:

V4_ONLY for default/IPv4
V6_ONLY for IPv6
AUTO for DualStack
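
The mapping proposed above can be sketched in Go as follows. This is only an illustration: `dnsLookupFamilyFor` is a hypothetical helper name, not a function from the Envoy Gateway code base, and it returns the Envoy enum value names as plain strings rather than the generated proto types.

```go
package main

import "fmt"

// IPFamily mirrors the three values discussed in this thread.
type IPFamily string

const (
	IPv4      IPFamily = "IPv4"
	IPv6      IPFamily = "IPv6"
	DualStack IPFamily = "DualStack"
)

// dnsLookupFamilyFor is a hypothetical helper (an assumption for illustration).
// It returns the name of the Envoy DnsLookupFamily value proposed for each
// IPFamily. A nil pointer means "unspecified" and is treated as IPv4.
func dnsLookupFamilyFor(family *IPFamily) string {
	if family == nil {
		return "V4_ONLY"
	}
	switch *family {
	case IPv6:
		return "V6_ONLY"
	case DualStack:
		return "AUTO"
	default:
		return "V4_ONLY"
	}
}

func main() {
	dual := DualStack
	fmt.Println(dnsLookupFamilyFor(nil))   // unspecified IPFamily
	fmt.Println(dnsLookupFamilyFor(&dual)) // DualStack IPFamily
}
```

With this mapping, a cluster that never sets IPFamily keeps resolving IPv4 addresses only, which matches the behavior before the change.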

If AUTO is specified, the DNS resolver will first perform a lookup for addresses in the IPv6 family and fallback to a lookup for addresses in the IPv4 family.

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto#envoy-v3-api-enum-value-config-cluster-v3-cluster-dnslookupfamily-auto

cc @zirain

zetaab commented 5 days ago

I reverted PR https://github.com/envoyproxy/gateway/pull/4740 from main, and now my OIDC is back working again.

So AUTO is not correct when the OIDC provider has IPv6 DNS records and the cluster has only IPv4. It is odd that Envoy does not check which interfaces it has; it simply cannot work like this. IMO Envoy should also fall back to IPv4 with the AUTO setting when it has no IPv6 interface.
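
To illustrate the kind of interface check meant here, the following Go sketch tests whether a set of interface addresses contains any IPv6 address that is neither loopback nor link-local. It is not something Envoy does today; `hasRoutableIPv6` is a hypothetical name, and the input is the address list from the `ip a` output in the report.

```go
package main

import (
	"fmt"
	"net"
)

// hasRoutableIPv6 (hypothetical helper) reports whether any of the given
// CIDR strings is an IPv6 address that is neither loopback nor link-local.
func hasRoutableIPv6(cidrs []string) bool {
	for _, c := range cidrs {
		ip, _, err := net.ParseCIDR(c)
		if err != nil {
			continue // skip anything that is not a valid CIDR
		}
		if ip.To4() != nil {
			continue // IPv4 address
		}
		if ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue // ::1 and fe80::/10 do not give outside connectivity
		}
		return true
	}
	return false
}

func main() {
	// Addresses taken from the `ip a` output in the report.
	podAddrs := []string{
		"127.0.0.1/8",
		"::1/128",
		"100.125.159.107/32",
		"fe80::6c6c:a3ff:fe39:7ca5/64",
	}
	fmt.Println(hasRoutableIPv6(podAddrs))
}
```

For the pod above the check finds only loopback and link-local IPv6 addresses, so a resolver using such a check could skip IPv6 results entirely.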

It looks like it is possible to change the behaviour with https://github.com/envoyproxy/gateway/blob/main/api/v1alpha1/envoyproxy_types.go#L149

However, the comment there says:

// If not specified, the system will operate as follows:
// - It defaults to IPv4 only.

That is no longer true.

zirain commented 5 days ago

As I recall, it was designed for the listener address; we may need another knob for your case.

zirain commented 5 days ago

A workaround would be to create an EnvoyProxy with IPFamily set to IPv4 and reference it from the GatewayClass or Gateway.

zhaohuabing commented 5 days ago

As I recall, it was designed for the listener address; we may need another knob for your case.

I think we can use the current IPFamily in the EnvoyProxy for both the listener and the DNS lookup family. The behavior below should be sufficient for most use cases, because the IP family of the Gateway listener and of the Gateway pod is consistent in most environments.

// IPFamily specifies the IP family for the EnvoyProxy fleet.
// This setting affects the Gateway listener port and the DNS resolver for the EnvoyProxy fleet.
// - IPv4 Gateway will listen on IPv4 addresses only, and the DNS resolver will resolve to IPv4 addresses only.
// - IPv6 Gateway will listen on IPv6 addresses only, and the DNS resolver will resolve to IPv6 addresses only.
// - DualStack Gateway will listen on both IPv4 and IPv6 addresses, and the DNS resolver will prefer IPv6 addresses over IPv4 addresses.
// - If unspecified, the default IP family is IPv4.
IPFamily *IPFamily `json:"ipFamily,omitempty"`

A dedicated configuration knob for DNS lookup family can be added later if people ask for it.

zirain commented 5 days ago

@zetaab can you try with V4_PREFERRED as default value on your cluster?

zirain commented 5 days ago

https://github.com/envoyproxy/gateway/pull/4690/commits/3b265169ee3579c77a3dc9ab196e64fcefebc76e passed on CI.

arkodg commented 5 days ago

+1 to V4_PREFERRED as default to maintain backwards compatibility

zhaohuabing commented 4 days ago

@zirain @arkodg I think V4_PREFERRED won't work in an IPv6-only environment, where the Envoy pod has only an IPv6 address.

If V4_PREFERRED is specified, the DNS resolver will first perform a lookup for addresses in the IPv4 family and fallback to a lookup for addresses in the IPv6 family.
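
The difference between the lookup families can be made concrete with a small Go model. This is a simplified sketch of the documented semantics, not Envoy's resolver code: `selectAddresses` is a hypothetical function, and `192.0.2.10` is a placeholder documentation address standing in for whatever A record the provider actually publishes.

```go
package main

import "fmt"

// selectAddresses (hypothetical) models which resolved addresses a cluster
// would end up using for a given DNS lookup family, given the IPv4 (A) and
// IPv6 (AAAA) results for a hostname.
func selectAddresses(family string, v4, v6 []string) []string {
	switch family {
	case "V4_ONLY":
		return v4
	case "V6_ONLY":
		return v6
	case "V4_PREFERRED":
		if len(v4) > 0 {
			return v4 // IPv4 first, IPv6 only as a fallback
		}
		return v6
	default: // "AUTO"
		if len(v6) > 0 {
			return v6 // IPv6 first, IPv4 only as a fallback
		}
		return v4
	}
}

func main() {
	v4 := []string{"192.0.2.10"} // placeholder A record (assumption)
	v6 := []string{"2a05:d014:32e:701:9334:4719:42de:9263"}

	// AUTO picks IPv6 whenever an AAAA record exists, which breaks an IPv4-only pod.
	fmt.Println("AUTO:", selectAddresses("AUTO", v4, v6))
	// V4_PREFERRED picks IPv4 whenever an A record exists, which breaks an IPv6-only pod.
	fmt.Println("V4_PREFERRED:", selectAddresses("V4_PREFERRED", v4, v6))
}
```

In both cases the fallback is driven by which records exist in DNS, not by which address family the pod can actually reach, so neither preferred mode is safe as a default for every single-stack environment.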

alrai commented 4 days ago

I encountered the following error in a pod deployed by the Gateway:

$ kubectl logs -f envoy-envoy-gateway-envoy-gateway-9dbc5803-66c67d8d54-pvmgb -n envoy-gateway
Defaulted container "envoy" out of: envoy, shutdown-manager
[2024-11-18 18:42:12.465][1][warning][misc] [source/extensions/filters/network/http_connection_manager/config.cc:88] internal_address_config is not configured. The existing default behaviour will trust RFC1918 IP addresses, but this will be changed in next release. Please explictily config internal address config as the migration step or config the envoy.reloadable_features.explicit_internal_address_config to true to untrust all ips by default
[2024-11-18 18:42:27.573][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.cluster.v3.Cluster
[2024-11-18 18:42:42.573][1][warning][config] [source/extensions/config_subscription/grpc/grpc_subscription_impl.cc:130] gRPC config: initial fetch timed out for type.googleapis.com/envoy.config.listener.v3.Listener
[2024-11-18 18:42:50.159][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 37s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: immediate connect error: Network is unreachable|remote address:[2a02:6b8::242]:18000
[2024-11-18 18:43:01.488][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:226] DeltaAggregatedResources gRPC config stream to xds_cluster closed since 48s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: immediate connect error: Network is unreachable|remote address:[2a02:6b8::242]:18000

It tries to connect to some unknown IPv6 address even though I have a single-stack k8s cluster and all pods/services have only IPv4 addresses.

NAME                                                 TYPE           CLUSTER-IP     EXTERNAL-IP     PORT(S)                                   AGE
service/envoy-envoy-gateway-envoy-gateway-9dbc5803   LoadBalancer   10.43.89.115   93.125.75.111   443:31726/TCP,8443:32345/TCP              46h
service/envoy-gateway                                ClusterIP      10.43.48.170   <none>          18000/TCP,18001/TCP,18002/TCP,19001/TCP   2d

Is that error caused by the same issue?

zetaab commented 4 days ago

@alrai yes, it's the same issue.

zirain commented 4 days ago

@zetaab can you try with https://github.com/envoyproxy/gateway/pull/4745?

zetaab commented 3 days ago

I can, but not until next week.

dnslookupfamily returns ipv6 addresses for external clusters (oidc) #4744