Issue with Cloudflared using NSX-T FQDN Filtering rules

stalkntom commented 1 year ago

Running Cloudflared as a Windows Service on Server 2019. Tunnels start out Healthy, but over time they show as Degraded and eventually go Down.

Servers have NSX-T DNS and FQDN allow list rules: https://docs.vmware.com/en/VMware-NSX-T-Data-Center/3.2/administration/GUID-63262728-CA72-47D2-8E4F-16617B63A9A4.html

These rules allow traffic to the documented Ports and IPs: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/do-more-with-tunnels/ports-and-ips/

These rules depend on the Service resolving the endpoints, in line with the DNS TTL, to populate the FQDN allow list. When the Tunnel shows as down or degraded, running nslookup region1.v2.argotunnel.com and/or nslookup region2.v2.argotunnel.com brings it back to a healthy state. Restarting the service entirely achieves the same result.

While the Tunnels are degraded/down logs present as:

{"level":"warn","connIndex":2,"time":"2023-03-21T23:45:31Z","message":"If this log occurs persistently, and cloudflared is unable to connect to Cloudflare Network with quic protocol, then most likely your machine/network is getting its egress UDP to port 7844 (or others) blocked or dropped. Make sure to allow egress connectivity as per https://developers.cloudflare.com/cloudflare-one/connections/connect-apps/configuration/ports-and-ips/ If you are using private routing to this Tunnel, then UDP (and Private DNS Resolution) will not workunless your cloudflared can connect with Cloudflare Network with quic."}
{"level":"info","connIndex":2,"time":"2023-03-21T23:45:31Z","message":"Switching to fallback protocol http2"}
{"level":"error","connIndex":0,"error":"DialContext error: dial tcp 198.41.200.233:7844: i/o timeout","time":"2023-03-21T23:45:41Z","message":"Unable to establish connection with Cloudflare edge"}
{"level":"error","connIndex":0,"error":"DialContext error: dial tcp 198.41.200.233:7844: i/o timeout","time":"2023-03-21T23:45:41Z","message":"Serve tunnel error"}
{"level":"info","connIndex":0,"time":"2023-03-21T23:45:41Z","message":"Retrying connection in up to 32s seconds"}
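To see which edge addresses the failures cluster on, the JSON log entries can be tallied by their "error" field. The following is a minimal sketch, assuming the log is available with one JSON object per line; the function name summarize_dial_failures is made up for illustration and is not part of cloudflared.

```python
import json
from collections import Counter


def summarize_dial_failures(lines):
    """Count error-level log entries, grouped by their 'error' string.

    Assumes one JSON object per line; lines that are not valid JSON
    are skipped rather than treated as failures.
    """
    failures = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") == "error" and "error" in entry:
            failures[entry["error"]] += 1
    return failures


# Three of the entries quoted in this issue, one per line.
sample = [
    '{"level":"info","connIndex":2,"time":"2023-03-21T23:45:31Z",'
    '"message":"Switching to fallback protocol http2"}',
    '{"level":"error","connIndex":0,"error":"DialContext error: dial tcp '
    '198.41.200.233:7844: i/o timeout","time":"2023-03-21T23:45:41Z",'
    '"message":"Unable to establish connection with Cloudflare edge"}',
    '{"level":"error","connIndex":0,"error":"DialContext error: dial tcp '
    '198.41.200.233:7844: i/o timeout","time":"2023-03-21T23:45:41Z",'
    '"message":"Serve tunnel error"}',
]

for error, count in summarize_dial_failures(sample).items():
    print(count, error)
```

Comparing the addresses in these errors with what nslookup currently returns may help show whether cloudflared is dialing IPs that the FQDN rule no longer covers.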

While the connections fail, we see NSX-T dropping the traffic as the FQDN rules are not applying. Once nslookup has resolved the endpoints, the FQDN rules update and begin allowing traffic.
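The behaviour described above can be illustrated with a toy model. This is only a sketch of the hypothesis, not NSX-T's actual implementation: it assumes the allow list learns an IP from each DNS answer it sees and forgets it once that answer's TTL has passed. The class name FqdnAllowList, the 300 second TTL, and the timestamps are all invented for the example.

```python
class FqdnAllowList:
    """Toy model of a TTL-driven FQDN allow list (hypothetical)."""

    def __init__(self):
        self._expiry = {}  # ip -> time (seconds) at which the entry lapses

    def observe_dns_answer(self, ip, ttl, now):
        """Record an IP seen in a DNS answer; it stays allowed for ttl seconds."""
        self._expiry[ip] = now + ttl

    def allows(self, ip, now):
        """Return True while the entry for this IP has not yet lapsed."""
        return now < self._expiry.get(ip, float("-inf"))


rules = FqdnAllowList()
edge_ip = "198.41.200.233"  # address taken from the log lines in this issue

# t=0: the service starts and resolves the endpoint (illustrative TTL of 300s).
rules.observe_dns_answer(edge_ip, ttl=300, now=0)
print(rules.allows(edge_ip, now=10))   # True: first connection is allowed

# t=400: a reconnect to the same IP with no fresh lookup in between.
print(rules.allows(edge_ip, now=400))  # False: the entry has lapsed

# t=410: a manual nslookup is observed and refreshes the entry.
rules.observe_dns_answer(edge_ip, ttl=300, now=410)
print(rules.allows(edge_ip, now=420))  # True: traffic is allowed again
```

If the model holds, any client that keeps dialing a previously resolved IP after the TTL has passed, without issuing a new lookup, would be dropped until something else triggers a resolution.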

Looking to understand how the Tunnels resolve the endpoints over time. Is this done over the tunnel, or is the Server itself supposed to do the resolution? Does cloudflared respect the TTL?

We are still running an older version (2022.11.0 from chocolatey). Haven't tested whether this is an issue in later versions. Also haven't tested running on Linux or in a container.

DevinCarr commented 1 year ago

The server hosting cloudflared is expected to allow egress to, and resolve, the region1.v2.argotunnel.com and region2.v2.argotunnel.com endpoints, as they are required for the Anycast DNS resolution of the Cloudflare network. We don't perform this over the tunnel because at startup the tunnel has no connection to Cloudflare, so its first operation is to attempt to establish that connection. Even once a connection is acquired for a tunnel, cloudflared is still expected to be able to resolve these DNS records and does not resolve them over the tunnel.

DevinCarr commented 1 year ago

I forgot to mention: the very first request is an SRV lookup to _v2-origintunneld._tcp.argotunnel.com, which returns region1.v2.argotunnel.com and region2.v2.argotunnel.com. So maybe check whether that request is being blocked?

stalkntom commented 1 year ago

Thanks for clarifying. In a down state, a Windows Service restart brings the tunnel back to healthy. I assume this forces a lookup of the addresses. Under what other conditions will cloudflared resolve these addresses? When I raised this with the NSX-T engineers, they suggested that cloudflared is not respecting the DNS TTL, which is what the FQDN allowlist rule is based on.

I can confirm the SRV lookup to _v2-origintunneld._tcp.argotunnel.com is working. I don't think this is an issue with DNS requests being blocked. The DNS lookups are always successful. On a successful lookup, the FQDN Allowlist rules are updated and allow traffic.

The issue is that over time these FQDN Allowlist rules stop working. It appears that the VMs stop resolving the addresses altogether, which is why a Service restart or a manual nslookup brings back functionality. I've put a scheduled task in place to script the lookups and log the results. I'll check Monday and see if anything stands out.
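A scheduled lookup of this kind could be sketched as follows. This is not the script actually used here; it is a minimal example assuming Python is available on the server, with resolve_all as a made-up helper name. One caveat: socket.getaddrinfo goes through the operating system's resolver, which may answer from a local cache, so unlike nslookup it is not guaranteed to put a query on the wire on every run.

```python
import socket
from datetime import datetime, timezone

# Hostnames taken from this issue; 7844 is the port shown in the logs.
HOSTNAMES = ["region1.v2.argotunnel.com", "region2.v2.argotunnel.com"]


def resolve_all(hostnames, resolver=socket.getaddrinfo):
    """Resolve each hostname once and return one log line per hostname.

    The resolver argument exists only so the function can be exercised
    without network access; by default the OS resolver is used.
    """
    lines = []
    for host in hostnames:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        try:
            infos = resolver(host, 7844, proto=socket.IPPROTO_TCP)
            addrs = sorted({info[4][0] for info in infos})
            lines.append(f"{stamp} OK {host} {' '.join(addrs)}")
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            lines.append(f"{stamp} FAIL {host} {exc}")
    return lines


# Intended use from a scheduled task: run one pass and append the output
# to a log file, for example:
#     for line in resolve_all(HOSTNAMES):
#         print(line)
```

Logging the returned addresses alongside the tunnel state should make it easier to tell whether the tunnel degrades when the resolved IPs change or simply when lookups stop happening.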

cloudflare / cloudflared

Issue with Cloudflared using NSX-T FQDN Filtering rules #922