PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Feature request: Add vrf support #8284

Open krombel opened 5 years ago

krombel commented 5 years ago

Use case

At the moment it is not possible to define, per service, which VRF it should listen on. This applies to the Authoritative Server and the Recursor.

The main use case for me is the webserver component: I want it running on localhost (e.g. to allow metrics scraping) but not in the VRF in which the DNS service should listen, because that would require using a public IP address.

When I use e.g. ip vrf exec external /path/to/recursor [...] I am forced to have the webserver running on a public interface as well.

When I try to let it listen on 127.0.0.1 (started with ip vrf exec ...) it fails with:

Unable to bind to webserver socket on 127.0.0.1:8080: binding socket to 127.0.0.1:8080: Cannot assign requested address

When I do not use the VRF, the service is not publicly accessible.

Description

Have an option to bind to a VRF per service. If such an option already exists, only the documentation for it is missing.

Habbie commented 5 years ago

For dnsdist, can you tell us if the interface option at https://dnsdist.org/reference/config.html#addLocal covers your needs?

krombel commented 5 years ago

You are right. Adding "interface=external" worked as expected. I had missed that option. Thanks for clearing things up :slightly_smiling_face:

krombel commented 5 years ago

I just realized that this option is only available for dnsdist. How is this working for authoritative and recursor?

Habbie commented 5 years ago

How is this working for authoritative and recursor?

I don't think they can do it right now.

awlx commented 5 years ago

It also seems that the newServer call cannot handle vrfs.

If I add for example:

newServer({address="2001:608:a01::40", name="gw01", source="vrf_external"}) -- downstream servers for recursion

It still picks the outgoing IP of the master VRF. And if I define an IP with ip@vrf_external, it cannot bind anymore :/.

Oct 02 12:00:16 webfrontend01 dnsdist[27956]: Fatal Lua error: [string "chunk"]:22: Caught exception: binding socket to [2001:608:a01::3]:0: Cannot assign requested address
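For readers following along, the source values tried in this thread come in three shapes: an interface name alone, an address alone, and address@interface. The sketch below splits such a string into its parts. It is only an illustration in Python: parse_source and Source are hypothetical names, and this is not dnsdist's own parser.

```python
import ipaddress
from typing import NamedTuple, Optional


class Source(NamedTuple):
    address: Optional[str]    # source IP, or None to let the kernel choose
    interface: Optional[str]  # interface or VRF device name, or None


def parse_source(value: str) -> Source:
    """Split a `source` string into address and interface (hypothetical helper)."""
    if "@" in value:
        address, interface = value.split("@", 1)
        ipaddress.ip_address(address)  # raises ValueError if not a valid IP
        return Source(address, interface)
    try:
        ipaddress.ip_address(value)
        return Source(value, None)
    except ValueError:
        # Not an IP address, so treat the whole string as an interface name.
        return Source(None, value)
```

With this helper, "vrf_external" yields an interface only, while "2001:608:a01::3@vrf_external" yields both an address and an interface.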

rgacogne commented 5 years ago

If you only specify the interface, we ask the kernel to use that interface and leave the choice of the source IP to it. Specifying both the address and interface works here; your message means that the bind() call returned -1, setting errno to EADDRNOTAVAIL, which implies that 2001:608:a01::3 does not exist on the system.
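The failure mode described above can be reproduced outside dnsdist. The following Python sketch binds a UDP socket to an ephemeral port and reports the errno name; try_bind is a hypothetical helper, and 192.0.2.1 (a documentation-only address) stands in for an address the host does not own in the socket's routing domain.

```python
import errno
import socket


def try_bind(address: str, family: int = socket.AF_INET) -> str:
    """Bind a UDP socket to `address` on port 0; return "OK" or the errno name."""
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        try:
            sock.bind((address, 0))
        except OSError as exc:
            # Map the numeric errno to its symbolic name, e.g. EADDRNOTAVAIL.
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return "OK"
```

On a typical Linux host, try_bind("127.0.0.1") succeeds, while try_bind("192.0.2.1") is expected to report EADDRNOTAVAIL unless net.ipv4.ip_nonlocal_bind is enabled.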

awlx commented 5 years ago

The IP exists on the interface and on the system, just in a different VRF, and thus seems not to be seen by dnsdist at startup.

rgacogne commented 5 years ago

The output of ip addr show and perhaps a strace of the dnsdist process during startup would be useful, as I believe bind() should succeed here if the address exists.

awlx commented 5 years ago

Here you go:

webfrontend01.in.ffmuc.net:~# ip -br link show master vrf_external
vlan3            UP             52:54:00:c7:81:c1 <BROADCAST,MULTICAST,UP,LOWER_UP> 
webfrontend01.in.ffmuc.net:~# ip vrf 
Name              Table
-----------------------
vrf_external      1023
webfrontend01.in.ffmuc.net:~# ip -br a
lo               UNKNOWN        127.0.0.1/8 10.80.255.19/32 2001:608:a01:ffff::19/128 ::1/128 
vlan1000         UP             10.80.248.7/27 2001:608:a01:ff02:5054:ff:febe:5dbc/64 fe80::5054:ff:febe:5dbc/64 
vlan3            UP             195.30.94.28/29 2001:608:a01::3/64 2001:608:a01::27/64 fe80::5054:ff:fec7:81c1/64 
vrf_external     UP

Either with: newServer({address="2001:608:a01::40", name="gw01", source="2001:608:a01::3@vrf_external"}) or

newServer({address="2001:608:a01::40", name="gw01", source="2001:608:a01::3@vlan3"})

setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::3", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
close(6)                                = 0
close(5)                                = 0
close(4)

So neither specifying the VRF nor the interface in the VRF helps.

rgacogne commented 5 years ago

Would you by any chance be able to test after setting these two settings?

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1

This might be needed to be able to bind a socket to a different VRF context than the one the program executes in, i.e. the default one. Apparently another option would be to run dnsdist in the right VRF context via ip vrf exec....

awlx commented 5 years ago

Would you by any chance be able to test after setting these two settings?

sysctl -w net.ipv4.tcp_l3mdev_accept=1
sysctl -w net.ipv4.udp_l3mdev_accept=1

Those settings are default for all our VMs with VRF :). So it's already enabled.

This might be needed to be able to bind a socket to a different VRF context than the one the program executes in, i.e. the default one.

Nope, it's not needed for binding; it's needed to get the correct interface to answer on if you are bound to :: or 0.0.0.0, aka a global listen socket.

tcp_l3mdev_accept - BOOLEAN Enables child sockets to inherit the L3 master device index. Enabling this option allows a "global" listen socket to work across L3 master domains (e.g., VRFs) with connected sockets derived from the listen socket to be bound to the L3 domain in which the packets originated. Only valid when the kernel was compiled with CONFIG_NET_L3_MASTER_DEV. Default: 0 (disabled)
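To confirm what a given host has configured, the two sysctls can be read from /proc. This is a small Python sketch; read_l3mdev_accept is a hypothetical helper, and it returns None for a key when the file is absent (for example on a kernel built without CONFIG_NET_L3_MASTER_DEV, or on a non-Linux system).

```python
from pathlib import Path
from typing import Dict, Optional

SYSCTL_DIR = Path("/proc/sys/net/ipv4")
L3MDEV_KEYS = ("tcp_l3mdev_accept", "udp_l3mdev_accept")


def read_l3mdev_accept(base: Path = SYSCTL_DIR) -> Dict[str, Optional[int]]:
    """Return each l3mdev_accept sysctl as an int, or None if it is not readable."""
    values: Dict[str, Optional[int]] = {}
    for key in L3MDEV_KEYS:
        try:
            values[key] = int((base / key).read_text().strip())
        except (OSError, ValueError):
            # Missing file or unexpected content: report "unknown".
            values[key] = None
    return values
```

A value of 1 for both keys corresponds to the sysctl -w commands quoted above.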

Apparently another option would be to run dnsdist in the right VRF context via ip vrf exec....

Yeah, we did that before but that breaks the possibility to use it outside of the VRF :). So it should just set the correct SOCKOPTS to bind :). Like it does for the listen contexts.

addTLSLocal("0.0.0.0", ssl_cert, ssl_key, { doTCP=true, reusePort=true, interface="vrf_external" })

rgacogne commented 5 years ago

Well, "just set the correct SOCKOPTS to bind" is easy to say but not very well documented, unfortunately. The documentation at 1 states that specifying the output interface using cmsg and IP_PKTINFO like we already do should be enough, but I guess it's a lie. That's too bad because it's the only portable way of doing it. Is there any chance you could test the patch at https://github.com/PowerDNS/pdns/pull/8372 ? It tries to do it the Linux way (SO_BINDTODEVICE) in addition to the existing one but I don't have a setup to test it right now.
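As a rough illustration of the Linux-specific approach (not the code from the pull request), the Python sketch below sets SO_BINDTODEVICE on a socket and only then calls bind(), so that the address lookup can happen in the device's routing domain. bind_to_device and open_backend_socket are hypothetical names, and the fallback value 25 is the Linux constant, used only if the Python build lacks it.

```python
import socket

# Linux value of SO_BINDTODEVICE, used only if this Python build lacks the constant.
SO_BINDTODEVICE = getattr(socket, "SO_BINDTODEVICE", 25)


def bind_to_device(sock: socket.socket, ifname: str) -> bool:
    """Restrict `sock` to `ifname`; return False if the kernel refuses."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BINDTODEVICE, ifname.encode())
    except OSError:
        # Typically EPERM (missing privilege) or ENODEV (no such device).
        return False
    return True


def open_backend_socket(source_addr: str, ifname: str) -> socket.socket:
    """Create a UDP socket tied to `ifname`, then bind it to `source_addr`."""
    family = socket.AF_INET6 if ":" in source_addr else socket.AF_INET
    sock = socket.socket(family, socket.SOCK_DGRAM)
    try:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        bind_to_device(sock, ifname)  # must happen before bind()
        sock.bind((source_addr, 0))
    except OSError:
        sock.close()
        raise
    return sock
```

Something like open_backend_socket("2001:608:a01::3", "vlan3") would mirror the configuration discussed here, assuming the process is allowed to set the option.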

awlx commented 5 years ago

I will try to compile it on one of our servers in the next days and report back.

awlx commented 5 years ago

If I am not wrong I am using the version from your Draft now. But it doesn't seem to work yet.

socket(AF_INET6, SOCK_DGRAM, IPPROTO_IP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
close(6)                                = 0
close(5)                                = 0
close(4)                                = 0
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0), ...}) = 0
write(1, "Fatal Lua error: [string \"chunk\"]:22: Caught exception: binding socket to [2001:608:a01::1]:0: Cannot assign requested address\n", 127Fatal Lua error: [string "chunk"]:22: Caught exception: binding socket to [2001:608:a01::1]:0: Cannot assign requested address
) = 127
exit_group(1)                           = ?
+++ exited with 1 +++

legacygw.in.ffmuc.net:~# /usr/local/bin/dnsdist --version
dnsdist 0.0.17744.0.ddistvrfitf.g70b0d0e296 (Lua 5.3.3)
Enabled features: ebpf ipcipher recvmmsg/sendmmsg 

Config looks like this: newServer({address="2001:608:a01::40", name="gw01", source="2001:608:a01::1@vlan3"}) -- downstream servers for recursion

ip is there: vlan3 UP 195.30.94.27/29 2001:608:a01::53/64 2001:608:a01::1/64 fe80::5054:ff:fe02:a677/64

BarbarossaTM commented 5 years ago

That looks like setsockopt (SO_BINDTODEV..) isn't called. What does ltrace say about that?

Maybe run ltrace -e setsockopt for more readability :-)

Looking at the patch linked above, I'm wondering where SO_BINDTODEVICE gets defined and whether it is defined on your system. Maybe remove the #ifdef and try again (just a hunch).

rgacogne commented 5 years ago

I just found a bug in the patch (duplicate variable name leading to a variable being shadowed, not sure why it's not reported by the compiler), I'll update it soon. Sorry about that, I tested a previous version and introduced the bug when cleaning it up..

rgacogne commented 5 years ago

Updated.

awlx commented 5 years ago

Ok, that looks better at startup:

setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
setsockopt(7, SOL_SOCKET, SO_BINDTODEVICE, "vlan3", 5) = 0
bind(7, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
connect(7, {sa_family=AF_INET6, sin6_port=htons(53), inet_pton(AF_INET6, "2001:608:a01::40", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
mmap(NULL, 17825792, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4eb78e0000
fstat(1, {st_mode=S_IFCHR|0600, st_rdev=makedev(0x88, 0x1), ...}) = 0
write(1, "Added downstream server [2001:60"..., 46Added downstream server [2001:608:a01::40]:53
) = 46
close(5)

But it seems it does a rebind(?) every once in a while which does not call that function?

[pid 14073] setsockopt(17, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 14073] setsockopt(17, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 14073] bind(17, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
[pid 14073] close(17)                   = 0
[pid 14073] nanosleep({tv_sec=1, tv_nsec=0},  <unfinished ...>

Additional ltrace output:

legacygw.in.ffmuc.net:~/pdns/pdns/dnsdistdist# ltrace -e setsockopt /usr/local/bin/dnsdist --supervised --disable-syslog -u _dnsdist -g _dnsdist -C /etc/dnsdist/dnsdist.conf 
dnsdist->setsockopt(7, 1, 2, 0x7fff5c28af94)                                                                                                                                                                                     = 0
dnsdist->setsockopt(7, 1, 25, 0x55f9cd41b490)                                                                                                                                                                                    = 0
Added downstream server [2001:608:a01::40]:53
dnsdist->setsockopt(5, 1, 2, 0x7fff5c28b584)                                                                                                                                                                                     = 0
dnsdist->setsockopt(8, 1, 2, 0x7fff5c28b8a4)                                                                                                                                                                                     = 0
Calling setKey() while libsodium support has not been enabled is not secure, and will result in cleartext communications
dnsdist->setsockopt(4, 0, 15, 0x7fff5c28c51c)                                                                                                                                                                                    = 0
dnsdist->setsockopt(4, 0, 10, 0x7fff5c28c2b4)                                                                                                                                                                                    = 0
dnsdist->setsockopt(9, 1, 2, 0x7fff5c28c2c4)                                                                                                                                                                                     = 0
dnsdist->setsockopt(9, 6, 9, 0x7fff5c28c2c4)                                                                                                                                                                                     = 0
dnsdist->setsockopt(9, 0, 15, 0x7fff5c28c51c)                                                                                                                                                                                    = 0
Listening on 127.0.0.1:53
dnsdist 0.0.17744.0.ddistvrfitf.g70b0d0e296 comes with ABSOLUTELY NO WARRANTY. This is free software, and you are welcome to redistribute it according to the terms of the GPL version 2
ACL allowing queries from: 0.0.0.0/0, ::/0
Console ACL allowing connections from: ::1/128, 127.0.0.1/8
Warning, this configuration can use more than 5013 file descriptors, web server and console connections not included, and the current limit is 1024.
You can increase this value by using ulimit.
Webserver launched on 10.80.248.21:8127
dnsdist->setsockopt(13, 1, 2, 0x7fff5c28c18c)                                                                                                                                                                                    = 0
dnsdist->setsockopt(13, 1, 2, 0x7fff5c28c1ac)                                                                                                                                                                                    = 0
Marking downstream gw01 ([2001:608:a01::40]:53) as 'down'
Accepting control connections on 127.0.0.1:5199
Error while retrieving the security update for version dnsdist-0.0.17744.0.ddistvrfitf.g70b0d0e296: Unable to get a valid Security Status update
Not validating response for security status update, this is a non-release version.
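Unlike strace, ltrace prints the setsockopt level and option as raw numbers. The Python sketch below decodes the pairs that appear in the output above, using the Linux values of the constants; the table and the decode_ltrace_line helper are illustrative and cover only these pairs.

```python
import re
from typing import Optional

# Linux numeric values for the (level, optname) pairs seen in the ltrace output.
LEVELS = {0: "SOL_IP", 1: "SOL_SOCKET", 6: "SOL_TCP"}
OPTIONS = {
    (1, 2): "SO_REUSEADDR",
    (1, 25): "SO_BINDTODEVICE",
    (0, 10): "IP_MTU_DISCOVER",
    (0, 15): "IP_FREEBIND",
    (6, 9): "TCP_DEFER_ACCEPT",
}

SETSOCKOPT_RE = re.compile(r"setsockopt\((\d+),\s*(\d+),\s*(\d+),")


def decode_ltrace_line(line: str) -> Optional[str]:
    """Turn an ltrace setsockopt line into 'fd N: LEVEL/OPTION', or None."""
    match = SETSOCKOPT_RE.search(line)
    if match is None:
        return None
    fd, level, optname = (int(g) for g in match.groups())
    level_name = LEVELS.get(level, str(level))
    option_name = OPTIONS.get((level, optname), str(optname))
    return f"fd {fd}: {level_name}/{option_name}"
```

Read this way, the second call on fd 7 is SOL_SOCKET/SO_BINDTODEVICE, which matches the strace output for the startup socket, while the later calls on fd 13 are only SO_REUSEADDR.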

rgacogne commented 5 years ago

But it seems it does a rebind(?) every once in a while which does not call that function?

It's the health check, which creates a new socket every time. Of course we need to apply the new socket option there as well, on it!

rgacogne commented 5 years ago

Pushed! Thanks a lot for the feedback, this is very much appreciated!

awlx commented 5 years ago

No problem :). I want the feature so it's my responsibility to help to test it if someone has the time and mood to implement it :).

Thank you very much for this!

It seems it doesn't work yet, it's not using the SO_BINDTODEVICE yet.

[pid  2475] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid  2475] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid  2475] bind(18, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
[pid  2475] close(18)       

rgacogne commented 5 years ago

Should be better now, the option needs to be applied before we attempt to bind!

awlx commented 5 years ago

It seems now the setting gets overwritten right after setting it. And we get an "Operation not permitted" ... maybe because privileges get dropped before?

[pid 10800] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 10800] setsockopt(18, SOL_SOCKET, SO_BINDTODEVICE, "vlan3", 5) = -1 EPERM (Operation not permitted)
[pid 10800] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 10800] bind(18, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
[pid 10800] close(18)                   = 0

rgacogne commented 5 years ago

I'm not sure why you think it's overwritten? But yes, you are right, we have an issue here because setting SO_BINDTODEVICE requires CAP_NET_RAW which we don't have anymore at this point. I'm actually not sure how we should fix that..
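One way to check whether a running process still holds CAP_NET_RAW is to look at the CapEff mask in /proc/<pid>/status. The Python sketch below does this for the current process; has_capability and effective_caps are hypothetical helper names, and 13 is the bit number of CAP_NET_RAW in linux/capability.h.

```python
from typing import Optional

CAP_NET_RAW = 13  # bit number from linux/capability.h


def has_capability(cap_mask_hex: str, cap: int = CAP_NET_RAW) -> bool:
    """Return True if bit `cap` is set in a hex capability mask such as CapEff."""
    return bool((int(cap_mask_hex, 16) >> cap) & 1)


def effective_caps(status_path: str = "/proc/self/status") -> Optional[str]:
    """Return the CapEff hex mask from a /proc status file, or None if unavailable."""
    try:
        with open(status_path) as handle:
            for line in handle:
                if line.startswith("CapEff:"):
                    return line.split()[1]
    except OSError:
        pass
    return None
```

For example, a mask of 0000000000002000 has only CAP_NET_RAW set, and a mask of all zeros means the process has no effective capabilities left.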

awlx commented 5 years ago

I'm not sure why you think it's overwritten?

Actually thought the wrong way sorry :). Was about to edit it but you were faster ;).

But yes, you are right, we have an issue here because setting SO_BINDTODEVICE requires CAP_NET_RAW which we don't have anymore at this point. I'm actually not sure how we should fix that..

Is it possible to bind the healthcheck socket once at startup and keep it? But I think that would miss the point of the feature :/.

rgacogne commented 5 years ago

Is it possible to bind the healthcheck socket once at startup and keep it? But I think that would miss the point of the feature :/.

It would be possible to do so but that might prevent us from detecting some network issues in the health check. I think our best option would be to keep CAP_NET_RAW when at least one backend has specified a source interface. It would require a bit of work but that seems doable.

awlx commented 5 years ago

That sounds like a good solution.

rgacogne commented 5 years ago

I pushed a commit to do just that, note that it is quite experimental and might require updating CapabilityBoundingSet if you use our systemd unit file.

awlx commented 5 years ago

At the moment I am just running it as root directly from the shell; is that enough or am I missing something? It seems it's not getting passed on.

[pid  5478] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid  5478] setsockopt(18, SOL_SOCKET, SO_BINDTODEVICE, "vlan3", 5) = -1 EPERM (Operation not permitted)
[pid  5478] setsockopt(18, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid  5478] bind(18, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = -1 EADDRNOTAVAIL (Cannot assign requested address)
[pid  5478] close(18)      

rgacogne commented 5 years ago

That's weird, it does work for me™. Could you strace from the beginning? I'm interested in the capget and capset syscalls. Could you also run it without -u or -g, just to be sure?

rgacogne commented 5 years ago

Right, with -u or -g we lose all capabilities as soon as we call setuid(). The code to keep the capability around only works when started under systemd with User= and CapabilityBoundingSet= correctly set, or if you start as root without -u or -g.

awlx commented 5 years ago

Yep, I had -u and -g set; now it works!

[pid 10916] setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 10916] setsockopt(12, SOL_SOCKET, SO_BINDTODEVICE, "vlan3", 5) = 0
[pid 10916] setsockopt(12, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
[pid 10916] bind(12, {sa_family=AF_INET6, sin6_port=htons(0), inet_pton(AF_INET6, "2001:608:a01::1", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
[pid 10916] connect(12, {sa_family=AF_INET6, sin6_port=htons(53), inet_pton(AF_INET6, "2001:608:a01::40", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, 28) = 0
[pid 10916] sendmsg(12, {msg_name={sa_family=AF_INET6, sin6_port=htons(53), inet_pton(AF_INET6, "2001:608:a01::40", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="\20\34\1\0\0\1\0\0\0\0\0\0\1a\froot-servers\3net\0"..., iov_len=36}], msg_iovlen=1, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32}], msg_controllen=40, msg_flags=0}, 0) = 36
[pid 10916] poll([{fd=12, events=POLLIN}], 1, 1000) = 1 ([{fd=12, revents=POLLIN}])
[pid 10916] recvfrom(12, "\20\34\201\200\0\1\0\1\0\0\0\0\1a\froot-servers\3net\0"..., 4096, 0, {sa_family=AF_INET6, sin6_port=htons(53), inet_pton(AF_INET6, "2001:608:a01::40", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28]) = 52

And the backend is up!

[Image: Screenshot 2019-10-03 at 17 53 49]

And the healthchecks use the correct interface!

[Image: Screenshot 2019-10-03 at 17 55 00]