NetworkConfiguration / dhcpcd

DHCP / IPv4LL / IPv6RA / DHCPv6 client.
https://roy.marples.name/projects/dhcpcd
BSD 2-Clause "Simplified" License

Zombie process #57

Closed maravtdm closed 2 years ago

maravtdm commented 3 years ago

Versions: dhcpcd 9.99.0, NetworkManager 1.32.10

Each time I log out and back in, or go through a suspend/resume cycle, the original parent dhcpcd process becomes a zombie.

before suspend

root      4955  0.0  0.1 287620 15872 ?        Ssl  17:04   0:02 /usr/sbin/NetworkManager
dhcpcd    4956  0.0  0.0   3008  2496 ?        S    16:39   0:00  \_ dhcpcd: wlan0 [ip4]
root      4957  0.0  0.0   3012  1996 ?        S    16:39   0:00      \_ dhcpcd: [privileged proxy] wlan0 [ip4]
dhcpcd    4972  0.0  0.0   3012   300 ?        S    16:39   0:00      |   \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd    4979  0.0  0.0   3012   304 ?        S    16:39   0:00      |   \_ dhcpcd: [BOOTP proxy] 192.168.111.15
dhcpcd    4958  0.0  0.0   3004   288 ?        S    16:39   0:00      \_ dhcpcd: [control proxy] wlan0 [ip4]

after resume

root      4955  0.0  0.1 287620 15872 ?        Ssl  17:04   0:02 /usr/sbin/NetworkManager
dhcpcd    4956  0.0  0.0      0     0 ?        Z    16:39   0:00  \_ [dhcpcd] <defunct>
dhcpcd    7936  0.0  0.0   3008  2496 ?        S    16:49   0:00  \_ dhcpcd: wlan0 [ip4]
root      7937  0.0  0.0   3012  1988 ?        S    16:49   0:00      \_ dhcpcd: [privileged proxy] wlan0 [ip4]
dhcpcd    7945  0.0  0.0   3012   296 ?        S    16:49   0:00      |   \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd    7951  0.0  0.0   3012   296 ?        S    16:49   0:00      |   \_ dhcpcd: [BOOTP proxy] 192.168.111.15
dhcpcd    7938  0.0  0.0   3004   284 ?        S    16:49   0:00      \_ dhcpcd: [control proxy] wlan0 [ip4]

Another one, after 2 logout/login:

root     14687  0.0  0.1 287620 15872 ?        Ssl  17:04   0:02 /usr/sbin/NetworkManager
dhcpcd   14688  0.0  0.0      0     0 ?        Z    17:04   0:00  \_ [dhcpcd] <defunct>
dhcpcd   15381  0.0  0.0      0     0 ?        Z    17:05   0:00  \_ [dhcpcd] <defunct>
dhcpcd   26767  0.0  0.0   3016  2500 ?        S    19:10   0:00  \_ dhcpcd: wlan0 [ip4]
root     26768  0.0  0.0   3020  2052 ?        S    19:10   0:00      \_ dhcpcd: [privileged proxy] wlan0 [ip4]
dhcpcd   26789  0.0  0.0   3020   308 ?        S    19:10   0:00      |   \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd   26795  0.0  0.0   3020   312 ?        S    19:10   0:00      |   \_ dhcpcd: [BOOTP proxy] 192.168.111.15
dhcpcd   26769  0.0  0.0   3008   296 ?        S    19:10   0:00      \_ dhcpcd: [control proxy] wlan0 [ip4]

strace logs captured during a suspend/resume:

dhcpcd: https://pastebin.com/LJ9AkJxG

NetworkManager: https://pastebin.com/Xe2H80Tr
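For readers unfamiliar with the `Z` / `<defunct>` state in the listings above: a child that has exited stays in the process table as a zombie until its parent collects its exit status with a wait call. The sketch below is a generic POSIX illustration of that lifecycle, not code from dhcpcd or NetworkManager; the function name `reap_child_demo` is made up for this example.

```c
#define _POSIX_C_SOURCE 200809L	/* for waitid() and siginfo_t */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits at once, observe it as an
 * unreaped (zombie) child, then reap it.
 * Returns 0 if every step behaved as expected, -1 otherwise.
 */
static int
reap_child_demo(void)
{
	siginfo_t info;
	pid_t pid;
	int status;

	if ((pid = fork()) == -1)
		return -1;
	if (pid == 0)
		_exit(0);	/* child: exit immediately */
	assert(pid > 0);	/* parent from here on */

	/* Wait for the child to exit but leave it unreaped (WNOWAIT).
	 * It is now a zombie: ps would show it as Z / <defunct>. */
	if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == -1)
		return -1;
	if (info.si_pid != pid || info.si_code != CLD_EXITED)
		return -1;

	/* Reap it: the kernel can now drop the process table entry. */
	if (waitpid(pid, &status, 0) != pid)
		return -1;
	if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
		return -1;

	/* Nothing is left to wait for, so a second wait fails with ECHILD. */
	if (waitpid(pid, &status, 0) != -1 || errno != ECHILD)
		return -1;
	return 0;
}
```

Calling `reap_child_demo()` from a `main` function and checking for a return value of 0 exercises the whole sequence. A parent that never performs the reaping step leaves entries like PID 4956 in the "after resume" listing.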

rsmarples commented 3 years ago

Does this patch to the dhcpcd plugin for NetworkManager help at all? https://gitlab.freedesktop.org/rsmarples/NetworkManager/-/commit/9cc2591fe871b29ce9a743dfb2cd189cb1580bde

If not, do any prior versions of dhcpcd work?

maravtdm commented 3 years ago

I'll try your patch and let you know. Thanks in advance.

The distribution (Slackware) provides 9.4.0, but it doesn't include your latest patch:

DHCP6: Only send FQDN for SOLICIT, REQUEST, RENEW, or REBIND messages.
As per RFC 4704 section 5.
Fixes #44.

and after suspend/resume we have no IPv4 and no IPv4 nameserver in /etc/resolv.conf.

maravtdm commented 3 years ago

Thanks x100!! It works very well.

Before suspend:

dhcpcd   25557  0.0  0.0   3016  2560 ?        S    23:42   0:00  \_ dhcpcd: wlan0 [ip4]
root     25558  0.0  0.0   3020  1964 ?        S    23:42   0:00      \_ dhcpcd: [privileged proxy] wlan0 [ip4]
dhcpcd   25574  0.0  0.0   3020   304 ?        S    23:42   0:00      |   \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd   25580  0.0  0.0   3020   308 ?        S    23:42   0:00      |   \_ dhcpcd: [BOOTP proxy] 192.168.111.15
dhcpcd   25559  0.0  0.0   3008   296 ?        S    23:42   0:00      \_ dhcpcd: [control proxy] wlan0 [ip4]

After resume:

blackstar :: SRC/nm/NetworkManager-9cc2591fe871b29ce9a743dfb2cd189cb1580bde » ps faux | grep "[d]hcp"
dhcpcd   25901  0.0  0.0   3020  2528 ?        S    23:43   0:00  \_ dhcpcd: wlan0 [ip4]
root     25902  0.0  0.0   3024  2088 ?        S    23:43   0:00      \_ dhcpcd: [privileged proxy] wlan0 [ip4]
dhcpcd   25910  0.0  0.0   3024   308 ?        S    23:43   0:00      |   \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd   25916  0.0  0.0   3024   312 ?        S    23:43   0:00      |   \_ dhcpcd: [BOOTP proxy] 192.168.111.15
dhcpcd   25903  0.0  0.0   3008   296 ?        S    23:43   0:00      \_ dhcpcd: [control proxy] wlan0 [ip4]

rsmarples commented 3 years ago

Nice! Can you test that dhcpcd-9.4.0 works with the NetworkManager patch please?

maravtdm commented 3 years ago

Already done. P. Volkerding applied this patch to 9.4.0 last night: https://roy.marples.name/git/dhcpcd/commit/2fae4a113c3e736d585dd300ca6c8fddae300503

It is available here: http://ftp.slackware.com/pub/slackware/slackware64-current/source/n/dhcpcd/patches/dhcpcd.2fae4a113c3e736d585dd300ca6c8fddae300503.patch

The result: no IPv4 nameserver unless I kill dhcpcd before restarting NetworkManager (or before logout/login and resume/restart).

root:~/ # cat /etc/resolv.conf 
# Generated by NetworkManager
search blackstar.local
nameserver fd0f:ee:b0::1

root:~/ # killall dhcpcd

root:~/ # /etc/rc.d/rc.networkmanager restart 
Stopping NetworkManager: stopped
Starting NetworkManager daemon:  /usr/sbin/NetworkManager

root:~/ # cat /etc/resolv.conf 
# Generated by NetworkManager
search blackstar.local
nameserver 80.67.169.12
nameserver 80.67.169.40
nameserver fd0f:ee:b0::1

The fact is that I ran all my tests with your latest dev release (9.99.0), and everything works as expected. Are there any other patches we need to apply to make 9.4.0 work properly?
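The manual `cat /etc/resolv.conf` check above can be automated. The following is a minimal sketch, assuming resolv.conf-style text is already in memory; `count_nameservers` is a hypothetical helper, not part of dhcpcd or NetworkManager. It counts `nameserver` entries per address family with `inet_pton(3)`, so a count of zero IPv4 entries corresponds to the failure described in this report.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: count "nameserver" entries by address family in
 * resolv.conf-style text. Scoped addresses such as fe80::1%eth0 are not
 * handled, because inet_pton() does not accept a zone suffix.
 */
static void
count_nameservers(const char *text, int *v4, int *v6)
{
	char line[256], addr[128];
	unsigned char buf[sizeof(struct in6_addr)];
	const char *p = text;

	assert(text != NULL && v4 != NULL && v6 != NULL);
	*v4 = *v6 = 0;

	while (*p != '\0') {
		/* Copy one line (truncated if overlong) into a scratch buffer. */
		size_t len = strcspn(p, "\n");
		size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;

		memcpy(line, p, n);
		line[n] = '\0';
		p += len;
		if (*p == '\n')
			p++;

		/* Comments and other keywords fail to match and are skipped. */
		if (sscanf(line, "nameserver %127s", addr) != 1)
			continue;
		if (inet_pton(AF_INET, addr, buf) == 1)
			(*v4)++;
		else if (inet_pton(AF_INET6, addr, buf) == 1)
			(*v6)++;
	}
}
```

Fed the first listing above, it reports 0 IPv4 and 1 IPv6 nameserver; fed the second, 2 IPv4 and 1 IPv6.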

rsmarples commented 3 years ago

So to be clear, as things stand now, with the above NetworkManager patch, dhcpcd-9.4.0 doesn't list IPv4 nameservers but the development branch does after a suspend resume? If so, can you test the dhcpcd-9 branch please as that will be released as dhcpcd-9.4.1 "soon".

maravtdm commented 3 years ago

Yes, it may seem confusing, but "with the above NetworkManager patch, dhcpcd-9.4.0 doesn't list IPv4 nameservers but the development branch does after a suspend resume?" is right.

To recap, our versions are dhcpcd 9.4.0 and NetworkManager 1.32.10 (plugin dhcp=dhcpcd).

First, we had an issue with IPv4 nameservers (not present in resolv.conf), so I cloned your current dev tree of dhcpcd (9.99.0), which solves THIS particular issue. I don't know exactly which patch helped (between 9.4.0 and the current tree).

After that, everything worked fine except for the zombie process after logout/login, which was fixed by your NetworkManager patch.

So, if we don't care about the zombie process for now (not a big deal), the latest stable release of NetworkManager (1.32.10) is fine. For me, the first and only issue we have to worry about is the nameservers.

rsmarples commented 3 years ago

OK, so the only difference between dhcpcd-9 branch and master is process management. If you can test the dhcpcd-9 branch (which currently builds as version 9.4.0 still) it will hopefully work for you. Can you test this please?

maravtdm commented 3 years ago

I will.

maravtdm commented 3 years ago

doesn't work...

blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » git checkout 
Your branch is up to date with 'origin/dhcpcd-9'.
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » dhcpcd --version
dhcpcd 9.4.0
Copyright (c) 2006-2021 Roy Marples
Compiled in features: INET ARP ARPing IPv4LL INET6 DHCPv6 AUTH PRIVSEP
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » cat /etc/resolv.conf
# Generated by NetworkManager
search blackstar.local
nameserver fd0f:ee:b0::1
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » sudo /etc/rc.d/rc.networkmanager restart
Stopping NetworkManager: stopped
Starting NetworkManager daemon:  /usr/sbin/NetworkManager
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » cat /etc/resolv.conf                    
# Generated by NetworkManager
search blackstar.local
nameserver fd0f:ee:b0::1
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » sudo killall dhcpcd  
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » sudo /etc/rc.d/rc.networkmanager restart
Stopping NetworkManager: stopped
Starting NetworkManager daemon:  /usr/sbin/NetworkManager
blackstar :: SRC/GIT/dhcpcd-9 ‹dhcpcd-9› » cat /etc/resolv.conf
# Generated by NetworkManager
search blackstar.local
nameserver 80.67.169.12
nameserver 80.67.169.40
nameserver fd0f:ee:b0::1

rsmarples commented 3 years ago

Hmmmmmmm. And yet it works with the master branch of dhcpcd? I really don't understand why.

maravtdm commented 3 years ago

yes, indeed ...

blackstar :: SRC/GIT/dhcpcd-git ‹master› » git checkout       
Your branch is up to date with 'origin/master'.
blackstar :: SRC/GIT/dhcpcd-git ‹master› » dhcpcd --version
dhcpcd 9.99.0
Copyright (c) 2006-2021 Roy Marples
Compiled in features: INET ARP ARPing IPv4LL INET6 DHCPv6 AUTH PRIVSEP
blackstar :: SRC/GIT/dhcpcd-git ‹master› » cat /etc/resolv.conf
# Generated by NetworkManager
search blackstar.local
nameserver 80.67.169.12
nameserver 80.67.169.40
nameserver fd0f:ee:b0::1
blackstar :: SRC/GIT/dhcpcd-git ‹master› » sudo /etc/rc.d/rc.networkmanager restart
Stopping NetworkManager: stopped
Starting NetworkManager daemon:  /usr/sbin/NetworkManager
blackstar :: SRC/GIT/dhcpcd-git ‹master› » cat /etc/resolv.conf
# Generated by NetworkManager
search blackstar.local
nameserver 80.67.169.12
nameserver 80.67.169.40
nameserver fd0f:ee:b0::1

maravtdm commented 3 years ago

And after the suspend/resume, it works well too.

maravtdm commented 3 years ago

Here is /var/log/messages for both.

dhcpcd-master

Sep 17 12:14:27 blackstar NetworkManager[4548]: <info>  [1631873667.8985] manager: NetworkManager state is now CONNECTED_LOCAL
Sep 17 12:14:27 blackstar NetworkManager[4548]: <info>  [1631873667.8996] manager: NetworkManager state is now CONNECTED_SITE
Sep 17 12:14:27 blackstar NetworkManager[4548]: <info>  [1631873667.8997] policy: set '01_marav_5' (wlan0) as default for IPv4 routing and DNS
Sep 17 12:14:27 blackstar NetworkManager[4548]: <info>  [1631873667.9044] device (wlan0): Activation: successful, device activated.
Sep 17 12:14:27 blackstar NetworkManager[4548]: <info>  [1631873667.9051] manager: NetworkManager state is now CONNECTED_GLOBAL
Sep 17 12:14:28 blackstar NetworkManager[4548]: <info>  [1631873668.3550] policy: set '01_marav_5' (wlan0) as default for IPv6 routing and DNS

dhcpcd-9

Sep 17 12:17:13 blackstar NetworkManager[6949]: <info>  [1631873833.0012] manager: NetworkManager state is now CONNECTED_LOCAL
Sep 17 12:17:13 blackstar NetworkManager[6949]: <info>  [1631873833.0025] manager: NetworkManager state is now CONNECTED_SITE
Sep 17 12:17:13 blackstar NetworkManager[6949]: <info>  [1631873833.0026] policy: set '01_marav_5' (wlan0) as default for IPv6 routing and DNS
Sep 17 12:17:13 blackstar NetworkManager[6949]: <info>  [1631873833.0084] device (wlan0): Activation: successful, device activated.
Sep 17 12:17:13 blackstar NetworkManager[6949]: <info>  [1631873833.0088] manager: NetworkManager state is now CONNECTED_GLOBAL

This line is missing from the dhcpcd-9 log:

Sep 17 12:14:27 blackstar NetworkManager[4548]: <info>  [1631873667.8997] policy: set '01_marav_5' (wlan0) as default for IPv4 routing and DNS

rsmarples commented 3 years ago

Can you add --debug to the NetworkManager process and retest please? The logs will be much more verbose.

maravtdm commented 3 years ago

After resume :

root:log/ # cat /etc/resolv.conf 
# Generated by NetworkManager
search blackstar.local
nameserver fd0f:ee:b0::1
root:log/ # dhcpcd --version
dhcpcd 9.4.0
Copyright (c) 2006-2021 Roy Marples
Compiled in features: INET ARP ARPing IPv4LL INET6 DHCPv6 AUTH PRIVSEP

networkmanager_after_resume.txt

rsmarples commented 3 years ago

Can you repeat but with dhcpcd-master please?

maravtdm commented 3 years ago

After resume (master branch) :

blackstar :: SRC/GIT/dhcpcd-git ‹master› » cat /etc/resolv.conf    
# Generated by NetworkManager
search blackstar.local
nameserver 80.67.169.12
nameserver 80.67.169.40
nameserver fd0f:ee:b0::1
blackstar :: SRC/GIT/dhcpcd-git ‹master› » dhcpcd --version
dhcpcd 9.99.0
Copyright (c) 2006-2021 Roy Marples
Compiled in features: INET ARP ARPing IPv4LL INET6 DHCPv6 AUTH PRIVSEP

networkmanager_after_resume_master.txt

maravtdm commented 3 years ago

FYI, I hard coded this :

blackstar :: ~ » ps faux | grep "[N]etwork"
root     21965  1.2  0.1 287616 16564 ?        Ssl  20:12   0:00 /usr/sbin/NetworkManager --log-level=DEBUG

in my /etc/rc.d/rc.networkmanager

Let me know if you need more tests. I'm available, as much as it takes

maravtdm commented 3 years ago

Strange behaviour with your NetworkManager patch & dhcpcd 9.4.0 (stable release) https://gitlab.freedesktop.org/rsmarples/NetworkManager/-/tree/9cc2591fe871b29ce9a743dfb2cd189cb1580bde

After a suspend/resume (or NM restart), it looks like the previous dhcpcd processes are still there:

root:~/ # ps faux | grep -e "[d]hcp" -e "[N]etwork" 
root      1643  0.0  0.1 286552 15924 ?        Ssl  20:54   0:00 /usr/sbin/NetworkManager
dhcpcd    2839  0.0  0.0   3004  2564 ?        S    20:56   0:00  \_ dhcpcd: wlan0 [ip4]
root      2840  0.0  0.0   3008  1996 ?        S    20:56   0:00      \_ dhcpcd: [privileged actioneer] wlan0 [ip4]
dhcpcd    2865  0.0  0.0   3008   304 ?        S    20:56   0:00      |   \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd    2871  0.0  0.0   3008   304 ?        S    20:56   0:00      |   \_ dhcpcd: [network proxy] 192.168.111.15
dhcpcd    2841  0.0  0.0   2996   292 ?        S    20:56   0:00      \_ dhcpcd: [control proxy] wlan0 [ip4]
root      2505  0.0  0.0   3008  2004 ?        S    20:55   0:00 dhcpcd: [privileged actioneer] wlan0 [ip4]
dhcpcd    2513  0.0  0.0   3008   300 ?        S    20:55   0:00  \_ dhcpcd: [BPF ARP] wlan0 192.168.111.15
dhcpcd    2519  0.0  0.0   3008   300 ?        S    20:55   0:00  \_ dhcpcd: [network proxy] 192.168.111.15

Same with dhcpcd-9. No problem with dhcpcd-master.

All dhcpcd processes seem to be killed on NM restart, but only with your -master release.

maravtdm commented 3 years ago

Just in case, here is our dhcpcd.conf:

dhcpcd.conf.txt

maravtdm commented 3 years ago

I added the "debug" option to dhcpcd.conf, just in case. I did a suspend/resume with dhcpcd-9 and with dhcpcd-master; here are syslog and messages for both. Hope this helps.

From what I can see, this message in syslog (dhcpcd-9) does not appear with dhcpcd-master:

Sep 18 12:16:42 blackstar dhcpcd[19095]: ps_root_recvmsg: Connection refused

messages_9_suspend_resume.txt messages_master_suspend_resume.txt syslog_9_suspend_resume.txt syslog_master_suspend_resume.txt

maravtdm commented 3 years ago

Hi,

A member reported this on the forum :

I tried to bisect between 9.4 and master. The first commit that works as it should, with the addition of

2fae4a113c3e736d585dd300ca6c8fddae300503
DHCP6: Only send FQDN for SOLICIT, REQUEST, RENEW, or REBIND messages.

is

7f6825d3db103bb44cca71aa926c5f5fd9f544d2 
privsep: Fix Linux support for prior

That's 60 patches beyond dhcpcd-9.4.0. I wasn't able to cherry-pick it though...

Maybe it can be useful.

PJBrs commented 3 years ago

Hi, I thought I'd try and chime in (I'm the forum member that marav is citing above :-) ). I have the same issue, tried to bisect today, but from 9.3.0 to 9.4.0. I downgraded NetworkManager to version 1.28, since the later version in Slackware-current requires dhcpcd to have the --configure option. However, in terms of dhcpcd behavior, I didn't notice much difference during bisection.

I've found one decisive commit:

[77260559dd3896fca1fc415ba57a01a71aedbc57] dhcpcd: Don't create launcher process if keeping in foreground

With this commit, I get more and more dhcpcd processes after each resume, when using the dhcpcd-9 branch. If I revert it, I get a whole new set of dhcpcd processes after each resume, i.e., everything gets cleaned up nicely. However, I don't get an ip4 nameserver in resolv.conf, nor do I get an ip4 default route. I noticed also that a whole list of "dhcp4 (wlan0): option" messages is missing from the NetworkManager output in /var/log/messages.

I have very limited coding skills, I couldn't take it further than that.

Like marav says, the dhcpcd master branch works.

maravtdm commented 3 years ago

Hi, some news from our community: someone bisected the commits from 9.3 to 9.4 and reverted the last commit on the dhcpcd-9 branch:

diff --git a/src/dhcpcd.c b/src/dhcpcd.c
index 6a4c9723..9b86aa5a 100644
--- a/src/dhcpcd.c
+++ b/src/dhcpcd.c
@@ -2283,9 +2283,6 @@ printpidfile:
                logwarn("freopen stdin");

 #if defined(USE_SIGNALS) && !defined(THERE_IS_NO_FORK)
-       if (!(ctx.options & DHCPCD_DAEMONISE))
-               goto start_manager;
-
        if (xsocketpair(AF_UNIX, SOCK_DGRAM | SOCK_CXNB, 0, fork_fd) == -1 ||
            (ctx.stderr_valid &&
            xsocketpair(AF_UNIX, SOCK_DGRAM | SOCK_CXNB, 0, stderr_fd) == -1))
@@ -2376,9 +2373,8 @@ printpidfile:

        /* We have now forked, setsid, forked once more.
         * From this point on, we are the controlling daemon. */
-       logdebugx("spawned manager process on PID %d", getpid());
-start_manager:
        ctx.options |= DHCPCD_STARTED;
+       logdebugx("spawned manager process on PID %d", getpid());
        if ((pid = pidfile_lock(ctx.pidfile)) != 0) {
                logerr("%s: pidfile_lock %d", __func__, pid);
 #ifdef PRIVSEP
diff --git a/src/privsep.c b/src/privsep.c
index d574a2bc..7da3ce8d 100644
--- a/src/privsep.c
+++ b/src/privsep.c
@@ -172,13 +172,12 @@ ps_dropprivs(struct dhcpcd_ctx *ctx)
 #endif
        }

-#define DHC_NOCHKIO    (DHCPCD_STARTED | DHCPCD_DAEMONISE)
        /* Prohibit writing to files.
         * Obviously this won't work if we are using a logfile
         * or redirecting stderr to a file. */
-       if ((ctx->options & DHC_NOCHKIO) == DHC_NOCHKIO ||
-           (ctx->logfile == NULL &&
-           (!ctx->stderr_valid || isatty(STDERR_FILENO) == 1)))
+       if (ctx->logfile == NULL &&
+           (ctx->options & DHCPCD_STARTED ||
+            !ctx->stderr_valid || isatty(STDERR_FILENO) == 1))
        {
                if (setrlimit(RLIMIT_FSIZE, &rzero) == -1)
                        logerr("setrlimit RLIMIT_FSIZE");

After that, he said : "after resume I don't get all those leftover dhcpcd processes. However, with that default route and nameserver issue"

Sounds good. I haven't tested it myself yet.

rsmarples commented 3 years ago

Without that commit though, NetworkManager waitpid(2) will fail which makes the patch I posted above fail with a basic start/stop of the interface through NetworkManager. This is because it will think the launcher process controls dhcpcd which it doesn't.
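A generic illustration of why the process layout matters here: `waitpid(2)` only works on direct children of the caller, so a supervisor that forks a launcher cannot wait on a daemon that the launcher forked in turn. The sketch below assumes nothing about NetworkManager's or dhcpcd's actual code; `grandchild_wait_demo` and the "launcher"/"manager" stand-ins are hypothetical names borrowed from this discussion.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical demo: the caller forks a "launcher", which forks a "manager".
 * Returns 0 if waitpid() fails with ECHILD for the grandchild and succeeds
 * for the direct child, -1 otherwise.
 */
static int
grandchild_wait_demo(void)
{
	int report[2], hold[2], status, ok = 0;
	pid_t launcher, manager = -1;
	char c;

	if (pipe(report) == -1)
		return -1;
	if (pipe(hold) == -1) {
		close(report[0]);
		close(report[1]);
		return -1;
	}
	if ((launcher = fork()) == -1)
		return -1;

	if (launcher == 0) {
		/* Launcher stand-in: spawn the manager and report its PID. */
		pid_t m;

		close(report[0]);
		close(hold[1]);
		if ((m = fork()) == 0)
			_exit(0);	/* manager stand-in: exit at once */
		if (write(report[1], &m, sizeof(m)) != (ssize_t)sizeof(m))
			_exit(1);
		/* Stay alive until the caller closes its end of "hold". */
		while (read(hold[0], &c, 1) > 0)
			continue;
		if (m > 0)
			waitpid(m, NULL, 0);	/* only the launcher can reap it */
		_exit(0);
	}
	assert(launcher > 0);	/* original caller from here on */

	close(report[1]);
	close(hold[0]);
	if (read(report[0], &manager, sizeof(manager)) !=
	    (ssize_t)sizeof(manager))
		manager = -1;
	close(report[0]);

	if (manager > 0) {
		/* The manager is a grandchild, so this wait is refused. */
		errno = 0;
		ok = waitpid(manager, &status, 0) == -1 && errno == ECHILD;
	}

	close(hold[1]);	/* let the launcher finish */
	if (waitpid(launcher, &status, 0) != launcher)
		ok = 0;	/* the direct child must be waitable */
	return ok ? 0 : -1;
}
```

The launcher is kept alive on purpose while the caller tries to wait on the manager, so the result does not depend on who adopts an orphaned process.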

PJBrs commented 2 years ago

Hi Roy, I'm too unfamiliar with all the technicalities here to properly understand what you're saying... However, indeed without your patch it seems NetworkManager is missing a lot of information (at least, /var/log/messages is missing loads of NM dhcp4 info lines), and IP4 default gateway and nameservers don't get set. It seems I only get an IP4-address and nothing more. (IP6 working without issue with slaac.) However, with your patch the number of dhcpcd processes begins to pile up every time I resume. That was the issue I hoped to resolve by bisecting. (And it did, but not without the above detrimental side effects.) My hope is that this knowledge would give you a better clue of what's going on in the dhcpcd-9 branch :-)

In a related vein, do you have any concrete plans already for doing a release from the master branch?

maravtdm commented 2 years ago

Anyway, I have been using dhcpcd 9.99.0 and your NM patch for 10 days now without any issues.

rsmarples commented 2 years ago

I plan to release dhcpcd-9.4.1 soonish and dhcpcd-10 will follow later. I just don't have the testing capacity I once did so things have slowed down somewhat.

maravtdm commented 2 years ago

Thanks for the update, Roy. That's very good news.

I couldn't help a lot on this, but let me know if you need some kind of end-user test

maravtdm commented 2 years ago

It has been a long time since the last feedback: no problems with dhcpcd-9.4.1, so I'm closing the report.

Thanks again for everything, Roy!