danfruehauf / NetworkManager-ssh

SSH VPN integration for NetworkManager
Other
253 stars 40 forks source link

CPU usage 100% #88

Closed AdrianVollmer closed 5 years ago

AdrianVollmer commented 5 years ago

I'm using 1.2.8.-1 on Debian unstable (installed via apt) and the nm-ssh-service is at 100% CPU usage:

$ top | grep nm-ssh-service
 5309 root      20   0  173980   8736   7720 R 100.0   0.1  24:57.41 nm-ssh-service           

Any idea why this could be?

danfruehauf commented 5 years ago

Can you please reproduce it and tell me what it is doing? (Using strace?)

AdrianVollmer commented 5 years ago

Looks like it's polling a whole lot. The only line I see in strace over and over again is this:

poll([{fd=4, events=POLLIN}, {fd=10, events=POLLIN}, {fd=12, events=POLLIN}], 3, -1) = 1 ([{fd=10, revents=POLLHUP}])
AdrianVollmer commented 5 years ago

Is there anything I can do to help? I'd be happy to try to isolate the issue, but I'm not sure where to start.

danfruehauf commented 5 years ago

I think generating a log would be a good start. Otherwise probably attach a debugger to it while it is running, stop it midway (as it is chewing CPU), then see where it is stuck. Are you keen to try to do it?

AdrianVollmer commented 5 years ago

I attached gdb and got this:

(gdb) attach 19647
Attaching to process 19647
[New LWP 19650]
[New LWP 19651]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007fc438ff7b69 in __GI___poll (fds=0x555dcf8c5c90, nfds=3, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:29
29        return SYSCALL_CANCEL (poll, fds, nfds, timeout);
(gdb) where
#0  0x00007fc438ff7b69 in __GI___poll (fds=0x555dcf8c5c90, nfds=3, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007fc439118136 in ?? () from target:/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007fc4391184c2 in g_main_loop_run ()
   from target:/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x0000555dcefbf9ed in main (argc=<optimized out>, argv=<optimized out>)
    at nm-ssh-service.c:1608
(gdb) list
24
25      int
26      __poll (struct pollfd *fds, nfds_t nfds, int timeout)
27      {
28      #ifdef __NR_poll
29        return SYSCALL_CANCEL (poll, fds, nfds, timeout);
30      #else
31        struct timespec timeout_ts;
32        struct timespec *timeout_ts_p = NULL;
33

So to me it looks like the main loop isn't doing anything (except for running the default context). Here someone seems to have the same problem, but my understanding is also that the default context shouldn't use 100% cpu unless something is wrong -- which would explain why only I have this problem.

I had a similar problem once in python, where I would call select to see if a file descriptor is ready to write even if I had no data I wanted to write. That also caused a 100% CPU usage but is probably not quite related to this case.

Edit: For completeness, here I'm using the plugin which I compiled myself from the latest Github sources.

danfruehauf commented 5 years ago

I think compile it with -g -O0, it'll give us a much better understanding of what's really happening. That stack trace is rather incomplete...

AdrianVollmer commented 5 years ago

Sure. Doesn't seem that much more verbose though:

(gdb) attach 13860
Attaching to process 13860
[New LWP 13863]
[New LWP 13864]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f33ed147b69 in __GI___poll (fds=0x556a8e7fa670, nfds=3, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:29
29        return SYSCALL_CANCEL (poll, fds, nfds, timeout);
(gdb) where
#0  0x00007f33ed147b69 in __GI___poll (fds=0x556a8e7fa670, nfds=3, timeout=-1)
    at ../sysdeps/unix/sysv/linux/poll.c:29
#1  0x00007f33ed268136 in ?? () from target:/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007f33ed2684c2 in g_main_loop_run ()
   from target:/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x0000556a8d5098ca in main (argc=1, argv=0x7ffd2d9d1ef8) at nm-ssh-service.c:1608
AdrianVollmer commented 5 years ago

Judging from the strace output above, it seems that file descriptor 10 is causing this. I checked in /proc/, and fd 10 is pointing to a pipe:

lr-x------ 1 root root 64 Apr  1 15:27 10 -> pipe:[2499976]

lsof shows the two ends of the pipe:

$ lsof | grep 2499976
nm-ssh-se 15329                         root   10r     FIFO               0,12         0t0    2499976 pipe
nm-ssh-se 15329 15332 gmain             root   10r     FIFO               0,12         0t0    2499976 pipe
nm-ssh-se 15329 15333 gdbus             root   10r     FIFO               0,12         0t0    2499976 pipe
danfruehauf commented 5 years ago

Looks like something I'll have to look into. Thanks for all the information!

Xaenalt commented 5 years ago

I'm hitting this as well, any other info I can grab to help diagnose?

danfruehauf commented 5 years ago

I can reproduce this. I'll see what needs to be done to fix it.

danfruehauf commented 5 years ago

@lhw I tagged the version that fixes that as 1.2.10. I recommend deploying it as this is a severe bug.

AdrianVollmer commented 5 years ago

Thanks for looking into this. However, I'm afraid I'm still getting 100% cpu usage. Maybe I'm doing something wrong. @Xaenalt does the commit fix it for you?

lhw commented 5 years ago

I'll look into fixing that. @AdrianVollmer could you open a bug report on debian as well? So I have a solid reason to get it into debian buster as the next stable would also be affected by this.

danfruehauf commented 5 years ago

Thanks for looking into this. However, I'm afraid I'm still getting 100% cpu usage. Maybe I'm doing something wrong. @Xaenalt does the commit fix it for you?

How did you test it?

AdrianVollmer commented 5 years ago

Basically, git reset --hard ; git pull ; ./configure ; make clean; make ; sudo make install ; sudo service network-manager restart, then connect

@lhw will do

danfruehauf commented 5 years ago

Basically, git reset --hard ; git pull ; ./configure ; make clean; make ; sudo make install ; sudo service network-manager restart, then connect

I'll double check it, but I'm pretty sure it fixed the problem for me. Later I'll pull the built packages for fedora and try again. In the mean time I'll reopen that bug, for the sake of completeness.

@lhw If it doesn't work for @AdrianVollmer we can wait with the release, I'll verify it actually fixes the problem before proceeding.

AdrianVollmer commented 5 years ago

I just deleted the git repo, made double sure everything was gone, and started from scratch.. still there :( are you also on Debian buster/sid?

AdrianVollmer commented 5 years ago

Nevermind, I forgot to use --libdir=/usr/lib! Default is /usr/local/lib. It's fixed! Now I can finally run it permanently :) thanks again!

danfruehauf commented 5 years ago

Nevermind, I forgot to use --libdir=/usr/lib! Default is /usr/local/lib. It's fixed! Now I can finally run it permanently :) thanks again!

No no, thank you! Glad we got this one sorted and thanks for your time.

@lhw FYI you can release with 1.2.10, it is indeed fixed.

lhw commented 5 years ago

Updated package is in debian unstable. Now I have to find out how to get it into testing for buster

Xaenalt commented 5 years ago

Can confirm this works in Fedora 29, needed to move /usr/local/{lib, libexec} to /usr/local and everything works perfectly, thanks!

danfruehauf commented 5 years ago

@Xaenalt You welcome. It's already in f28-testing, f29-testing and also in rawhide. You should see it as an update soon. Alternatively you can install it from the testing repository.