BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

BOINC crashes when gdm is up or shutdown on fedora 29 (or maybe whatever) with nvidia driver #2798

Open ghost opened 6 years ago

ghost commented 6 years ago

This is a cross-post from Google Groups, but I think this is the right place, so I am posting the same text here.

This may be a rare case, but I need help.

I am running BOINC on Fedora 29 on my Core i7 4790K rig, which has two NVIDIA cards (1070 and 750 Ti) plus Intel integrated graphics. In BOINC, I use OpenCL on both NVIDIA cards for Collatz Conjecture, and the CPU for Citizen Science Grid and Amicable Numbers. My monitor is connected to the CPU's integrated graphics, and the NVIDIA cards are used only for BOINC, so that X window activity doesn't slow down BOINC's calculations.

To set up this configuration, I did the following:

  1. Install Fedora with GNOME as usual.
  2. Install BOINC and the NVIDIA graphics driver, which includes kernel modules, libraries for CUDA/OpenCL, and a GLX driver (and possibly more).
  3. With only the above configuration, BOINC crashed whenever I logged in or out. So I removed /lib/modules/(kernel version)/kernel/drivers/video/nvidia-drm.ko, which is one of the four kernel modules installed by the NVIDIA driver. Then I reinstalled the X server, because the NVIDIA driver replaces libglx.so (/usr/lib64/xorg/modules/extensions/libglx.so) with a symlink to NVIDIA's GLX driver. I found this replacement in /var/log/Xorg.0.log.

With that, the system runs fine as long as gdm is up. However, if I shut gdm down with "systemctl isolate multi-user" to switch to console mode, BOINC crashes. Also, I have to start BOINC after logging into X, because BOINC crashes when gdm starts.

I suspect the X server is doing something to the BOINC client, but I couldn't figure it out from the X server source. What does the BOINC client see when the X server starts up or shuts down?

Can anyone help?

Thanks in advance!!

-Tetsuji

Germano0 commented 5 years ago

Are you sure you are using GDM? It was replaced with SDDM on Fedora some years ago.

ghost commented 5 years ago

Yes, specifically gdm3.

On 12/6/18 1:24 AM, Germano Massullo wrote:

Are you sure you are using GDM? It has been replaced with SDDM


ghost commented 5 years ago

But I'm not sure it's gdm that crashes the BOINC client; Xorg may be the problem. Since the system uses systemd, I start gdm with "systemctl isolate graphical.target", which starts gdm and one Xorg server. That's when BOINC crashes. When I log into an account, another Xorg instance starts, but it doesn't affect the BOINC client at all (when the user logs out, this second Xorg stops, but the BOINC client stays alive). I don't know why, but the first Xorg may be the problem.

Returning to console mode with "systemctl isolate multi-user.target" stops both Xorg servers (or the single one, if no user is logged in), and the BOINC client crashes.

I don't know why BOINC crashes, but it really happens, and it's annoying. Could you tell me how to make BOINC write a log when it crashes, so that I can find the cause?
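When BOINC runs as boinc-client.service, its console output already goes to the systemd journal (the excerpts later in this thread are journal lines). So `journalctl -u boinc-client -b` shows its last messages before it went away. If systemd-coredump is collecting core dumps, `coredumpctl list boinc` should list any real crashes. The useful first question is whether BOINC crashed or was stopped.

Below is a minimal sketch, assuming the journal text has already been captured as a string. `classify_exit` is a hypothetical helper, not part of BOINC or systemd. It matches the message patterns seen in this thread's logs, plus systemd's `code=dumped` wording for a process that died with a core dump. Other BOINC or systemd versions may word these messages differently.

```python
import re

# Hypothetical helper (not part of BOINC or systemd): guess why the BOINC client
# ended, from journal text such as `journalctl -u boinc-client -b -o short`.
# The patterns are copied from the log excerpts in this thread.
SIGNAL_RE = re.compile(r"Received signal (\d+)")


def classify_exit(journal_text: str) -> str:
    lines = journal_text.splitlines()
    # systemd announces its own stop jobs as "systemd[1]: Stopping <description>..."
    stop_job = any("systemd[1]: Stopping" in line for line in lines)
    # BOINC logs the number of the signal it received before exiting
    signals = [int(m.group(1)) for m in map(SIGNAL_RE.search, lines) if m]
    if stop_job and 15 in signals:
        # SIGTERM delivered as part of a systemd stop job: an orderly stop, not a crash
        return "stopped by systemd (SIGTERM)"
    # systemd reports a process that died with a core dump as "code=dumped"
    if any("code=dumped" in line for line in lines):
        return "crashed (core dumped)"
    if signals:
        return f"received signal {signals[-1]}"
    return "no exit found"
```

For example, pass it `subprocess.run(["journalctl", "-u", "boinc-client", "-b", "-o", "short"], capture_output=True, text=True).stdout`. Reading the system journal usually requires root or membership in the systemd-journal group.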

JuhaSointusalo commented 5 years ago

@maverick6664

Did you install BOINC from the Fedora repository, or is it self-built?

ghost commented 5 years ago

It's a bit complicated. I installed it as a Fedora package, then overwrote boinc and boinc_client in /usr/bin with self-built binaries. But the problem basically happened with the original binaries too.

ghost commented 5 years ago

Now I feel this is not a BOINC problem but a systemd one. On another Fedora 29 notebook, "systemctl isolate multi-user" shuts down wpa_supplicant as well as gdm. On that notebook, boinc-client (which isn't attached to any projects) keeps running.

Germano0 commented 5 years ago

The BOINC client works fine with the Fedora package and the NVIDIA proprietary drivers from RPM Fusion. As a Fedora package co-maintainer, I cannot provide support for handmade hacks.

ghost commented 5 years ago

I confirmed that the problem persists after reinstalling the boinc-client package.

But this should be a systemd problem, since something similar happens with wpa_supplicant on another machine. Switching from multi-user to graphical, or vice versa, seems to kill a service. On one of my machines it is boinc-client that gets killed, and on the other it's wpa_supplicant.

Anyway, here is the sequence of commands showing what happens to the boinc-client service (I have disabled boinc-client.service for now). In graphical.target mode:

$ systemctl status boinc-client

● boinc-client.service - Berkeley Open Infrastructure Network Computing Client
   Loaded: loaded (/usr/lib/systemd/system/boinc-client.service; disabled; vend>
   Active: active (running) since Thu 2018-12-06 06:36:34 JST; 32min ago
     Docs: man:boinc(1)
 Main PID: 12309 (boinc)
    Tasks: 50 (limit: 4915)
   Memory: 3.9G
   CGroup: /system.slice/boinc-client.service
           ├─12309 /usr/bin/boinc
           ├─12376 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─12377 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─12378 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─12379 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─12380 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─12382 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─12385 ../../projects/csgrid.org_csg/exact_client_0.33 --training_f>
           ├─14535 ../../projects/boinc.thesonntags.com_collatz/collatz_sieve_1>
           └─15523 ../../projects/boinc.thesonntags.com_collatz/collatz_sieve_1>

Dec 06 07:07:29 maverick boinc[12309]: No protocol specified
Dec 06 07:07:29 maverick boinc[12309]: No protocol specified
Dec 06 07:07:30 maverick boinc[12309]: No protocol specified
Dec 06 07:07:30 maverick boinc[12309]: No protocol specified
Dec 06 07:07:31 maverick boinc[12309]: No protocol specified
Dec 06 07:07:31 maverick boinc[12309]: No protocol specified
Dec 06 07:07:42 maverick boinc[12309]: 06-Dec-2018 07:07:42 [Citizen Science Gr>
Dec 06 07:07:42 maverick boinc[12309]: 06-Dec-2018 07:07:42 [Citizen Science Gr>
Dec 06 07:07:45 maverick boinc[12309]: 06-Dec-2018 07:07:45 [Citizen Science Gr>
Dec 06 07:07:45 maverick boinc[12309]: 06-Dec-2018 07:07:45 [Citizen Science Gr>

After running "systemctl isolate multi-user" as root:

systemctl status boinc-client

● boinc-client.service - Berkeley Open Infrastructure Network Computing Client
   Loaded: loaded (/usr/lib/systemd/system/boinc-client.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:boinc(1)

Dec 06 07:07:31 maverick boinc[12309]: No protocol specified
Dec 06 07:07:31 maverick boinc[12309]: No protocol specified
Dec 06 07:07:42 maverick boinc[12309]: 06-Dec-2018 07:07:42 [Citizen Science Grid] Sending scheduler request: To fetch work.
Dec 06 07:07:42 maverick boinc[12309]: 06-Dec-2018 07:07:42 [Citizen Science Grid] Requesting new tasks for CPU
Dec 06 07:07:45 maverick boinc[12309]: 06-Dec-2018 07:07:45 [Citizen Science Grid] Scheduler request completed: got 0 new tasks
Dec 06 07:07:45 maverick boinc[12309]: 06-Dec-2018 07:07:45 [Citizen Science Grid] Project has no tasks available
Dec 06 07:10:08 maverick systemd[1]: Stopping Berkeley Open Infrastructure Network Computing Client...
Dec 06 07:10:08 maverick boinc[12309]: 06-Dec-2018 07:10:08 [---] Received signal 15
Dec 06 07:10:08 maverick boinc[12309]: 06-Dec-2018 07:10:08 [---] Exiting
Dec 06 07:10:15 maverick systemd[1]: Stopped Berkeley Open Infrastructure Network Computing Client.

Then I started boinc-client with "systemctl start boinc-client":

● boinc-client.service - Berkeley Open Infrastructure Network Computing Client
   Loaded: loaded (/usr/lib/systemd/system/boinc-client.service; disabled; vendor preset: disabled)
   Active: active (running) since Thu 2018-12-06 07:11:51 JST; 3s ago
     Docs: man:boinc(1)
 Main PID: 16861 (boinc)
    Tasks: 22 (limit: 4915)
   Memory: 1.4G
   CGroup: /system.slice/boinc-client.service
           ├─16861 /usr/bin/boinc
           ├─16929 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16930 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16931 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16932 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16934 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16935 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16938 ../../projects/csgrid.org_csg/exact_client_0.33 --training_file training_samples.bin --validation_file validation_samples.bin --testing_file testing_samples.bin>
           ├─16940 ../../projects/boinc.thesonntags.com_collatz/collatz_sieve_1.40_x86_64-pc-linux-gnuopencl_nvidia --device 0
           └─16942 ../../projects/boinc.thesonntags.com_collatz/collatz_sieve_1.40_x86_64-pc-linux-gnuopencl_nvidia --device 1

Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] max disk usage: 85.84 GB
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] (to change preferences, visit a project web site or select Preferences in the Manager)
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] Setting up project and slot directories
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] Checking active tasks
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] Setting up GUI RPC socket
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] gui_rpc_auth.cfg is empty - no GUI RPC password protection
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] Checking presence of 2069 project files
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 Initialization completed
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [Citizen Science Grid] Sending scheduler request: To fetch work.
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [Citizen Science Grid] Requesting new tasks for CPU

Then I switched back to graphical mode:

[root@maverick ~]# systemctl isolate graphical
[root@maverick ~]# systemctl status boinc-client
● boinc-client.service - Berkeley Open Infrastructure Network Computing Client
   Loaded: loaded (/usr/lib/systemd/system/boinc-client.service; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:boinc(1)

Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [---] Checking presence of 2069 project files
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 Initialization completed
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [Citizen Science Grid] Sending scheduler request: To fetch work.
Dec 06 07:11:53 maverick boinc[16861]: 06-Dec-2018 07:11:53 [Citizen Science Grid] Requesting new tasks for CPU
Dec 06 07:11:56 maverick boinc[16861]: 06-Dec-2018 07:11:56 [Citizen Science Grid] Scheduler request completed: got 0 new tasks
Dec 06 07:11:56 maverick boinc[16861]: 06-Dec-2018 07:11:56 [Citizen Science Grid] Project has no tasks available
Dec 06 07:13:07 maverick systemd[1]: Stopping Berkeley Open Infrastructure Network Computing Client...
Dec 06 07:13:07 maverick boinc[16861]: 06-Dec-2018 07:13:07 [---] Received signal 15
Dec 06 07:13:08 maverick boinc[16861]: 06-Dec-2018 07:13:08 [---] Exiting
Dec 06 07:13:14 maverick systemd[1]: Stopped Berkeley Open Infrastructure Network Computing Client.

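As seen above, boinc-client.service is stopped by systemd after either "systemctl isolate graphical" or "systemctl isolate multi-user". Each time, systemd starts a stop job ("Stopping ..."), and BOINC receives signal 15 (SIGTERM) and exits.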
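At least in these logs, the stop matches what systemctl(1) documents for `isolate`. It starts the given unit and its dependencies, and stops all other units, except those with `IgnoreOnIsolate=yes` (which defaults to false for service units). The status output shows boinc-client.service as "disabled". That suggests neither graphical.target nor multi-user.target pulls it in, so each isolate would stop it.

The sketch below is a deliberately simplified Python model of that rule, not systemd's actual algorithm. It ignores ordering, `Conflicts=`, failed jobs and implicit dependencies, and it treats `Wants=`/`Requires=` as one flat "wants" list. The unit graph is illustrative; `display-manager.service` stands in for gdm.

```python
from dataclasses import dataclass, field


# Simplified model of `systemctl isolate TARGET` (per systemctl(1)): start TARGET
# and its dependencies, stop every other active unit unless it sets
# IgnoreOnIsolate=yes. Ordering, Conflicts=, failures and implicit
# dependencies are NOT modelled.
@dataclass
class Unit:
    name: str
    wants: list[str] = field(default_factory=list)  # stands in for Wants=/Requires=
    ignore_on_isolate: bool = False  # IgnoreOnIsolate= (false by default for services)


def dependency_closure(target: str, units: dict) -> set:
    """All unit names reachable from `target` through `wants`."""
    seen, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.add(name)
        unit = units.get(name)
        if unit is not None:
            stack.extend(unit.wants)
    return seen


def isolate(target: str, units: dict, active: set) -> tuple:
    """Return (units active afterwards, units that get stopped)."""
    keep = dependency_closure(target, units)
    stopped = set()
    for name in active:
        unit = units.get(name)
        ignored = unit is not None and unit.ignore_on_isolate
        if name not in keep and not ignored:
            stopped.add(name)
    return (set(active) - stopped) | keep, stopped


# Illustrative graph: boinc-client.service is disabled, so no target wants it.
units = {
    "graphical.target": Unit("graphical.target", wants=["multi-user.target", "display-manager.service"]),
    "multi-user.target": Unit("multi-user.target"),
    "display-manager.service": Unit("display-manager.service"),
    "boinc-client.service": Unit("boinc-client.service"),
}

_, stopped = isolate("graphical.target", units, {"multi-user.target", "boinc-client.service"})
print(sorted(stopped))  # ['boinc-client.service']
```

In this model, either of two changes keeps BOINC running across both isolate commands:

- Adding "boinc-client.service" to multi-user.target's wants. This is roughly what `systemctl enable boinc-client` does if the unit's [Install] section says `WantedBy=multi-user.target`, which is an assumption not checked against the Fedora package.
- Setting `ignore_on_isolate=True`. The real equivalent would be `IgnoreOnIsolate=yes` under [Unit] in a drop-in, e.g. created with `systemctl edit boinc-client`.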