BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Waiting while loading tasks into memory may need another informational message #3937

Open Ageless93 opened 4 years ago

Ageless93 commented 4 years ago

I'm running Rosetta tasks on my AMD Ryzen 3900X 12 core CPU, 24 threads. I've got "Use at most N% of the CPUs" set to 99%, so I leave one core or thread free for Windows and the GPU. I've got 23 Rosetta tasks of various research, with widely varying memory use. The smallest is 390MB, the largest 715MB.

It takes several seconds for all Rosetta tasks to load into memory, and during this time the manager is unresponsive. I have seen this behaviour before; it can take over a minute before the manager populates its windows.

During this time a "communicating with BOINC client, please wait" window sits on top of BOINC Manager, with "Exit BOINC Manager" and "Cancel" buttons. The Cancel button doesn't do anything. It'll just close that window and reopen it. Hitting "Exit BOINC Manager" at this point will exit the manager and client but leave all tasks running. They don't seem to get the boinc_exit() signal. Or ignore it.

20/07/2020 14:15:10 | | Starting BOINC client version 7.16.7 for windows_x86_64
20/07/2020 14:15:10 | | log flags: file_xfer, sched_ops, task, checkpoint_debug, coproc_debug, cpu_sched
20/07/2020 14:15:10 | | log flags: sched_op_debug
20/07/2020 14:15:10 | | Libraries: libcurl/7.47.1 OpenSSL/1.0.2s zlib/1.2.8
20/07/2020 14:15:10 | | Data directory: G:\BOINC
20/07/2020 14:15:10 | | Running under account elst9
20/07/2020 14:15:10 | | [coproc] launching child process at C:\Program Files\BOINC\boinc.exe
20/07/2020 14:15:10 | | [coproc] with data directory "G:\BOINC"
20/07/2020 14:15:11 | | OpenCL: AMD/ATI GPU 0: AMD Radeon RX 5700 XT (driver version 3075.12 (PAL,LC), device version OpenCL 2.0 AMD-APP (3075.12), 8176MB, 8176MB available, 4646 GFLOPS peak)
20/07/2020 14:15:11 | | [coproc] No NVIDIA library found
20/07/2020 14:15:11 | | [coproc] No ATI library found.
20/07/2020 14:15:11 | SETI@home | Found app_info.xml; using anonymous platform
20/07/2020 14:15:11 | | Windows processor group 0: 24 processors
20/07/2020 14:15:11 | | Host name: Ryzen
20/07/2020 14:15:11 | | Processor: 24 AuthenticAMD AMD Ryzen 9 3900X 12-Core Processor [Family 23 Model 113 Stepping 0]
20/07/2020 14:15:11 | | Processor features: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 htt pni ssse3 fma cx16 sse4_1 sse4_2 movebe popcnt aes f16c rdrandsyscall nx lm avx avx2 svm sse4a osvw ibs skinit wdt tce topx page1gb rdtscp fsgsbase bmi1 smep bmi2
20/07/2020 14:15:11 | | OS: Microsoft Windows 10: Professional x64 Edition, (10.00.19041.00)
20/07/2020 14:15:11 | | Memory: 31.93 GB physical, 36.68 GB virtual
20/07/2020 14:15:11 | | Disk: 3.64 TB total, 2.52 TB free
20/07/2020 14:15:11 | | Local time is UTC +2 hours
20/07/2020 14:15:11 | | No WSL found.
20/07/2020 14:15:11 | | General prefs: from http://boinc.bakerlab.org/rosetta/ (last modified 07-Jun-2020 22:36:32)
20/07/2020 14:15:11 | | Host location: none
20/07/2020 14:15:11 | | General prefs: using your defaults
20/07/2020 14:15:11 | | Reading preferences override file
20/07/2020 14:15:11 | | Preferences:
20/07/2020 14:15:11 | | max memory usage when active: 29423.95 MB
20/07/2020 14:15:11 | | max memory usage when idle: 3269.33 MB
20/07/2020 14:15:11 | | max disk usage: 50.00 GB
20/07/2020 14:15:11 | | max CPUs used: 23
20/07/2020 14:15:11 | | max download rate: 1024000 bytes/sec
20/07/2020 14:15:11 | | max upload rate: 1024000 bytes/sec
20/07/2020 14:15:11 | | (to change preferences, visit a project web site or select Preferences in the Manager)
20/07/2020 14:15:11 | | Setting up project and slot directories
20/07/2020 14:15:11 | | Checking active tasks
20/07/2020 14:15:11 | WUProp@Home | Task data_collect_v4_1592371802_368919_1 is 4.90 days overdue; you may not get credit for it. Consider aborting it.
20/07/2020 14:15:11 | collatz | URL https://boinc.thesonntags.com/collatz/; Computer ID 864695; resource share 800
20/07/2020 14:15:11 | collatz | Your settings do not allow fetching tasks for CPU. To fix this, you can change Project Preferences on the project's web site.
20/07/2020 14:15:11 | Einstein@Home | URL http://einstein.phys.uwm.edu/; Computer ID 12812132; resource share 800
20/07/2020 14:15:11 | Einstein@Home | Your settings do not allow fetching tasks for CPU. To fix this, you can change Project Preferences on the project's web site.
20/07/2020 14:15:11 | Milkyway@Home | URL http://milkyway.cs.rpi.edu/milkyway/; Computer ID 839275; resource share 800
20/07/2020 14:15:11 | Milkyway@Home | Your settings do not allow fetching tasks for CPU. To fix this, you can change Project Preferences on the project's web site.
20/07/2020 14:15:11 | minecrafthome | URL https://minecraftathome.com/minecrafthome/; Computer ID 332; resource share 800
20/07/2020 14:15:11 | Moo! Wrapper | URL http://moowrap.net/; Computer ID 1295238; resource share 800
20/07/2020 14:15:11 | Moo! Wrapper | Your settings do not allow fetching tasks for CPU. To fix this, you can change Project Preferences on the project's web site.
20/07/2020 14:15:11 | Rosetta@home | URL https://boinc.bakerlab.org/rosetta/; Computer ID 3928095; resource share 800
20/07/2020 14:15:11 | SETI@home | URL http://setiathome.berkeley.edu/; Computer ID 8882875; resource share 999
20/07/2020 14:15:11 | SETI@home Beta Test | URL http://setiweb.ssl.berkeley.edu/beta/; Computer ID 89104; resource share 1000
20/07/2020 14:15:11 | SETI@home Beta Test | Your settings do not allow fetching tasks for CPU. To fix this, you can change Project Preferences on the project's web site.
20/07/2020 14:15:11 | Universe@Home | URL https://universeathome.pl/universe/; Computer ID 548638; resource share 800
20/07/2020 14:15:11 | WUProp@Home | URL https://wuprop.boinc-af.org/; Computer ID 161551; resource share 100
20/07/2020 14:15:11 | | Setting up GUI RPC socket
20/07/2020 14:15:11 | | Checking presence of 735 project files
20/07/2020 14:15:11 | WUProp@Home | [cpu_sched] Restarting task data_collect_v4_1592371802_368919_1 using data_collect_v4 version 420 (nci) in slot 0
20/07/2020 14:15:12 | Rosetta@home | [cpu_sched] Restarting task b4k_8814_fold_SAVE_ALL_OUT_956605_919_0 using rosetta version 420 in slot 6
20/07/2020 14:15:14 | Rosetta@home | [cpu_sched] Restarting task b4k_5283_fold_SAVE_ALL_OUT_958064_919_0 using rosetta version 420 in slot 7
20/07/2020 14:15:16 | Rosetta@home | [cpu_sched] Restarting task b3x_7312_fold_SAVE_ALL_OUT_957241_919_0 using rosetta version 420 in slot 10
20/07/2020 14:15:17 | Rosetta@home | [cpu_sched] Restarting task b3x_1505_fold_SAVE_ALL_OUT_956729_919_0 using rosetta version 420 in slot 1
20/07/2020 14:15:19 | Rosetta@home | [cpu_sched] Restarting task b4d_2271_fold_SAVE_ALL_OUT_954585_30_0 using rosetta version 420 in slot 13
20/07/2020 14:15:20 | Rosetta@home | [cpu_sched] Restarting task bmpr1_attempt3_5_SAVE_ALL_OUT_IGNORE_THE_REST_3gr3hd8t_951626_2_0 using rosetta version 420 in slot 14
20/07/2020 14:15:22 | Rosetta@home | [cpu_sched] Restarting task bmpr1_attempt3_1_SAVE_ALL_OUT_IGNORE_THE_REST_5ys1iz7k_951182_2_0 using rosetta version 420 in slot 15
20/07/2020 14:15:24 | Rosetta@home | [cpu_sched] Restarting task b3x_4810_fold_SAVE_ALL_OUT_956985_919_0 using rosetta version 420 in slot 16
20/07/2020 14:15:25 | Rosetta@home | [cpu_sched] Restarting task JHR_b2_03614_n_full_17_0000100003_0000011_0_fragments_fold_SAVE_ALL_OUT_965143_24_0 using rosetta version 420 in slot 17
20/07/2020 14:15:27 | Rosetta@home | [cpu_sched] Restarting task bmpr2_attempt3_0_SAVE_ALL_OUT_IGNORE_THE_REST_5jn2ih1o_951189_2_0 using rosetta version 420 in slot 18
20/07/2020 14:15:29 | Rosetta@home | [cpu_sched] Restarting task bmpr2_attempt3_9_SAVE_ALL_OUT_IGNORE_THE_REST_4eu5sp8y_951666_2_0 using rosetta version 420 in slot 19
20/07/2020 14:15:30 | Rosetta@home | [cpu_sched] Restarting task b3x_7287_fold_SAVE_ALL_OUT_957239_919_0 using rosetta version 420 in slot 20
20/07/2020 14:15:32 | Rosetta@home | [cpu_sched] Restarting task b3x_1638_fold_SAVE_ALL_OUT_956733_919_0 using rosetta version 420 in slot 21
20/07/2020 14:15:34 | Rosetta@home | [cpu_sched] Restarting task rb_07_18_31941_31639t0000_C1_SAVE_ALL_OUT_IGNORE_THE_REST_1002521_211_0 using rosetta version 420 in slot 22
20/07/2020 14:15:36 | Rosetta@home | [cpu_sched] Restarting task b3x_4758_fold_SAVE_ALL_OUT_956974_919_0 using rosetta version 420 in slot 23
20/07/2020 14:15:38 | Rosetta@home | [cpu_sched] Restarting task JHR_bd4_02142_n_0000100001_0000021_0_fragments_fold_SAVE_ALL_OUT_971108_2_0 using rosetta version 420 in slot 11
20/07/2020 14:15:39 | Rosetta@home | [cpu_sched] Restarting task tgfbR2_3_SAVE_ALL_OUT_IGNORE_THE_REST_8cs3zr6o_958361_1_0 using rosetta version 420 in slot 9
20/07/2020 14:15:41 | Rosetta@home | [cpu_sched] Restarting task b4k_20702_fold_SAVE_ALL_OUT_957940_919_0 using rosetta version 420 in slot 12
20/07/2020 14:15:43 | Rosetta@home | [cpu_sched] Restarting task bmpr1_attempt3_4_SAVE_ALL_OUT_IGNORE_THE_REST_8tm0cb3e_951620_2_0 using rosetta version 420 in slot 3

Ageless93 commented 4 years ago

What I want to propose is that instead of the "Communicating with BOINC client, please wait" we put "Loading tasks into memory, please wait" in a window here. Because that is what's happening. Now people may think something's hung. I thought for a while that something had hung; it didn't register that 14GB of memory was being loaded and that this cannot be done instantaneously.

RichardHaselgrove commented 4 years ago

Fair comment, but I think we need to step carefully and thoughtfully here. "Communicating with BOINC client" is factually correct - although perhaps "Awaiting reply from BOINC client" is closer to reality.

"Loading tasks into memory" might well be a valid reason, but is it true in this case? Might there be other reasons for slow client initialisation - parsing an exceptionally complex set of attached projects, perhaps, or verifying a huge number of project files on slow storage? We should avoid using definitive statements as to causes, before enumerating and eliminating every possible alternative explanation.

Ageless93 commented 4 years ago

> "Loading tasks into memory" might well be a valid reason, but is it true in this case? Might there be other reasons for slow client initialisation - parsing an exceptionally complex set of attached projects, perhaps, or verifying a huge number of project files on slow storage? We should avoid using definitive statements as to causes, before enumerating and eliminating every possible alternative explanation.

It's to simplify what's happening. BOINC is coming on to being 18 years in development and still a lot of people do not know there are separate parts to the program. How many people post just about BOINC Manager, because that's the only thing they see? They don't know, many don't care that there is a client running as well.

In my case it is true that the loading of the data is slow because it comes from a 6TB hard drive which has a raw read speed of 122MB/sec. Perhaps BOINC should index files in its data directory. But just checking a Windows Task Manager will show that both boinc.exe and boincmgr.exe are running and have been for some time, so their slow communication between themselves should get a better explanation. Of course exiting BOINC Manager doesn't necessarily exit the client, but when it does it should also always exit the running tasks. And not leave lots in memory and in a running state.
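As a rough check on the figures quoted in this thread (about 14GB of task memory, 122MB/sec raw read speed), the following back-of-the-envelope sketch estimates how long such a load could take. It assumes, purely for illustration, that all of the data is read sequentially from the hard drive at the raw speed and that nothing is served from cache; the client's real I/O pattern is not described in the thread and may differ.

```python
def load_time_seconds(total_mb, read_mb_per_s):
    """Time to read total_mb megabytes at a sustained read_mb_per_s."""
    return total_mb / read_mb_per_s

# Figures taken from the comments: ~14GB to load, 122MB/sec raw read speed.
# 1GB is treated as 1000MB here (an assumption; the thread does not say).
estimate = load_time_seconds(14 * 1000, 122)
print(f"{estimate:.0f} s")
```

Under these assumptions the load takes on the order of two minutes, which is in the same range as the "over a minute" delay reported above.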

AenBleidd commented 4 years ago

@Ageless93,

> What I want to propose is that instead of the "Communicating with BOINC client, please wait" we put "Loading tasks into memory, please wait" in a window here.

I'm not sure it's technically possible to determine such a state from the Manager without significant architecture changes.

> Perhaps BOINC should index files in its data directory.

I'm not sure indexing could ever help, because I don't know how we could speed up reading from a hard drive, which is actually handled by the OS.

RichardHaselgrove commented 4 years ago

> Of course exiting BOINC Manager doesn't necessarily exit the client, but when it does it should also always exit the running tasks.

That bit I certainly concur with - interrupting/cancelling the client initialisation process should always revert the consequential project initialisations. Having said that, project applications should self-close if they discover they are running boinc-less. Has Rosetta implemented the API calls consequent on https://boinc.berkeley.edu/trac/wiki/AppIntro correctly?

Ageless93 commented 4 years ago

I just turned SMT off, so only have to load 11 tasks. Still takes 39 seconds from client start to fully loaded. I get it that it may be difficult or impossible to determine the state of what's happening without rewriting the architecture. Perhaps for the 20th anniversary.

And while indexing may not help, we're now checking the presence of a lot of project files in the directory. Depending on how many projects someone has added, this could be substantial. What is the checking BOINC does though? Just count the files, prod them, test their data sanity? What does "Checking presence of 783 project files" do? And how long will that take?

I understand we'll always be hampered by the speed of the slowest bit of the hardware. But I still think that putting a more user friendly message down goes a long way towards them patiently waiting until things have loaded. Because didn't we want to make BOINC more user friendly, with simpler messages?

RichardHaselgrove commented 4 years ago

IMHO, the guiding principle should be that, above all else, a message should be accurate.

I much prefer an accurate, but vague, 'BOINC is waiting for an answer' to a precise but false "BOINC is loading projects". Unless you know for certain what is holding it up this time.

Ageless93 commented 4 years ago

I wasn't saying to make it "BOINC is loading projects" as that's as vague as "Communicating with BOINC client". How about another message, like "Finalizing initialization, please wait"?

CharlieFenton commented 4 years ago

> During this time a "communicating with BOINC client, please wait" window sits on top of BOINC Manager, with "Exit BOINC Manager" and "Cancel" buttons. The Cancel button doesn't do anything. It'll just close that window and reopen it. Hitting "Exit BOINC Manager" at this point will exit the manager and client but leave all tasks running. They don't seem to get the boinc_exit() signal. Or ignore it.

The Manager has code to issue RPCs asynchronously, but the client does not. So when the client is busy, it can't respond to an RPC. The Manager can issue some RPCs and continue with other work while waiting for a reply, and it just adds these to a queue as needed. But it has to wait for a response for some RPCs before it can continue; these trigger the "Communicating with client" dialog if no response is received after a certain delay (1.5 seconds in most cases.)
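The blocking case described above can be illustrated with a minimal Python sketch. This is not the Manager's actual code (the Manager is not written in Python); `call_with_wait_dialog`, `rpc` and `show_dialog` are hypothetical names, and only the 1.5 second delay comes from the comment.

```python
import concurrent.futures

DIALOG_DELAY = 1.5  # seconds; the delay quoted in the comment above


def call_with_wait_dialog(rpc, show_dialog, delay=DIALOG_DELAY):
    """Run rpc() in a worker thread and return its reply.

    If no reply arrives within `delay` seconds, call show_dialog()
    once and then keep waiting for the reply.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(rpc)
        try:
            return future.result(timeout=delay)
        except concurrent.futures.TimeoutError:
            show_dialog()           # e.g. "Communicating with BOINC client"
            return future.result()  # block until the client finally answers
```

A busy client that never answers would keep the caller inside the second `future.result()` call, which is the situation the dialog's buttons are meant to escape from.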

The "Cancel" button only cancels the one RPC which triggered the dialog. The dialog keeps reappearing because the Manager has issued another RPC to the client and is again waiting for a client response. The "Exit BOINC Manager" is an emergency exit as a way out of this loop.

The Manager sends a quit RPC to the client, which normally would then shut down all tasks before exiting as part of its shutdown sequence. The Manager waits 10 seconds for the client to shut itself down, after which it forcefully kills the client. But since the client is busy and non-responsive, it does not act on the quit RPC before it is forcefully killed, so the shutdown sequence never happens, leaving the tasks running.
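A minimal sketch of this "ask, wait, then kill" sequence, using Python's `subprocess` module, is shown below. The function `stop_client` and the `send_quit` callback are hypothetical stand-ins for the Manager's quit RPC; only the 10 second grace period is taken from the description above.

```python
import subprocess
import sys


def stop_client(proc, send_quit, grace=10.0):
    """Ask the client to quit, then kill it if it is still running.

    Returns True if the client exited by itself within `grace`
    seconds, False if it had to be killed.
    """
    send_quit()  # stand-in for the quit RPC
    try:
        proc.wait(timeout=grace)
        return True
    except subprocess.TimeoutExpired:
        # The client never acted on the request. Killing it skips its
        # own shutdown sequence, so any tasks it started keep running.
        proc.kill()
        proc.wait()
        return False
```

The `False` branch corresponds to the case reported in this issue: the client is killed before it can process the quit request, so nothing tells the tasks to stop.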

Ideally, the code used in tasks should check for a dead parent process and exit if the parent has died. I don't remember whether or not that is the case. @davidpanderson should be able to answer that.

davidpanderson commented 4 years ago

BOINC apps (i.e., which use the BOINC API) in theory check every 10 seconds to see if the client has died, and exit if so. However, I sometimes see cases where this doesn't happen.
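The periodic check can be sketched generically as follows. This is an illustration only, not the BOINC API: `start_watchdog`, `is_client_alive` and `on_orphaned` are hypothetical names, and how liveness is actually detected is left to the caller. Only the 10 second interval comes from the comment above.

```python
import threading


def start_watchdog(is_client_alive, on_orphaned, interval=10.0):
    """Poll is_client_alive() every `interval` seconds in a daemon thread.

    Calls on_orphaned() once when the client is found dead. Returns an
    Event; setting it stops the watchdog.
    """
    stop = threading.Event()

    def loop():
        # Event.wait() returns False on timeout, True once stop is set.
        while not stop.wait(interval):
            if not is_client_alive():
                on_orphaned()  # a real app would clean up and exit here
                return

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

If the liveness check gives a wrong answer, or the thread running it is itself blocked, the app never notices that the client has gone, which would look like the cases mentioned above where tasks keep running.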

Ageless93 commented 4 years ago

Well, the Rosetta apps were the new 4.20 ones; I don't know which API version they were built with. Their server status page doesn't show the server version, only database version 27016. But it's more up to date than it used to be for a while there.

So I tested something. I ran BOINC Manager, while checking Task Manager details, waited for some Rosetta apps to show there, then exited BOINC Manager. Waited a minute. All those Rosetta tasks were still in memory. Then I restarted BOINC Manager. BOINC then double loads the tasks already running. Something will then kill off the double processes until just one process stays behind. So where I had 15 Rosetta tasks showing in Task Manager details, I now have only 9, exactly the amount of tasks shown running in BOINC Manager.

Ageless93 commented 4 years ago

> Then I restarted BOINC Manager. BOINC then double loads the tasks already running. Something will then kill off the double processes until just one process stays behind. So where I had 15 Rosetta tasks showing in Task Manager details, I now have only 9, exactly the amount of tasks shown running in BOINC Manager.

But for.... drum roll... BOINC runs 4 tasks and the other 5 sit at "waiting to acquire slot directory lock. Another instance may be running." Had to manually kill those 5.
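Why a leftover task can block its replacement can be illustrated with a simple exclusive lock file per slot directory. This is a sketch under assumptions: the file name `lockfile` and both helper functions are hypothetical, and BOINC's real slot locking may work differently.

```python
import os


def acquire_slot_lock(slot_dir, name="lockfile"):
    """Try to take an exclusive lock on slot_dir.

    Returns the lock file path on success, or None if another
    process already holds the lock.
    """
    path = os.path.join(slot_dir, name)
    try:
        # O_EXCL makes creation fail if the file already exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return None
    os.write(fd, str(os.getpid()).encode())  # record the owner's PID
    os.close(fd)
    return path


def release_slot_lock(path):
    """Give the slot back by removing the lock file."""
    os.remove(path)
```

In this model, an orphaned task that never releases its lock leaves the newly started copy with nothing to do but wait, which matches the 5 tasks that had to be killed by hand.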

(Edit: I posted about this on the Rosetta forums)