BOINC may not use all CPUs in some cases

sirzooro commented 7 years ago

I am cleaning up my work queue before the next PrimeGrid challenge, and found a case where the BOINC client does not run tasks on all available cores. Right now I have 3 rosetta@home tasks running on 3 of 8 available CPUs. There are also some ATLAS@Home and Cosmology@Home tasks waiting, but they require 7 or 8 CPUs per WU. Most projects are set to not download new tasks, except for one with zero resource share. It looks like BOINC only checks whether any other tasks are available in the queue, and does not try to download new ones from the zero-resource-share project while downloaded tasks are waiting. This is wrong: it should also check the CPU count those waiting tasks require and compare it with the number of currently free CPUs, to eliminate cases like this.

I suspect that other similar cases may also exist, e.g. when some tasks are waiting but there is not enough memory to run them. Please take a look at them too.
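
To illustrate the check being proposed, here is a minimal sketch. It is not BOINC's actual work-fetch code: `QueuedTask`, `should_fetch_backup_work`, and the memory figures are hypothetical names and values chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class QueuedTask:
    name: str
    ncpus: float   # CPUs the task needs before it can start
    mem_gb: float  # estimated memory the task needs (made-up values below)


def should_fetch_backup_work(queued, free_cpus, free_mem_gb):
    """Return True when CPUs are idle and no queued task can use them."""
    if free_cpus <= 0:
        return False  # nothing is idle, so there is no reason to fetch
    can_start = any(
        t.ncpus <= free_cpus and t.mem_gb <= free_mem_gb for t in queued
    )
    return not can_start


# Situation from the report: 3 of 8 CPUs busy, waiting tasks need 7 or 8 CPUs.
queue = [QueuedTask("ATLAS", 8, 6.0), QueuedTask("Cosmology", 7, 2.0)]
print(should_fetch_backup_work(queue, free_cpus=5, free_mem_gb=16.0))  # True
```

The point is only that "the queue is not empty" and "something in the queue can actually start" are different conditions, and the second one is what should decide whether the zero-resource-share project is asked for work.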

Windows 10 64bit, BOINC 7.6.33

Edit: there is one more case. I suspended the rosetta project and BOINC started crunching one Cosmology WU. It finished it and started an ATLAS WU. That one required more memory, so it stopped working (status is Waiting for memory). Now BOINC does not use any CPU (except for the small fraction reserved for GPU and NCI tasks), even though there are other Cosmology tasks ready to start.

sirzooro commented 7 years ago

There are also two cases where a GPU may not have work. I am not sure if they belong in this issue, but they look related:

on systems with multiple GPUs, some of them may not get work if all CPUs are busy. Details and logs from someone with 4 Titans are here: https://boinc.berkeley.edu/dev/forum_thread.php?id=10746. I also had a similar problem with my 2 GPUs and fixed it in the same way: I created an app_config for the GPU apps to reduce the required CPU to a small value like 0.01;
a similar problem also exists with GPU tasks that need multiple GPUs. The Moo! Wrapper project sends such WUs; it sent me ones that needed both of my 2 GPUs. For some reason the presence of such tasks in the work queue was also a problem for the scheduler, which sometimes assigned work to only 1 GPU. All other GPU apps were configured to use a small fractional CPU share, so it looks like something related to these Moo! Wrapper tasks. When I finished crunching all downloaded WUs, BOINC started working as expected again.

This was observed on a previous Windows BOINC version (I do not remember exactly which, 7.6.23?). I did not try to reproduce it on the current version.

davidpanderson commented 7 years ago

If possible, see if you can reproduce scheduling problems on the BOINC Client Emulator: http://boinc.berkeley.edu/dev/sim_web.php This makes it 100x easier for me to fix them.

-- David

sirzooro commented 7 years ago

Thanks for the link. I will try to play with it a bit.

sorcrosc commented 7 years ago

This also happens when an option in app_config.xml is used to limit the number of tasks one project runs simultaneously. If BOINC has plenty of workunits for that project, it doesn't request more work from the others, and some cores remain idle.

sirzooro commented 7 years ago

One more issue, just reported on WUProp forum:

Just in case anyone encounters the same issue. A couple of inactive NCI projects prevented me from getting any work for any hardware on one system today. BM (7.6.33 [x64]) event log: Not requesting tasks: don't need (CPU: not highest priority project; Miner ASIC: not highest priority project; NVIDIA GPU: not highest priority project)

Had run out of Asic & GPU work and was about to run out of CPU work (only 2/7 logical cores being used). BM just kept asking for nci work (PoD style) & ignored the other projects/devices completely. Serious scheduler bug IMO + stupid error message (CPU isn't a project, even if the code deludes itself into thinking otherwise).

Toby-Broom commented 7 years ago

I see the same as sorcrosc on LHC, as they have a job limit of 24. If I set this project to unlimited for the Sixtrack app, it queues tasks based on BOINC's cache settings. If I set it to 24, it queues no tasks: it just runs up to that limit, and when one task finishes it gets another.

sirzooro commented 7 years ago

One more case (maybe a duplicate of one already mentioned): DENIS is performing some maintenance work now and it sends WUs, but the input files cannot be downloaded, so the WUs end with "download error". This somehow prevents downloading WUs from Asteroids, my backup project. I saw this in the log when I tried to manually update the project to download new WUs:

300324 Asteroids@home 2017-06-23 07:24:28 Sending scheduler request: Requested by user.
300325 Asteroids@home 2017-06-23 07:24:28 Not requesting tasks: don't need (not highest priority project)

It looks like these faulty DENIS WUs prevented downloads from the backup project. I had 16 of them in the queue. The remaining 16 CPUs were getting WUs from Asteroids as expected. This was on BOINC 7.6.22 for Linux.

sirzooro commented 7 years ago

One more case, and this one is interesting. I am crunching "GFN-13 Prime Search" from "PRIVATE GFN SERVER" (run by stream, https://www.primegrid.com/forum_thread.php?id=6511). One of the results for a completed WU could not be uploaded, and somehow this prevented downloading new WUs from this project, so the BOINC client switched to the backup project. This is what I found in the log:

225112  PRIVATE GFN SERVER  2017-07-27 17:32:53 Requesting new tasks for CPU    
225113  PRIVATE GFN SERVER  2017-07-27 17:32:59 Scheduler request completed: got 0 new tasks    
225114  PRIVATE GFN SERVER  2017-07-27 17:32:59 Result gfn13_72132256_1499672386_1 is no longer usable  
225115  PRIVATE GFN SERVER  2017-07-27 17:32:59 No tasks sent

I aborted this upload and requested a project update. After that, new WUs were downloaded without problems:

226443  PRIVATE GFN SERVER  2017-07-27 20:04:54 update requested by user    
226444  PRIVATE GFN SERVER  2017-07-27 20:04:56 Sending scheduler request: Requested by user.   
226445  PRIVATE GFN SERVER  2017-07-27 20:04:56 Reporting 1 completed tasks 
226446  PRIVATE GFN SERVER  2017-07-27 20:04:56 Requesting new tasks for CPU    
226447  PRIVATE GFN SERVER  2017-07-27 20:05:01 Scheduler request completed: got 15 new tasks

I am not sure whether this is a problem with the client or the server; it may be on either side.

Toby-Broom commented 7 years ago

Another example: if a task goes into the "VM unmanageable" state, BOINC depletes the queued tasks and just sits there with 1 bad task until you abort it; then it reloads n tasks.

sirzooro commented 7 years ago

And the next one: I configured one project via app_config.xml to use 22 out of 32 cores. The remaining 10 were left for another project with very short tasks. That second project also has a very limited WU supply, so BOINC was not able to build a buffer for it. As a result, BOINC kept downloading tasks from the first project until it filled the work queue. At that point it stopped trying to download tasks from the second project because the queue was full, so the 10 cores reserved for it sat idle.
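
The arithmetic behind this case can be sketched as follows. This is only an illustration under stated assumptions: every task is taken to be single-core, `idle_cores` is a hypothetical helper rather than anything in the BOINC client, and the figure of 500 queued tasks is made up.

```python
def idle_cores(total_cores, projects):
    """Count cores left idle given per-project concurrency caps.

    projects: list of (max_concurrent, queued_tasks) pairs, where
    max_concurrent is None for a project without a cap.
    Assumes every task uses exactly one core.
    """
    busy = 0
    for max_concurrent, queued in projects:
        if max_concurrent is None:
            busy += queued  # uncapped: every queued task may run
        else:
            busy += min(max_concurrent, queued)
    return max(total_cores - busy, 0)


# First project capped at 22 with a full buffer, second project has no tasks.
print(idle_cores(32, [(22, 500), (None, 0)]))  # 10
```

However much work the capped project has buffered, it cannot occupy more than 22 cores, so a full queue does not mean the machine is fully used.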

davidpanderson commented 7 years ago

Can you reproduce this on the client emulator? https://boinc.berkeley.edu/dev/sim_web.php That makes it easier for me to fix the problem.

Toby-Broom commented 7 years ago

My PC hit the VM unmanageable state. Here is a sim with the required files; I didn't check whether the sim itself was blocked: https://boinc.berkeley.edu/dev/sim_web.php?action=show_simulation&scen=154&sim=0

Here is one with the 24 job limit: https://boinc.berkeley.edu/dev/sim_web.php?action=simulation_form&scen=155

BOINC / boinc

BOINC may not use all CPUs in some cases #1775