BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0
2.03k stars 449 forks

In panic mode, GPUs end up idle #5079

Open hucker75 opened 1 year ago

hucker75 commented 1 year ago

Describe the bug

I have a machine with 2 GPUs and a 24-thread CPU. I left it running on just one project (Einstein) for a week while I was away. I came back to find it not using the GPUs because it had too much CPU work. Since I've told it to allocate a CPU thread to each GPU task, it couldn't fit them all on the CPU, and I see 24 CPU tasks running "high priority" and no GPU tasks running. Shouldn't the GPUs be given priority in this situation, as they do a lot more work?

The excess CPU work was not my doing: I set a one-day buffer and it ended up with 20 days of work (a different problem not under discussion here).

Steps To Reproduce

  1. Get too much CPU work
  2. Have the GPU set to allocate a CPU thread to its tasks in app_config.

Expected behavior

Run the GPU tasks, even if some CPU work does not start in time.

System Information

AenBleidd commented 1 year ago

GPU and CPU tasks are treated equally, so I assume it's by design.

RichardHaselgrove commented 1 year ago

'Earliest Deadline First' means exactly what it says on the tin. At Einstein, GPU versions of tasks tend to be given the same deadlines as CPU versions. Without knowing the specific research lines of the tasks involved, I can't comment - but if the earliest deadline belongs to a CPU task, it will be run first.
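To illustrate how strict deadline ordering can leave GPUs idle, here is a toy sketch in Python. It is not the BOINC client's scheduler code: the `Task` class, the `schedule_edf` function and the task counts and deadlines are all made up for illustration. It only assumes that every task, GPU or CPU, reserves one CPU thread, as in the report above.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    deadline_days: float  # days until the deadline (hypothetical values)
    needs_gpu: bool       # True for a GPU task
    cpu_threads: int = 1  # CPU threads reserved by the task

def schedule_edf(tasks, cpu_threads, gpus):
    """Toy earliest-deadline-first pass: start tasks in deadline order
    until CPU threads (and GPUs, for GPU tasks) run out."""
    running = []
    for task in sorted(tasks, key=lambda t: t.deadline_days):
        if task.cpu_threads > cpu_threads:
            continue  # no CPU thread left for this task
        if task.needs_gpu:
            if gpus == 0:
                continue  # no free GPU
            gpus -= 1
        cpu_threads -= task.cpu_threads
        running.append(task)
    return running

# Assumed workload: 30 CPU tasks due sooner than 4 GPU tasks.
cpu_tasks = [Task(f"cpu{i}", 1.0, needs_gpu=False) for i in range(30)]
gpu_tasks = [Task(f"gpu{i}", 2.0, needs_gpu=True) for i in range(4)]

running = schedule_edf(cpu_tasks + gpu_tasks, cpu_threads=24, gpus=2)
gpu_running = sum(t.needs_gpu for t in running)
print(len(running), gpu_running)
```

In this toy model the 24 earliest-deadline CPU tasks take every thread, so no GPU task can reserve the CPU thread it needs and both GPUs sit idle.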

I have written in the past about over-commitment of CPU tasks, especially at Einstein. In every case I've examined, the fault has been that the BOINC client has requested too much work - the same amount - again and again and again, once per minute (*), until it reaches the server limit of tasks sent. It's a CLIENT runaway, but very rare, and very hard to track down.

hucker75 commented 1 year ago

As I said above, it needs changing. I'd rather run 22 threads on CPU work and have 2 helping out much faster GPUs, than run 24 threads of CPU work and have idle GPUs. Either way makes full use of my CPU, but one leaves GPUs idle.

I propose that in panic mode it should fill the GPUs first, as it does when not panicking. If some CPU tasks don't get done in time then so be it. The total amount of work I do is what's most important.
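A rough sketch of what I mean, in Python. This is only an illustration of the proposal, not BOINC code: `schedule_gpu_first`, the tuple layout and the task counts are invented, and it assumes each task reserves exactly one CPU thread.

```python
def schedule_gpu_first(tasks, cpu_threads, gpus):
    """tasks: list of (name, deadline_days, needs_gpu) tuples.
    Fill the GPUs first, then give the remaining CPU threads to CPU
    tasks, each group taken in earliest-deadline order."""
    by_deadline = sorted(tasks, key=lambda t: t[1])
    running = []
    # Pass 1: GPU tasks, each taking one GPU and one CPU thread.
    for name, _deadline, needs_gpu in by_deadline:
        if needs_gpu and gpus > 0 and cpu_threads > 0:
            running.append(name)
            gpus -= 1
            cpu_threads -= 1
    # Pass 2: CPU tasks on whatever threads are left.
    for name, _deadline, needs_gpu in by_deadline:
        if not needs_gpu and cpu_threads > 0:
            running.append(name)
            cpu_threads -= 1
    return running

# Assumed workload: 30 CPU tasks due sooner than 4 GPU tasks.
tasks = [(f"cpu{i}", 1.0, False) for i in range(30)]
tasks += [(f"gpu{i}", 2.0, True) for i in range(4)]

running = schedule_gpu_first(tasks, cpu_threads=24, gpus=2)
gpu_count = sum(name.startswith("gpu") for name in running)
cpu_count = len(running) - gpu_count
print(gpu_count, cpu_count)
```

With 24 threads and 2 GPUs this gives 2 GPU tasks plus 22 CPU tasks, which is the split I described above. All 24 threads are still busy, but the GPUs are no longer idle.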

I'm not sure why I got too much Einstein work this time, as I was on holiday. But I've seen the client runaway, and also the problem of outdated server software (PrimeGrid is one project guilty of this) not allowing different estimated times for CPU and GPU. According to them and other projects, updating the BOINC server is not a simple task.