BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Client starts set of jobs too large for system memory #5641

Closed davidpanderson closed 3 weeks ago

davidpanderson commented 3 weeks ago

(This problem was reported by Glenn from CPDN)

For each runnable job we have several estimates of its future WSS (working set size):

a) the project-supplied value rsc_memory_bound
b) APP_VERSION::max_working_set_size: the max measured WSS of jobs using this app version since the client started (not saved to the state file)
c) if the job has already run, ACTIVE_TASK::working_set_size_smoothed: a recent average (on the order of 1 min) of its WSS

Current policy: in job scheduling, for job WSS we use:
if the job has run: c)
else if b) is nonzero: b)
else: a)
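A minimal sketch of that rule (the function name and scalar parameters here are illustrative, not the actual client code, which works on its RESULT/ACTIVE_TASK structures):

```cpp
// Sketch of the current WSS-estimate selection; parameter names follow
// the fields cited above.
double wss_estimate_current(
    bool has_run,                     // has this job run at all?
    double rsc_memory_bound,          // a) project-supplied bound
    double max_working_set_size,      // b) per-app-version max measured WSS
    double working_set_size_smoothed  // c) this job's recent average WSS
) {
    if (has_run) return working_set_size_smoothed;              // c)
    if (max_working_set_size > 0) return max_working_set_size;  // b)
    return rsc_memory_bound;                                    // a)
}
```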

Problem: CPDN jobs run for a few minutes with a small WSS (say, 1 MB), then grow to their full WSS (say, 6 GB). There are various scenarios in which this leads to problems. For example: suppose the host has 16 GB of RAM and 8 cores, and it gets 8 CPDN jobs. It starts 2 of them, and their WSS is measured as 1 MB. On the next reschedule the client starts 2 more CPDN jobs. Eventually all 4 jobs expand to their 6 GB WSS: 24 GB in total, which is bigger than RAM and maybe bigger than swap space. Some of the jobs fail with memory allocation errors.
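A toy walk-through of that scenario with the numbers above (the client's real scheduling pass is more involved; this just shows the arithmetic):

```cpp
#include <cstdio>

int main() {
    const double ram_gb   = 16.0;   // host RAM
    const double full_gb  = 6.0;    // full CPDN working set (= rsc_memory_bound here)
    const double early_gb = 0.001;  // smoothed WSS measured early on (~1 MB)

    // Pass 1: nothing has run yet, so each job is estimated at
    // rsc_memory_bound = 6 GB; only 2 of the 8 jobs fit in 16 GB.
    int running = 2;

    // Pass 2: the running jobs now report ~1 MB each, so the client
    // thinks almost all of RAM is free and starts 2 more (per the
    // scenario above).
    double estimated_gb = running * early_gb;  // ~0.002 GB "in use"
    running += 2;

    // Eventually every running job expands to its full working set.
    double needed_gb = running * full_gb;
    printf("estimate at reschedule: %.3f GB; eventual need: %d x %.0f GB = %.0f GB; RAM: %.0f GB\n",
           estimated_gb, running, full_gb, needed_gb, ram_gb);
    // -> 4 x 6 GB = 24 GB > 16 GB: allocations start failing.
    return 0;
}
```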

Solution: change the WSS policy to: if the job has run: max(a, c); else: max(a, b).
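The same sketch with the proposed rule: a) now acts as a floor, so an early, temporarily small measurement can never pull the estimate below the project-supplied bound (again illustrative, not the actual client code):

```cpp
#include <algorithm>  // std::max

// Sketch of the proposed WSS-estimate selection.
double wss_estimate_proposed(
    bool has_run,
    double rsc_memory_bound,          // a)
    double max_working_set_size,      // b)
    double working_set_size_smoothed  // c)
) {
    if (has_run)
        return std::max(rsc_memory_bound, working_set_size_smoothed);  // max(a, c)
    return std::max(rsc_memory_bound, max_working_set_size);           // max(a, b)
}
```

With the numbers from the scenario above, a CPDN job that has run briefly is now estimated at max(6 GB, 1 MB) = 6 GB, so the client won't start more of them than RAM can hold.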

Note: what happens if the project's rsc_memory_bound is wildly wrong?
If too large: the client may run fewer jobs for that project.
If too small: same as the current policy: the client may run too many jobs, and they may fail with malloc failures (and possibly cause jobs of other projects to fail).