Better resource utilization through load average monitoring #1333

Open BuildStream-Migration-Bot opened 3 years ago

See original issue on GitLab

In GitLab by [Gitlab user @lle-bout] on Jun 1, 2020, 07:15

GNU make has -l for limiting the number of jobs according to the load average. See: https://www.gnu.org/software/make/manual/html_node/Parallel.html

Gentoo's emerge has --load-average with a similar effect. See: https://wiki.gentoo.org/wiki/EMERGE_DEFAULT_OPTS

I am thinking BuildStream could take advantage of such an option to spawn more build jobs in parallel until the load average reaches a configurable value.

It has the advantage of being quite trivial to implement without any changes or manual tagging of build recipes. In short, an easy performance win.
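A minimal sketch of the check such an option implies, assuming a scheduler that asks before starting each job. `can_spawn` and its parameters are hypothetical names, not BuildStream API. It follows the rule documented for make's -l: no new job is started if other jobs are running and the load average is at least the limit.

```python
import os


def can_spawn(max_load, running_jobs, getloadavg=os.getloadavg):
    """Return True if another job may be started.

    max_load     -- configurable load average limit (hypothetical option)
    running_jobs -- number of jobs currently running
    getloadavg   -- injectable for testing; defaults to os.getloadavg (Unix only)
    """
    # Always allow one job, otherwise a busy machine would never make progress.
    if running_jobs == 0:
        return True
    try:
        one_minute, _, _ = getloadavg()
    except OSError:
        # Load average unavailable: be conservative and do not add jobs.
        return False
    return one_minute < max_load
```

A scheduler would call this each time a job slot is considered, so parallelism grows while the machine is idle and stops growing once the limit is reached.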

In GitLab by [Gitlab user @lle-bout] on Jun 1, 2020, 07:16

changed the description

In GitLab by [Gitlab user @lle-bout] on Jun 1, 2020, 07:17

changed the description

In GitLab by [Gitlab user @tristanvb] on Jun 1, 2020, 08:43

marked this issue as related to #185

In GitLab by [Gitlab user @tristanvb] on Jun 1, 2020, 08:43

marked this issue as related to #633

In GitLab by [Gitlab user @tristanvb] on Jun 1, 2020, 09:06

The current approach is the max-jobs user configuration option and the accompanying --max-jobs command line option. The value is exported as a hint to BuildElement implementations, which can in turn pass it on to their build systems (the make element uses this to set -j %{max-jobs}).
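To illustrate how a hint like this reaches the build system, here is a small sketch of `%{name}` substitution in a command string. The `expand` helper is hypothetical and is not BuildStream's variable resolution code.

```python
import re


def expand(command, variables):
    """Replace each %{name} in command with the matching value.

    Raises KeyError if a referenced variable is not defined.
    """
    return re.sub(
        r"%\{([a-zA-Z0-9_-]+)\}",
        lambda match: str(variables[match.group(1)]),
        command,
    )


print(expand("make -j %{max-jobs}", {"max-jobs": 8}))  # make -j 8
```

The point is that the limit is a fixed number chosen up front, which is what a load-average option would make dynamic.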

This sort of thing has been discussed a lot. Surprisingly, I don't think we already have an issue open for this in particular :)

First of all, this is related to #185 and #633.

Off the top of my head, I can think of a couple of approaches which have mostly come up in the past.

Standardized job server

A job server is a simple token system which distributes tokens to active jobs: a scheduler requests a token from the job server and waits until one is available before launching a job. GNU Make implements one.
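The token idea can be sketched with a pipe preloaded with one byte per token, which is similar in spirit to how GNU Make's job server hands out tokens. `TokenPool` is a hypothetical name and this is not GNU Make's exact protocol.

```python
import os


class TokenPool:
    """A pipe holding one byte per available job token."""

    def __init__(self, tokens):
        self._read_fd, self._write_fd = os.pipe()
        # Preload the pipe: each byte is one permission to run a job.
        os.write(self._write_fd, b"+" * tokens)

    def acquire(self):
        """Block until a token is available, then take it."""
        return os.read(self._read_fd, 1)

    def release(self, token=b"+"):
        """Return a token so another job can start."""
        os.write(self._write_fd, token)

    def close(self):
        os.close(self._read_fd)
        os.close(self._write_fd)
```

Because the tokens live in file descriptors, child processes that inherit them can share the same pool, which is what lets nested builds cooperate on one global limit.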

One might, however, imagine a system where BuildStream implements a standardized job server that integrates with any BuildElement implementation supporting some API, for example by exposing a socket within the execution Sandbox.

BuildStream does not have any preference for a specific build system, and it will be impossible to really support every build system, as not every build system has support for a job server (it is up to a tool like make to decide to support something like the make job server).

This approach to the problem is also complicated by remote execution, and might require additions to the standard REAPI to pursue at all.

Hard coded limitations and resource token attribution

Specifically in relation to #185, one has to consider not only the available processing power on a system but also the available memory. We often run into situations where handing out too many parallel jobs causes builds (like WebKit for instance) to fail at various link stages due to OOM scenarios.

We've found that usually a safe assumption for a build is that you might need 2G of RAM on the system for every process you allow a build to run in parallel (of course mileage may vary but this is generally a safe bet).
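As a rough sketch of that rule of thumb, the parallelism limit can be taken as the smaller of the CPU count and the number of 2G slices of RAM. `safe_max_jobs` and its parameters are hypothetical names, and the 2G figure is the estimate from this comment rather than a measured value.

```python
GIB = 1024 ** 3


def safe_max_jobs(cpu_count, total_ram_bytes, ram_per_job_bytes=2 * GIB):
    """Limit parallel processes by both CPU count and available RAM."""
    by_memory = total_ram_bytes // ram_per_job_bytes
    # Never go below one process, or the build could not run at all.
    return max(1, min(cpu_count, by_memory))


print(safe_max_jobs(16, 8 * GIB))  # 4
```

On Linux the inputs could plausibly come from `os.cpu_count()` and `os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")`, though that is an assumption about the host platform.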

Along the same lines, we could allow users to attribute a very simple weight to a given build (a job in BuildStream terminology, which might itself consist of many parallel jobs).

In this scenario we might say that a build requires 2 units by default, which might mean 4G of RAM and 2 available processors or threads, but allow users to increase or decrease the number of units required.
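A minimal sketch of unit-based admission, assuming one unit stands for one processor plus 2G of RAM as in the example above. `UnitBudget` and its methods are hypothetical names, not an existing BuildStream interface.

```python
class UnitBudget:
    """Track how many resource units running builds have claimed."""

    def __init__(self, total_units):
        self.total = total_units
        self.in_use = 0

    def try_start(self, units=2):
        """Claim units for a build; return False if it must wait."""
        if self.in_use + units > self.total:
            return False
        self.in_use += units
        return True

    def finish(self, units=2):
        """Give back the units claimed by a finished build."""
        self.in_use -= units


budget = UnitBudget(total_units=8)
print(budget.try_start())         # True: default weight of 2 units
print(budget.try_start(units=8))  # False: only 6 units remain
```

A real scheduler would also need to handle a build whose weight exceeds the total budget, which this sketch would never admit.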

This approach is also interesting for remote execution, where we need to ensure that we don't exhaust resources on workers in a remote execution cloud. I'm not sure how this would play into REAPI and related tooling like BuildGrid, BuildBox, and Bazel.
