Improve the out-of-box experience for scientists

davidpanderson commented 3 years ago

Suppose a scientist (let’s call her Mary) needs lots of high-throughput computing and can’t afford the usual sources. Let’s assume that:

Mary’s programs are Linux/Intel executables or Python scripts. She normally runs them on a Linux laptop.
Mary has access to a Linux server on the Internet. She doesn’t necessarily have root access, but can ask a sysadmin to install packages.
Mary knows Linux as a user, but not Docker, databases, web servers, AWS, etc.

Mary hears about volunteer computing and BOINC, and decides to investigate it. Mary will use BOINC only if this initial “out-of-box experience” (OOBE) is positive; i.e. she quickly tries out BOINC and is convinced that it works, that it’s useful to her, and that she wants to use it going forward. The ideal scenario is something like:

Mary hears about BOINC and goes to the web site.
Within ~1 hour she successfully runs jobs, using existing applications, on ~100 volunteer computers.
What she ends up with is something that she can continue to use in production, and to which she can add other applications, GPU apps, larger volumes of jobs and data, BOINC features like result validation, etc.

The current BOINC OOBE doesn’t achieve this. The main BOINC server documentation (https://boinc.berkeley.edu/trac/wiki/ProjectMain) is a sprawling mess. Marius’ Docker work (https://github.com/marius311/boinc-server-docker/blob/master/docs/cookbook.md) is a big step in the right direction, but more is needed to complete the above scenario.

BOINC competes with systems like HTCondor and AWS. We should study the OOBEs of these systems, borrow their good ideas, and make sure that we’re competitive. See, for example, https://www.youtube.com/channel/UCd1UBXmZIgB4p85t2tu-gLw

The goal

The following is a sketch of what I think the OOBE should be like. The target configuration involves:

A “server host”. This runs a BOINC server, as a set of Docker containers. It must be on a machine visible to the outside Internet, possibly a cloud instance.
One or more “job submission hosts”. Scientists log in to these to do their work. They may be behind a firewall.

Setting up the server host

This involves downloading a .gz file containing the BOINC server software and some VM and Docker images. Then you run a script that asks one or two questions, then creates and runs a server (as Docker processes). It creates a read-me file saying:

How to make the server start on boot (edit /etc files).
Where the config files are in case you need to change something later.

Admin functions (start/stop server, create accounts for job submitters) are done through a web interface. After the initial setup there should be no need to log in.

Setting up a job submission host

This involves installing a package that contains job submission scripts (see below) but not the BOINC server.

Running jobs

We should handle at least two cases:

The scientist has an executable and the libraries it needs.
The scientist has a Python script and the modules it needs.

In each case, let’s assume that all files for an app are stored in a directory.

To submit a job:

boinc_run --app app_dir_path

Run this in a directory containing input files. It makes a job with those input files, running the given app. The file “cmdline”, if present, contains command-line args.
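As a rough illustration of the convention just described, here is a minimal Python sketch of how a tool like boinc_run might gather a job from a directory. The function name collect_job and the returned dictionary layout are hypothetical, and treating “cmdline” as something other than an input file is an assumption; the proposal does not specify either.

```python
import shlex
from pathlib import Path


def collect_job(job_dir, app_dir):
    """Describe one job: the app, its input files, and its command-line args.

    Hypothetical helper: the name and the dict layout are illustrative only.
    """
    job_dir = Path(job_dir)

    # The optional "cmdline" file holds the command-line arguments.
    cmdline_file = job_dir / "cmdline"
    args = shlex.split(cmdline_file.read_text()) if cmdline_file.is_file() else []

    # Assumption: every other regular file in the directory is an input file.
    inputs = sorted(
        p.name for p in job_dir.iterdir()
        if p.is_file() and p.name != "cmdline"
    )
    return {"app": str(app_dir), "inputs": inputs, "args": args}
```

Using shlex.split means quoted arguments in “cmdline” are handled the way a shell would handle them, which seems closest to what a Linux user would expect.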

To run multiple jobs, create a directory for each job, and put input files there. Then do

boinc_run_jobs --app app_dir_path dir1 dir2 ...

To see the status of the job(s) started in the current directory:

boinc_status

If the job failed, show info like stderr output.

To abort jobs started in the current directory:

boinc_abort

To fetch the output files of completed jobs started in the current directory:

boinc_fetch

Note: fancier features can be added to this, but the basic features are ultra-simple. No XML editing, estimating job sizes, etc.

Implementation

The implementation shouldn’t be that hard. It’s based on technology we already have: boinc-server-docker and boinc2docker, and the remote job and file management mechanisms.

The server host setup script creates a BOINC project running in Docker containers, equipped with the VBox-based universal app, and some standard Docker containers, e.g. for Python apps.

On the submission host, each user has a directory ~/.boinc to contain various configuration and status files. A file ~/.boinc/apps contains a list of applications that have been used. Each one is identified by a directory path. We keep track of the mod time of the directory and the files in it; we maintain a Docker layer corresponding to the application.
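The mod-time bookkeeping could look something like the following Python sketch. It assumes the registry is a JSON file mapping each app directory path to the newest modification time seen at the last build; that format, and the names latest_mtime, needs_rebuild and record_build, are assumptions made for illustration, not part of the proposal.

```python
import json
import os
from pathlib import Path


def latest_mtime(app_dir):
    """Newest modification time of app_dir and everything under it."""
    newest = os.stat(app_dir).st_mtime
    for root, dirs, files in os.walk(app_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            # Do not follow symlinks, so a dangling link cannot raise.
            newest = max(newest, os.stat(path, follow_symlinks=False).st_mtime)
    return newest


def _load_registry(registry_path):
    """Read the registry file, or return an empty registry if it is missing."""
    path = Path(registry_path)
    return json.loads(path.read_text()) if path.is_file() else {}


def needs_rebuild(app_dir, registry_path):
    """True if the app is unknown or has changed since it was recorded."""
    registry = _load_registry(registry_path)
    return registry.get(os.path.abspath(app_dir)) != latest_mtime(app_dir)


def record_build(app_dir, registry_path):
    """Remember the app's current state after its Docker layer is built."""
    registry = _load_registry(registry_path)
    registry[os.path.abspath(app_dir)] = latest_mtime(app_dir)
    Path(registry_path).write_text(json.dumps(registry, indent=2))
```

Adding, removing, or renaming a file updates the mod time of its parent directory, which is why the directories themselves are included in the scan.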

The boinc_run command (a Python script) does the following:

Check ~/.boinc/apps to see whether we have a Docker layer for the app. If not, build one using boinc2docker.
Use the remote file management mechanism to copy files (app and input) to the Apache container.
Use the remote job submission mechanism to submit the job. Write its ID to a file.

boinc_status etc. use the remote job submission mechanism.
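The three boinc_run steps listed above can be sketched as follows. This is only an outline under stated assumptions: build_layer, upload_files and submit_job are hypothetical callables standing in for boinc2docker and the remote file management and job submission mechanisms, whose real interfaces are not shown here. The file name “job_id” and the in-memory known_apps dictionary are likewise illustrative.

```python
from pathlib import Path


def boinc_run_sketch(app_dir, job_dir, known_apps,
                     build_layer, upload_files, submit_job):
    """Outline of the proposed boinc_run flow, with all backends injected.

    known_apps maps an app directory path to its Docker layer identifier.
    The three callables are hypothetical placeholders, not real BOINC APIs.
    """
    app_key = str(Path(app_dir).resolve())
    job_dir = Path(job_dir)

    # Step 1: build a Docker layer only if this app has not been seen before.
    if app_key not in known_apps:
        known_apps[app_key] = build_layer(app_key)
    layer = known_apps[app_key]

    # Step 2: copy the input files to the server.
    # (Skip the job ID file that an earlier run may have left behind.)
    inputs = sorted(
        p for p in job_dir.iterdir() if p.is_file() and p.name != "job_id"
    )
    upload_files(inputs)

    # Step 3: submit the job and record its ID for boinc_status and friends.
    job_id = submit_job(layer, [p.name for p in inputs])
    (job_dir / "job_id").write_text(f"{job_id}\n")
    return job_id
```

Writing the ID into the job directory is what would let boinc_status, boinc_abort and boinc_fetch operate on “jobs started in the current directory” without any arguments.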

Computing resources

The scientist starts by running the BOINC client on one or more of their own computers (possibly Windows or Mac), and attaching to the project.

When things are working and they’re ready to scale up, they register with Science United, supplying their keywords. The vetting process may take a day or two. This will typically provide them with several hundred hosts.

Another possibility is to allow Science United users to register as “testers”, and to add a mechanism where projects can register as “test projects” on SU, with no vetting. Such projects would be allowed only to use VM apps with no network access (we’d need to add a mechanism for this). They’d get some number of hosts (50-100) for a few days.

Restructuring server documentation

Once we have this working, we need to reorganize the server docs in such a way that scientists are initially steered toward the OOBE described here, but can still access lower-level info.

Ageless93 commented 3 years ago

When things are working and they’re ready to scale up, they register with Science United, supplying their keywords.

Is the future for new projects using BOINC to only (automatically) work together with Science United? Also because Mary doesn't know anything about web servers.

cminnoy commented 3 years ago

I totally agree there should be simple, easy ways to submit a single job and multiple jobs, but first of all, Mary will need an easy way of setting up a project server (locally or, if she chooses, remotely in the cloud). Mary wants decent documentation, and preferably a high-quality book that guides her through all the steps, provides lots of examples and recipes on how to cook her server and project, and has chapters on how to create her own tasks for different platforms. Mary is used to seeing how easy it is for her colleague to spawn new tasks on AWS or similar, so she would like to do the same with BOINC, but much better, easier, and with higher performance. Only then will Mary have time to take care of her little lamb.

I like Docker, but I'm wondering whether running Docker images inside VirtualBox is very efficient. Simply doing a 'Hello World' takes a vast amount of resources on the client side. The client has to download many megabytes that don't contribute to the task at hand, many CPU cycles are wasted on emulating and booting the ISO, and disk space is wasted. And VirtualBox is not OK for GPU computing (which should also be made easy for Mary). Yes, it's kind of easy, but efficiency should matter too. Why not also look into running tasks under WSL, or simply spawning the Docker images directly on the client machine (if Linux-based)?

smoe commented 2 years ago

There was a time (about 8 years ago) when the Debian package that created BOINC project servers was not completely useless. I would very much love to see this revived. But I would also love someone else to address this :)

AenBleidd commented 2 years ago

Converting this to Conversation since it's a big topic to discuss before creating any particular tasks to be implemented.
