BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

Integration with slurm / GridEngine / Torque / you-name-it queueing system #2728

Open smoe opened 5 years ago

smoe commented 5 years ago

Hello, is there a humane way to submit BOINC tasks as jobs in slurm/... ? Nodes have Internet access and a shared home directory. I would not mind if boinccmd submitted those jobs on my behalf, but what I had in mind was to submit boinccmd itself within a bash script for a certain number of tasks (likely 1), and it would be my responsibility to reserve sufficient time for it. Many thanks, Steffen

davidpanderson commented 5 years ago

Please clarify - do you mean that jobs are submitted via slurm and processed by BOINC, or vice versa? Also, how is boinccmd involved? It's not part of the job submission mechanism.

lfield commented 5 years ago

Are you referring to back filling your batch system with jobs or something else?

smoe commented 5 years ago

@lfield, yes, this kind of describes it. Ideally, the system administrators would run boinc in the background which would then idle itself whenever there is load. But this is often not desired, let alone for benchmarking some new algorithm when you do not want variable loads in the background. So, if I could submit BOINC tasks to our queue then this would settle it.

@davidpanderson, I admit I did not know exactly what I was after. I just saw resources at my disposal that I cannot reach with the client we have. I would need something that knows how to execute a job and return the results (or buffer them in the home directory) so that they can be submitted once the job has completed. If I could run boincd interactively then this may be just fine. But you may have different ideas.

There are side-issues to it. I was proposing a buffer of results for instance that would collect the results from multiple machines. This may be more efficient than contacting the server for individual jobs. If that is truly more efficient, maybe that would also allow boinc to scale a bit more by introducing distributed.net-like proxy-buffers to integrate submissions from multiple users?

lfield commented 5 years ago

I think pilot jobs or job agents would work for your use case. Create a job that will configure and start the boinc client as you wish. You can then code your termination condition. Set no new tasks if you want this to be graceful. Playing with your batch scheduling, boinc client preferences, and termination condition should give you the flexibility to do what you want. If you do come up with a generic job, you may wish to share it.

lfield commented 5 years ago

@smoe would pilot jobs be a solution for you?

smoe commented 5 years ago

I felt a bit overwhelmed. I have read through http://saga-project.github.io/BigJob/ (googling for "pilot job" is not easy) to grasp what you may mean. My hurdle is that I do not see the command line to submit to, say, our slurm cluster that would make BOINC behave like a regular executable. The project maintainers would just go and execute their application directly. The binary I envision would fetch that application (or check a hashsum of something already in the home directory), fetch data, compute and submit. From what I understood, this is all done within boincd today, so boincd (or some flavour of it) would need an option to be started interactively.

lfield commented 5 years ago

A pilot job is just a placeholder job that is submitted via the batch system. Once it has a resource, it downloads the real job from somewhere else. This is also known as late binding, as you can delay the real scheduling to the last minute.

In this context it can be a simple bash script that controls the boinc daemon. Assuming everything is already configured, running one job can be achieved with the following.

boinc -dir . --exit_after_finish

smoe commented 5 years ago

I kept turning this over in my head. A wrapping bash script is not a problem at all. But are there naming constraints for -dir? Does this require one directory per machine, so that it would be $HOME/boinc_$(hostname -s)? Or is it one directory per job started, so that the job ID becomes part of the path? Preferably there would be a single directory shared by all boinc jobs on the cluster, each with read+write access. But then, who uploads the results? And how does one avoid confusion when such a shared directory on a cluster has ... what ... 1000 slots?

lfield commented 5 years ago

There are no naming constraints for -dir; in the command above it is set to the current directory. Usually jobs are run in a temporary scratch space. Hence the directory is not shared but transient, existing only for the duration of the job.

smoe commented 5 years ago

Ah - ok - so let us assume /tmp/boinc-scratch for the boinc directory. The wrapping pilot script is executed on the target machine and would do the following:

# test if boinc directory exists
if [ ! -d /tmp/boinc-scratch ]; then
  # create boinc directory
  mkdir /tmp/boinc-scratch
  # attach to the project
  boinccmd -dir /tmp/boinc-scratch --project_attach URL auth
fi
# ask BOINC to just run a single job and then exit
# (--exit_after_finish is an option of the boinc client, not of boinccmd)
boinc -dir /tmp/boinc-scratch --exit_after_finish

That directory is not removed upon completion, I presume. So a later job can reuse that.

lfield commented 5 years ago

I would clean up the directory afterwards and make it unique with mktemp -d . Otherwise if you have a multi-core machine you can't run parallel jobs. If the batch system already runs the job in a temporary workspace then you just make the directory there.
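The advice above (a unique directory from mktemp -d, removed afterwards) could be sketched roughly as follows. This is only an illustration under assumptions: the client binary is taken to be `boinc` on the PATH, `BOINC_BIN` is a hypothetical override variable, and `run_pilot` is a made-up function name. A fresh directory is not attached to any project, so the attach step still has to happen separately.

```shell
#!/usr/bin/env bash
# Sketch of a pilot wrapper: one private, throw-away BOINC directory per job.
# BOINC_BIN is a hypothetical override for the client binary (default: boinc).

run_pilot() {
    local boinc_bin="${BOINC_BIN:-boinc}"
    local workdir status=0

    # Unique directory, so several jobs can share one multi-core node.
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/boinc-pilot.XXXXXX") || return 1

    # Run the client on that directory; extra arguments are passed through.
    "$boinc_bin" --dir "$workdir" --exit_after_finish "$@" || status=$?

    # Clean up, whatever the client returned.
    rm -rf "$workdir"
    return "$status"
}
```

Because the directory name is generated per call, two jobs landing on the same node do not share state, and the exit status of the client is still what the batch system sees.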

davidpanderson commented 5 years ago

You need to use --fetch_minimal_work as well as --exit_when_idle; otherwise it will keep fetching jobs.

On a machine with multiple cores and/or GPUs, BOINC will (with these options) request enough work to use all device instances. With N cores, this could be one N-thread job, or N 1-thread jobs. It waits until all these jobs have finished, then exits.
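As a rough sketch, the two options could be combined in a Slurm batch script like the one below. The #SBATCH values are placeholders, `BOINC_BIN` and `BOINC_DIR` are hypothetical variables, and the data directory is assumed to be attached to a project already. Reserving enough wall time for the fetched work remains the submitter's responsibility.

```shell
#!/usr/bin/env bash
#SBATCH --job-name=boinc-pilot
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=2
#SBATCH --time=04:00:00

# Sketch: fetch one batch of work, run it, then exit.
# BOINC_BIN and BOINC_DIR are hypothetical variables; the directory is
# assumed to be attached to a project already.

run_one_batch() {
    # $1: BOINC data directory
    "${BOINC_BIN:-boinc}" --dir "$1" --fetch_minimal_work --exit_when_idle
}

# Only start the client when actually running as a Slurm job.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_one_batch "${BOINC_DIR:-$PWD}"
fi
```

The guard on SLURM_JOB_ID is a design choice of this sketch: it keeps the client from starting by accident when the script is sourced or run on a login node.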

smoe commented 5 years ago

To approach that pilot script I reserved an interactive shell:

salloc --ntasks=2 --mem=4G srun --pty /bin/bash -i

From there I started boinc, which I had just compiled from source with what conda provides. This got me attached to Einstein, but now I am stuck on something I thought would be easy to fix and cannot find:

02-Jul-2019 15:34:41 Initialization completed
02-Jul-2019 15:34:41 [---] Suspending computation - computer is in use
02-Jul-2019 15:34:41 [---] Suspending network activity - computer is in use
^C02-Jul-2019 15:34:45 [---] Received signal 2
02-Jul-2019 15:34:45 [---] Exiting

But the machine is idle.

$ uptime
 15:35:14 up 106 days,  4:41,  0 users,  load average: 0.01, 0.09, 0.12

I believe I have very forgiving settings in global_prefs.xml. The machine has network access, and boinc reports using the right HTTP proxy. Ideas?

AenBleidd commented 5 years ago

^C02-Jul-2019 15:34:45 [---] Received signal 2

According to this line, you interrupted it, right?

smoe commented 5 years ago

Yes, sorry, it was not going anywhere. Einstein lists the machine and assigned a task to it but it was never fetched.

AenBleidd commented 5 years ago

I believe it is because the computer was considered to be in use, so all other activity was paused:

02-Jul-2019 15:34:41 [---] Suspending computation - computer is in use
02-Jul-2019 15:34:41 [---] Suspending network activity - computer is in use

smoe commented 5 years ago

You were right. I seem to have found it. Had to set the CPU used threshold to 100. That value should have been ignored - but with so many project-specific settings ... there likely was an oversight on my side.

I do not see that this should be done in an automated fashion, though. @davidpanderson, would you accept command line options as the ultimate override to project settings? Maybe something like "--override parameter value"?

AenBleidd commented 5 years ago

Maybe this could work for you: https://boinc.berkeley.edu/wiki/Client_configuration ?
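One way a pilot script could pin such settings without command-line overrides is to write a global_prefs_override.xml file into the data directory before the client starts. The sketch below is an assumption about how that might look, not something confirmed in this thread: `write_prefs_override` is a made-up function name, and the value 100 simply mirrors the CPU threshold reported above.

```shell
#!/usr/bin/env bash
# Sketch: pin the preferences that matter for a batch job inside the data
# directory, so that web-side preferences are less likely to suspend the client.

write_prefs_override() {
    # $1: BOINC data directory
    cat > "$1/global_prefs_override.xml" <<'EOF'
<global_preferences>
    <run_if_user_active>1</run_if_user_active>
    <suspend_cpu_usage>100</suspend_cpu_usage>
</global_preferences>
EOF
}
```

Writing the file from the script keeps the settings next to the job definition, so a later change to project preferences on the website does not have to be tracked by hand.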

smoe commented 5 years ago

If the plan was to start completely de novo, this would require using "sed -i" on the right config files, and you don't really know how some innocent change of project preferences disturbs that threshold. I find this a bit awkward. That uncertainty would not go away if I copied the currently working directory as a template, if I understand this right.

Side-issue: Since for some reason the machine was assigned to "home" instead of "generic", the "gpu-only" option of Einstein@Home did not kick in. I hence got non-GPU tasks which are assigned to the same physical CPU:

12917 me 30  10 13.577g 201848  93144 R  83.7  0.1   4:22.13 hsgamma_FGRPB1G
12690 me 39  19  320884 309416   2920 R   2.3  0.1   1:22.69 hsgamma_FGRP5_1
12691 me 39  19  322920 311452   2920 R   2.3  0.1   1:22.69 hsgamma_FGRP5_1
12692 me 39  19  599488 586084   2796 R   2.3  0.2   1:22.72 hsgamma_FGRP5_1
12694 me 39  19  599488 586120   2796 R   2.3  0.2   1:22.73 hsgamma_FGRP5_1
12695 me 39  19  320840 309368   2920 R   2.3  0.1   1:22.66 hsgamma_FGRP5_1
12698 me 39  19  320864 309392   2920 R   2.3  0.1   1:22.80 hsgamma_FGRP5_1
12702 me 39  19  322924 311460   2920 R   2.3  0.1   1:22.71 hsgamma_FGRP5_1

It seems I also want to change the number of processing units via the command line, not via some config file.
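The thread does not name a command-line option for this. As a possible workaround, a pilot script could generate cc_config.xml itself, taking the CPU count from the Slurm allocation, so that no file needs editing by hand. This is a sketch under assumptions: `write_cc_config` is a made-up function name, and SLURM_CPUS_PER_TASK is only set when the job requested --cpus-per-task.

```shell
#!/usr/bin/env bash
# Sketch: derive the CPU count from the Slurm allocation and write it to
# cc_config.xml in the BOINC data directory.

write_cc_config() {
    # $1: BOINC data directory
    # $2: number of CPUs the client should assume (optional);
    #     falls back to SLURM_CPUS_PER_TASK, then to 1
    local ncpus="${2:-${SLURM_CPUS_PER_TASK:-1}}"
    cat > "$1/cc_config.xml" <<EOF
<cc_config>
    <options>
        <ncpus>$ncpus</ncpus>
    </options>
</cc_config>
EOF
}
```

Called as `write_cc_config "$workdir"` inside the job, this would keep the client's idea of the CPU count in line with what the batch system actually granted.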

smoe commented 5 years ago

I found boinc, when run interactively, to be fully responsive to boinccmd (the command-line counterpart of the BOINC Manager). This is nice and, admittedly, was not entirely expected.

tardigradus commented 4 years ago

Version 7.17.0 doesn't seem to have the option -dir or --dir. How do I change the directory used by the job?