Open smoe opened 5 years ago
Please clarify - do you mean that jobs are submitted via slurm and processed by BOINC, or vice versa? Also, how is boinccmd involved? It's not part of the job submission mechanism.
Are you referring to back filling your batch system with jobs or something else?
@lfield, yes, this kind of describes it. Ideally, the system administrators would run boinc in the background which would then idle itself whenever there is load. But this is often not desired, let alone for benchmarking some new algorithm when you do not want variable loads in the background. So, if I could submit BOINC tasks to our queue then this would settle it.
@davidpanderson, I admit I do not know exactly what I was after. I just saw resources at my disposal that I cannot reach with the client we have. I would need something that knows how to execute a job and return the results (or buffer them in the home directory) in a way that they could be submitted once the job is completed. If I could run boincd interactively then this may be just fine. But you may have different ideas.
There are side-issues to it. I was proposing a buffer of results for instance that would collect the results from multiple machines. This may be more efficient than contacting the server for individual jobs. If that is truly more efficient, maybe that would also allow boinc to scale a bit more by introducing distributed.net-like proxy-buffers to integrate submissions from multiple users?
I think pilot jobs or job agents would work for your use case. Create a job that will configure and start the boinc client as you wish. You can then code your termination condition. Set no new tasks if you want this to be graceful. Playing with your batch scheduling, boinc client preferences, and termination condition should give you the flexibility to do what you want. If you do come up with a generic job, you may wish to share it.
@smoe would pilot jobs be a solution for you?
I felt a bit overwhelmed. I have read through http://saga-project.github.io/BigJob/ (googling for pilot job is not easy) to grasp what you may mean. My hurdle is that I do not see the command line to submit to, say, our slurm cluster that would make BOINC behave like a regular executable. The project maintainers would just go and execute their application directly. The binary I envision would fetch that application (or check a hashsum of something already in the home directory), fetch data, compute, and submit. Today this is all done within boincd from what I understood, so boincd (or some flavour of it) would need to get an option to be started interactively.
A pilot job is just a place holder job that is submitted via the batch system. Once it has a resource it downloads the real job from somewhere else. It is also known as late binding as you can delay the real scheduling to the last minute.
In this context it can be a simple bash script that controls the boinc daemon. Assuming everything is already configured, running one job can be achieved with the following.
boinc -dir . --exit_after_finish
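The one-liner above can be wrapped into a slurm batch script. A minimal sketch (the job name, time limit, and scratch path are placeholders, and the exact spelling of the client flags varies between client versions):

```shell
# Write a minimal pilot batch script; the real work is fetched by the
# boinc client itself once the job lands on a node (late binding).
cat > pilot.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=boinc-pilot
#SBATCH --ntasks=1
#SBATCH --time=08:00:00
cd "${TMPDIR:-/tmp}"                # run in node-local scratch
boinc --dir . --exit_after_finish   # start a preconfigured client, exit when done
EOF
bash -n pilot.sh   # syntax check only; submit for real with: sbatch pilot.sh
```

The `#SBATCH` directives are ordinary comments to bash, so the same file can be syntax-checked locally and submitted with `sbatch pilot.sh` on the cluster.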
I kept my head spinning about it. Some wrapping bash script is not a problem at all. Just, what are the naming constraints for -dir? Does this require one directory per machine, so it would be $HOME/boinc_$(hostname -s)? Is it one directory per job started, so the job ID becomes part of the path? Preferably there would be a single directory shared by all boinc jobs with read and write access on the cluster. But then - who uploads the results? And how does one not get confused when such a shared directory on a cluster has ... what ... 1000 slots?
There are no naming constraints for -dir, which defaults to the current directory. Usually jobs are run in a temporary scratch space. Hence the directory is not shared but is transient and only exists for the duration of the job.
Ah - ok - so let us assume /tmp/boinc-scratch for the boinc directory. The wrapping pilot script is executed on the target machine and would
# test if the boinc directory exists
if [ ! -d /tmp/boinc-scratch ]; then
    # create the boinc directory
    mkdir /tmp/boinc-scratch
    # attach to the project (client option; boinccmd would need a running client)
    boinc -dir /tmp/boinc-scratch --attach_project URL auth
fi
# ask BOINC to just run a single job and then exit
boinc -dir /tmp/boinc-scratch --exit_after_finish
That directory is not removed upon completion, I presume. So a later job can reuse that.
I would clean up the directory afterwards and make it unique with mktemp -d. Otherwise, if you have a multi-core machine, you can't run parallel jobs. If the batch system already runs the job in a temporary workspace then you just make the directory there.
You need to use --fetch_minimal_work as well as --exit_when_idle; otherwise it will keep fetching jobs.
On a machine with multiple cores and/or GPUs, BOINC will (with these options) request enough work to use all device instances. With N cores, this could be one N-thread job, or N 1-thread jobs. It waits until all these jobs have finished, then exits.
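Putting these replies together, a sketch of the whole pilot wrapper (the project URL and account key are placeholders passed via environment variables, and the flag spellings may differ between client versions):

```shell
# Generate a per-job wrapper: a unique scratch directory (so parallel jobs on a
# multi-core node don't collide), cleaned up on exit, and a client that fetches
# only enough work for the available cores and exits when it is done.
cat > run_one_boinc_batch.sh <<'EOF'
#!/bin/bash
set -euo pipefail
WORKDIR=$(mktemp -d /tmp/boinc-scratch.XXXXXX)  # unique per job
trap 'rm -rf "$WORKDIR"' EXIT                   # transient, like batch scratch space
cd "$WORKDIR"
boinc --dir . \
      --attach_project "$PROJECT_URL" "$ACCOUNT_KEY" \
      --fetch_minimal_work \
      --exit_when_idle
EOF
bash -n run_one_boinc_batch.sh   # syntax check; run it inside a slurm allocation
```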
To approach that pilot script I reserved an interactive shell:
salloc --ntasks=2 --mem=4G srun --pty /bin/bash -i
From there I started boinc, which I had just compiled from source with what conda provides. This got me attached to Einstein, but now I am stuck on something I thought would be easy to fix, yet I cannot find it:
02-Jul-2019 15:34:41 Initialization completed
02-Jul-2019 15:34:41 [---] Suspending computation - computer is in use
02-Jul-2019 15:34:41 [---] Suspending network activity - computer is in use
^C02-Jul-2019 15:34:45 [---] Received signal 2
02-Jul-2019 15:34:45 [---] Exiting
But the machine is idle.
$ uptime
15:35:14 up 106 days, 4:41, 0 users, load average: 0.01, 0.09, 0.12
I think I have very forgiving settings in global_prefs.xml. The machine has network access, and boinc reports using the right HTTP proxy. Ideas?
^C02-Jul-2019 15:34:45 [---] Received signal 2
According to this line, you interrupted it, right?
Yes, sorry, it was not going anywhere. Einstein lists the machine and assigned a task to it but it was never fetched.
I believe it is because the computer was considered in use, so every other activity was paused:
02-Jul-2019 15:34:41 [---] Suspending computation - computer is in use
02-Jul-2019 15:34:41 [---] Suspending network activity - computer is in use
You were right. I seem to have found it: I had to set the CPU usage threshold to 100. That value should have been ignored - but with so many project-specific settings, there was likely an oversight on my side.
I do not see how this could be done in an automated fashion, though. @davidpanderson, would you accept command line options as the ultimate override of project settings? Maybe something like "--override parameter value"?
Maybe this could work for you: https://boinc.berkeley.edu/wiki/Client_configuration ?
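One existing hook in that direction is the local override file, which takes precedence over the web-based project preferences on that host only. A sketch (the data directory path is an assumption; the tag names follow the documented global preferences format, with 0 disabling the busy-CPU check):

```shell
# Write a global_prefs_override.xml into the client's data directory so the
# "computer is in use" checks no longer suspend computation or network activity.
BOINC_DIR="${BOINC_DIR:-/tmp/boinc-scratch}"   # assumed data directory
mkdir -p "$BOINC_DIR"
cat > "$BOINC_DIR/global_prefs_override.xml" <<'EOF'
<global_preferences>
   <run_if_user_active>1</run_if_user_active>
   <suspend_cpu_usage>0</suspend_cpu_usage>
   <run_on_batteries>1</run_on_batteries>
</global_preferences>
EOF
# A running client reloads it with: boinccmd --read_global_prefs_override
```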
If the previous plan was to start completely de novo, this would require running "sed -i" on the right config files, and you don't really know how some innocent change of project preferences disturbs that threshold. I find this a bit awkward. That uncertainty would not go away if I copied the currently working directory as a template, if I understand this right.
Side-issue: since for some reason the machine was assigned to the "home" location instead of "generic", the "gpu-only" option of Einstein@Home did not kick in. I hence got non-GPU tasks, which are assigned to the same physical CPU:
12917 me 30 10 13.577g 201848 93144 R 83.7 0.1 4:22.13 hsgamma_FGRPB1G
12690 me 39 19 320884 309416 2920 R 2.3 0.1 1:22.69 hsgamma_FGRP5_1
12691 me 39 19 322920 311452 2920 R 2.3 0.1 1:22.69 hsgamma_FGRP5_1
12692 me 39 19 599488 586084 2796 R 2.3 0.2 1:22.72 hsgamma_FGRP5_1
12694 me 39 19 599488 586120 2796 R 2.3 0.2 1:22.73 hsgamma_FGRP5_1
12695 me 39 19 320840 309368 2920 R 2.3 0.1 1:22.66 hsgamma_FGRP5_1
12698 me 39 19 320864 309392 2920 R 2.3 0.1 1:22.80 hsgamma_FGRP5_1
12702 me 39 19 322924 311460 2920 R 2.3 0.1 1:22.71 hsgamma_FGRP5_1
It seems I also want to change the number of processing units via the command line, not via some config file.
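For the CPU count specifically there is already a client-side knob: the `<ncpus>` option in cc_config.xml. A sketch that derives the value from the slurm allocation (the data directory path is an assumption; SLURM_CPUS_ON_NODE is set inside an allocation, with a fallback of 1 here):

```shell
# Cap the number of CPUs the client uses via cc_config.xml rather than
# project preferences, inheriting the size of the slurm allocation.
BOINC_DIR="${BOINC_DIR:-/tmp/boinc-scratch}"
NCPUS="${SLURM_CPUS_ON_NODE:-1}"
mkdir -p "$BOINC_DIR"
cat > "$BOINC_DIR/cc_config.xml" <<EOF
<cc_config>
  <options>
    <ncpus>${NCPUS}</ncpus>
  </options>
</cc_config>
EOF
# A running client picks this up with: boinccmd --read_cc_config
```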
I found boinc, run interactively, to be fully responsive to boinccmd (the command-line counterpart of the boinc manager). This is nice and admittedly not what I had expected.
Version 7.17.0 doesn't seem to have the option -dir or --dir. How do I change the directory used by the job?
Hello, is there a humane way to submit BOINC tasks as jobs in slurm or similar? Nodes have Internet access and a shared home directory. I would not mind if boinccmd submitted those jobs on my behalf, but what I had in mind was to submit boinccmd itself within a bash script for a certain number of tasks (likely 1), and it would be my responsibility to reserve sufficient time for it. Many thanks, Steffen