easybuilders / easybuild-framework

EasyBuild is a software installation framework in Python that allows you to install software in a structured and robust way.
https://easybuild.io
GNU General Public License v2.0

make parallel builder aware of already submitted builds #113

Open boegel opened 11 years ago

boegel commented 11 years ago

The parallel builder should be aware of builds that have already been submitted as jobs, so that it doesn't submit the same build twice when builds of multiple software packages with dependencies are submitted one after the other.

fgeorgatos commented 11 years ago

Hey, there is perhaps a cheap trick to do that, i.e. leave the "state" in a filesystem area:

Create a lock file in a working area, with a filename "calculated" from the name of the module to be produced, e.g. LOCKFILE=$LOCKDIR/$(echo $MYWANNABEMODULENAME | tr '/' '_'), then do the equivalent of the following in the spawning code:

lockfile $LOCKFILE  ## ensure we are alone
qsub "easybuild() ; rm -f $LOCKFILE" || rm -f $LOCKFILE
lockfile $LOCKFILE ## this ensures no progress until the finish of the submitted task
rm -f $LOCKFILE

Because only the first entrant grabs the lock, all the others simply halt and wait; as soon as the lock is released, the module has (hopefully) been produced and life goes on.

Caveat of the proposed design: if the "common" task fails for whatever reason, the module won't be built and the build will be retried as many times as it was requested.
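
A minimal Python sketch of the same idea, in case it helps the discussion. It assumes the lock directory already exists and that exclusive file creation is atomic on the filesystem in use, which may not hold on every shared filesystem. The function names and the module name `WRF/3.4-goalf` are made up for illustration and are not part of EasyBuild.

```python
import os
import tempfile


def lock_path(lock_dir, module_name):
    """Map a module name such as 'WRF/3.4-goalf' to a flat lock file path."""
    return os.path.join(lock_dir, module_name.replace("/", "_"))


def try_acquire(path):
    """Return True if we created the lock file, False if it already existed."""
    try:
        # O_CREAT | O_EXCL fails if the file exists, so only one caller wins.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    return True


def release(path):
    """Remove the lock file; a missing file is not treated as an error."""
    try:
        os.remove(path)
    except FileNotFoundError:
        pass


if __name__ == "__main__":
    lock_dir = tempfile.mkdtemp()  # stand-in for $LOCKDIR
    path = lock_path(lock_dir, "WRF/3.4-goalf")  # hypothetical module name
    print(try_acquire(path))  # first entrant gets the lock
    print(try_acquire(path))  # second entrant does not
    release(path)
```

The second entrant would poll `try_acquire` (or simply check for the module) instead of submitting its own job. A lock left behind by a crashed submitter is not handled here.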

ps. I had to write a solution to a related problem for a local user this year... easyqsub.py anyone? ;-)

ref: http://pypi.python.org/pypi/lockfile/

boegel commented 11 years ago

That's one way, but working with lock files on the shared filesystem we have is a pain in the ass.

We have an API for PBS built into EasyBuild, so I'd rather use that directly to figure out which jobs were submitted before.

That way, we can submit the same build on different clusters (each of which has a different install path), without having to use some crazy naming scheme for the lock files.

And like you said, if something goes wrong and lock files stay behind for some reason, you can run into strange issues.
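
To make the idea concrete, here is a rough Python sketch of deduplicating against what the scheduler already knows. It deliberately does not call any real PBS or EasyBuild API: `queued_names` stands for whatever list of job names the scheduler query returns, and `submit` is a placeholder callable. Both, like the build names, are assumptions for illustration.

```python
def submit_missing(builds, queued_names, submit):
    """Submit only the builds whose job name is not already known.

    builds: iterable of build names, used as job names
    queued_names: job names the scheduler already reports
    submit: callable that submits one build and returns a job id
    """
    seen = set(queued_names)
    submitted = {}
    for name in builds:
        if name in seen:
            continue  # already queued, or submitted earlier in this run
        submitted[name] = submit(name)
        seen.add(name)
    return submitted


if __name__ == "__main__":
    queued = ["GCC-4.6.3"]  # hypothetical result of a scheduler query

    def fake_submit(name):
        # Placeholder for a real job submission.
        return "job-for-" + name

    wanted = ["GCC-4.6.3", "OpenMPI-1.4.5", "OpenMPI-1.4.5"]
    print(submit_missing(wanted, queued, fake_submit))
```

Since each cluster has its own queue, the same build name can be submitted on different clusters without any extra naming scheme, and nothing stays behind on disk if a job dies.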

fgeorgatos commented 11 years ago

Hi to all easybuilders,

so, I am back from (my regular end-of-month sailing race) vacations and now trying to catch up;

I have been reading this page on parallel builds in the FreeBSD world: http://wiki.freebsd.org/SummerOfCode2012/Parallelization_in_the_ports_collection. I have come to the conclusion that handling package dependencies in a parallel fashion is the really interesting part (think 2 WRF builds in parallel and such). That unavoidably implies some form of queue/dispatcher process, along the same lines as described here: https://www.sara.nl/systems/lisa/software/disparm

> That's one way, but working with lock files on the shared filesystem we have is a pain in the ass.

Indeed, this problem is also described at the end of the disparm page, which even provides a tested workaround based on directory creation.

btw. yes, we have NFS for $HOMEs like many others, too :-)

> We have an API for PBS built into EasyBuild, so I'd rather use that directly to figure out which jobs were submitted before.

In that case, PBS becomes the "store" of the state, and the solution would only work with PBS.

Can we find a more generic approach instead?

IMHO, this issue belongs to a more generic class of problems: how to submit jobs on systems with primitive facilities. Going in this direction would allow for generality across implementations.
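
One possible shape for such a generic approach, sketched in Python: the submitting code only talks to a small "store" with `claim` and `release`, and the backend decides where the state lives. All class and function names here are hypothetical. The directory backend relies on directory creation either succeeding or failing as a whole, in the spirit of the workaround on the disparm page; a scheduler-backed store could offer the same two methods.

```python
import os
import tempfile


class InMemoryStore:
    """Keeps state for a single submitting process only."""

    def __init__(self):
        self._claimed = set()

    def claim(self, name):
        if name in self._claimed:
            return False
        self._claimed.add(name)
        return True

    def release(self, name):
        self._claimed.discard(name)


class DirectoryStore:
    """One marker directory per build; mkdir either creates it or fails."""

    def __init__(self, root):
        self.root = root

    def _path(self, name):
        return os.path.join(self.root, name.replace("/", "_"))

    def claim(self, name):
        try:
            os.mkdir(self._path(name))
        except FileExistsError:
            return False
        return True

    def release(self, name):
        try:
            os.rmdir(self._path(name))
        except FileNotFoundError:
            pass


def submit_once(store, name, submit):
    """Submit a build unless the store says it was already claimed."""
    if not store.claim(name):
        return None
    try:
        return submit(name)
    except Exception:
        store.release(name)  # a failed submission must not block retries
        raise


if __name__ == "__main__":
    store = DirectoryStore(tempfile.mkdtemp())
    print(submit_once(store, "WRF/3.4-goalf", lambda n: "job-1"))  # hypothetical name
    print(submit_once(store, "WRF/3.4-goalf", lambda n: "job-2"))
```

This keeps the duplicate check independent of the scheduler, at the cost of having to clean up claims once a build has finished or failed.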

to be continued...

boegel commented 11 years ago

Nice feedback @fgeorgatos, we'll make sure to take this into account when we're picking up on this.