cooperative-computing-lab / makeflow-examples

Example workflows for the Makeflow workflow system.

sge_submit_makeflow #40

Open mlap-t opened 3 years ago

mlap-t commented 3 years ago

Dear makeflow expert,

First, thanks for the excellent tools that you've developed!

I was wondering whether there is a tool for sge similar to condor_submit_makeflow (see https://cctools.readthedocs.io/en/latest/man_pages/condor_submit_makeflow/).

Currently, when I run jobs on sge I have to remain logged in, which is a problem for long jobs and/or unstable networks.

Thanks in advance for your help,

mathieu

btovar commented 3 years ago

There is sge_submit_makeflow; it should be installed alongside condor_submit_makeflow. Please let us know if you find it, and if it works for you.

dthain commented 3 years ago

Whoops, looks like that command is not part of the install rule.

btovar commented 3 years ago

Got it, I'll fix it now.

btovar commented 3 years ago

In the meantime, the script is here:

https://raw.githubusercontent.com/cooperative-computing-lab/cctools/master/makeflow/src/sge_submit_makeflow

mlap-t commented 3 years ago

Thanks a lot. I will have to update my makeflow version from git then (so far I was using the binary tarball). I will do it later today or tomorrow. mathieu

mlap-t commented 3 years ago

Sorry for the delay in addressing this issue...

I normally run my jobs with this command: makeflow -T sge -B '-P P_antares -q long -l sps=1' --safe-submit-mode -J 250 --jx scan2d.jx

How should I do the same with sge_submit_makeflow? Thanks for your help.

btovar commented 3 years ago

The command sge_submit_makeflow is designed to work with Work Queue, but I think for now we can trick it using an environment variable, like this (all on one command line):

makeflow_ops="-T sge -B '-P P_antares -q long -l sps=1' --safe-submit-mode -J 250 --jx" ./sge_submit_makeflow -p '-P P_antares -q long -l sps=1' scand2jxprojectname scan2d.jx

The double specification of '-P ...' etc. is needed because makeflow and the jobs may run on different queues.
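As a small illustration of the "all in one command line" form: a VAR=value prefix places the variable in the environment of that single command only. The sketch below uses a plain sh child process as a stand-in for the script; that sge_submit_makeflow actually reads makeflow_ops is taken from the comment above, not verified here, and the option string is shortened for readability.

```shell
#!/bin/sh
# A VAR=value prefix exports the variable to that one command only.
# The child "sh -c" stands in for sge_submit_makeflow (an assumption
# made for illustration; the real script is not invoked here).
seen=$(makeflow_ops="-T sge -J 250" sh -c 'printf %s "$makeflow_ops"')
echo "child saw: $seen"

# The calling shell is left untouched, so nothing leaks into later commands.
echo "parent has: ${makeflow_ops:-<unset>}"
```

Because the assignment does not persist, the variable has to be repeated on every invocation, or exported once if it should apply to the whole session.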

When I'm back from the break I'll look for a cleaner solution.

mlap-t commented 3 years ago

I tried this command: makeflow_ops="-T sge -B '-P P_km3net -q long -l sps=1' --safe-submit-mode -J 250 --jx" sge_submit_makeflow -p' -P P_km3net -q long -l sps=1' scand2jxprojectname scan2d.jx

It created a single job, which I think was supposed to launch this shell script (sge_submit.sh):

#!/bin/sh
./makeflow -T wq -a -e -N scand2jxprojectname -T sge -B '-P P_km3net -q long -l sps=1' --safe-submit-mode -J 250 --jx scan2d.jx

but once the job starts nothing happens, and I get no logs. Any idea why? I couldn't find the meaning of the -e option. And what about the two -T options (wq and sge) in the shell script?

btovar commented 3 years ago

I have proposed some changes here:

https://github.com/cooperative-computing-lab/cctools/pull/2503

The proposed new sge_submit_makeflow script: https://raw.githubusercontent.com/cooperative-computing-lab/cctools/7417c0d160af94f7991ffadc0add75cc0ff1b33b/makeflow/src/sge_submit_makeflow

The command line would look something like:

./sge_submit_makeflow -T sge -p '-P P_km3net -q long -l sps=1' -E '--safe-submit-mode -J 250 --jx' scand2jxprojectname  scan2d.jx

Please let me know if that works for you!

mlap-t commented 3 years ago

I ran the command you proposed, but it behaves much like running plain 'makeflow': many jobs are submitted and I don't get the shell prompt back, so I cannot log out. My understanding was that the script would submit a single job to sge, which would in turn submit the sge sub-jobs.

btovar commented 3 years ago

My mistake, I forgot that sge nodes usually cannot submit jobs by themselves. The way we usually handle this is by directing makeflow to use the wq batch system, and have additional sge jobs to serve as workers that execute the tasks. With this, rather than having one sge job per rule in your makeflow, you have one sge job per worker.
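The layout described above can be sketched as a dry run that only prints the two commands involved. This is an illustration under assumptions, not what sge_submit_makeflow does internally: the worker count is hypothetical, sge_submit_workers is the cctools helper for starting workers as sge jobs, and the exact flag names (-N versus -M for the project name) differ between cctools releases, so check each command's -h output before running anything.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of executing them.
PROJECT=scand2jxprojectname              # project name used in this thread
QSUB_PARAMS='-P P_km3net -q long -l sps=1'   # qsub parameters from this thread
WORKERS=10                               # hypothetical number of workers

# 1. makeflow runs as a Work Queue manager, advertised under $PROJECT.
#    Tasks are handed to workers; no sge job is created per rule.
MANAGER_CMD="makeflow -T wq -N $PROJECT --jx scan2d.jx"

# 2. Each sge job is one worker that finds the manager by project name.
#    (Assumed flags; older releases use -N, newer ones -M.)
WORKER_CMD="sge_submit_workers -p '$QSUB_PARAMS' -N $PROJECT $WORKERS"

echo "$MANAGER_CMD"
echo "$WORKER_CMD"
```

With this split, the number of sge jobs equals the number of workers, independent of how many rules the workflow contains.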

I submitted a couple of fixes for this to work correctly with sge_submit_makeflow:

https://github.com/cooperative-computing-lab/cctools/pull/2504

https://github.com/cooperative-computing-lab/cctools/blob/4bbdf06d9c9bd08c66e0e27ab8be5218221934a3/makeflow/src/sge_submit_makeflow

You would do something like:

./sge_submit_makeflow  -p '-P P_km3net -q long -l sps=1' -E '--safe-submit-mode -J 250 --jx' scand2jxprojectname  scan2d.jx

There is a chance that your workflow will not work as is. This is because with wq all references to files are resolved with respect to the local filesystem where the jobs are running, rather than the shared filesystem assumed for sge jobs. This can usually be fixed easily by adding all input files (including executables) to your makeflow's rule specifications. We can help you make these declarations in case you have any questions.
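As a rough sketch of what such a declaration looks like, the snippet below writes a one-rule JX file in which the executable is listed among the inputs. All file names (scan.sh, input.dat, output.dat, scan2d_example.jx) are hypothetical and are not taken from the actual scan2d.jx.

```shell
#!/bin/sh
# Write a minimal JX workflow with a single rule. Every file the command
# touches, including the script it executes, appears under "inputs" so
# that Work Queue can copy it to the worker's local sandbox.
cat > scan2d_example.jx <<'EOF'
{
  "rules": [
    {
      "command": "./scan.sh input.dat > output.dat",
      "inputs": ["scan.sh", "input.dat"],
      "outputs": ["output.dat"]
    }
  ]
}
EOF

echo "wrote scan2d_example.jx"
```

A rule that relies on a file from the shared filesystem without listing it would work under -T sge but may fail under -T wq, because the worker only sees what was declared.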