DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

toil-wdl-runner generates scripts that ignore extra arguments like --batchSystem Slurm #2721

Closed: azzaea closed this issue 4 years ago

azzaea commented 5 years ago

Hi,

So, Toil has some support for WDL code via toil-wdl-runner. Can this be used to submit jobs to the Slurm scheduler?

From the documentation, I'm doing the following:

$ export TOIL_SLURM_ARGS="-p normal"
$ toil-wdl-runner --batchSystem=SLURM sleep_process.wdl sleep_process_workflow.json
Traceback (most recent call last):
  File "/igbgroup/a-m/azzaea/scalability-tst/wdl/toilwdl_compiled.py", line 1, in <module>
    from toil.job import Job
  File "/igbgroup/a-m/azzaea/scalability-tst/wdl/toil.py", line 1, in <module>
    from toil.job import Job
ImportError: No module named job
Traceback (most recent call last):
  File "/home/apps/software/Python/2.7.13-IGB-gcc-4.9.4/bin/toil-wdl-runner", line 11, in <module>
    sys.exit(main())
  File "/home/apps/software/Python/2.7.13-IGB-gcc-4.9.4/lib/python2.7/site-packages/toil/wdl/toilwdl.py", line 141, in main
    subprocess.check_call(cmd)
  File "/home/apps/software/Python/2.7.13-IGB-gcc-4.9.4/lib/python2.7/site-packages/subprocess32.py", line 307, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess32.CalledProcessError: Command '['python', '/igbgroup/a-m/azzaea/scalability-tst/wdl/toilwdl_compiled.py', '--batchSystem=SLURM']' returned non-zero exit status 1.

Where am I going wrong?

Thank you, Azza


joelarmstrong commented 5 years ago

Hi Azza,

Looks like you might have a file named toil.py in your current directory there (/igbgroup/a-m/azzaea/scalability-tst/wdl/). That confuses Python a bit because it shadows the toil library: when trying to import the toil library, Python instead loads that toil.py file. Moving or renaming that file should help.
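If it's ever unclear which module Python is actually picking up, a quick check like the following (a minimal sketch; the script name is just an example) prints the path of whatever wins the name toil:

# which_toil.py: show which "toil" module Python actually imports
import toil
print(toil.__file__)  # a path ending in ./toil.py means the real library is being shadowed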

azzaea commented 5 years ago

Wow! That was quick! I'm double-checking now, and have removed the toil.py script. However, my workflow is still processed locally (and not submitted to Slurm, as I can see no jobs running when checking squeue).

I have seen with CWL that a way to submit to Slurm is to pass --batchSystem, so I tried the same here:

$ toil-wdl-runner --batchSystem Slurm sleep_process.wdl sleep_process_workflow.json
Traceback (most recent call last):
  File "/home/apps/software/Python/2.7.13-IGB-gcc-4.9.4/bin/toil-wdl-runner", line 11, in <module>
    sys.exit(main())
  File "/home/apps/software/Python/2.7.13-IGB-gcc-4.9.4/lib/python2.7/site-packages/toil/wdl/toilwdl.py", line 107, in main
    raise RuntimeError('Unsupported Secondary File Type.  Use json.')
RuntimeError: Unsupported Secondary File Type.  Use json.

However, it seems to think the JSON file is missing, even though it is right there on the command line. I'm not sure how to proceed from here; would you kindly advise?

adamnovak commented 5 years ago

Something is going wrong with toilwdl's handling of unknown arguments, which it wants to forward on to the synthesized Toil workflow that it generates. I suspect --batchSystem is being taken as the first positional argument to the generator script, and Slurm is being taken as the second.

Try specifying the .wdl and .json files as the first two arguments, and moving the extra arguments that control execution of the generated Toil workflow to the end of the command line:

toil-wdl-runner sleep_process.wdl sleep_process_workflow.json --batchSystem Slurm

azzaea commented 5 years ago

Thank you @adamnovak. toil-wdl-runner simply ignores the --batchSystem Slurm part when called as per your hint (so it is equivalent to just calling toil-wdl-runner sleep_process.wdl sleep_process_workflow.json), and the processing is done locally on the head node in this case.

adamnovak commented 5 years ago

Hm. That's definitely a bug. You might be able to work around it by getting the generator to generate the workflow code and then invoking it manually. The real fix is to work out what is going wrong with the argument processing in toil-wdl-runner and fix it.

azzaea commented 5 years ago

Bummer! I got so interested in Toil, as it is the only other executor (to the best of my knowledge) that runs WDL, and it is quite fast compared with Cromwell. (I'm actually benchmarking a few workflow management systems at present, testing their limits, where they excel, and where one is better off with system X or maybe something entirely different.)

I'm not familiar with the code base here, but a quick look suggests that the other arguments (like those for a resource manager) should be handled there, but they are not.

Regarding the other suggestion of working with the generator code manually: would you kindly elaborate?

Thank you again, Azza

adamnovak commented 5 years ago

The extra arguments are supposed to be forwarded on to the generated script at https://github.com/DataBiosphere/toil/blob/47befdcdbfaa9e43d396ee828cbff87050eb9dde/src/toil/wdl/toilwdl.py#L147
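As a rough illustration of that intent (a hand-written sketch of the pattern, not the actual toilwdl.py code; the file name toilwdl_compiled.py is reused from the traceback above), the front end parses only the options it knows about and appends everything else to the command that runs the generated script:

import argparse
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument('wdl_file')
parser.add_argument('json_file')
# parse_known_args() splits recognized options from everything else,
# so flags like --batchSystem Slurm land in extra_args
args, extra_args = parser.parse_known_args()

# forward the unrecognized options to the generated workflow script
subprocess.check_call([sys.executable, 'toilwdl_compiled.py'] + extra_args)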

Basically, toil-wdl-runner translates WDL into Python code that uses Toil as a library, and runs the resulting Python code. This is different from how we handle CWL, which is more or less interpreted.

If you look at https://github.com/DataBiosphere/toil/blob/47befdcdbfaa9e43d396ee828cbff87050eb9dde/src/toil/wdl/toilwdl.py#L140 it looks like if you put toil-wdl-runner into "dev mode", you can generate the Python translation of the workflow without it being immediately run and deleted. So you should be able to run toil-wdl-runner with the WDL, the JSON, and "--dev_mode", find the .py file (which I think it drops in the current directory), and then run that with the other arguments you want to use.

azzaea commented 5 years ago

Well, I think it needs to be more complicated than this.

The compiled file (if not made aware of special parameters, like the scheduler and the runtime attributes in WDL tasks) is clueless as to how to submit jobs and just includes the raw shell invocation from the command part of the WDL task. Looking at this file for my code, I believe that somewhere near line 74 there should have been some kind of wrapper with the scheduler's appropriate commands (sbatch for Slurm, qsub for Torque, etc.), customized as per the WDL task runtime attributes (memory, cpu, etc.).

If it helps, I'm attaching the compiled Toil code (which is runnable on its own), the original WDL code, and my JSON file.

adamnovak commented 5 years ago

The actual sbatch/qsub stuff happens behind the scenes in the toil module, bits of which get imported at the top of the generated script. Toil picks the batch system to actually use based on the options it is passed. For Slurm, the actual command generation logic is here.

The commands that get submitted aren't just the raw sleep commands that your WDL workflow asks for, but rather commands that run some Toil Python stuff that imports the script and runs the run() method of the appropriate class, depending on what step in the workflow it is trying to do. The requirements for the jobs are set up by the classes' __init__ methods; you can see that the one for running the sleep command requires at least 1 core.
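As an illustration only (a hand-written sketch using Toil's Job API, not the code toil-wdl-runner actually emits; the class name SleepJob is hypothetical), a generated step conceptually looks like a Job subclass whose __init__ declares the resource requirements and whose run() executes the task's command:

import subprocess
from toil.job import Job

class SleepJob(Job):
    def __init__(self):
        # resource requirements for the batch system are declared here,
        # not inside the shell command itself
        super(SleepJob, self).__init__(cores=1, memory='2G', disk='1G')

    def run(self, fileStore):
        # the raw command from the WDL task's command block
        subprocess.check_call(['sleep', '10'])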

I think the real problem here is that the generated code does not actually bother parsing its command line arguments. It just says:

options = Job.Runner.getDefaultOptions("./toilWorkflowRun")

The single machine batch system is the default, and getDefaultOptions doesn't look at sys.argv to see if any other batch system has been specified.

According to the docs, we would want something like this to turn it into a real well-behaved Toil script that will respect --batchSystem:

parser = Job.Runner.getDefaultArgumentParser()
options = parser.parse_args()

If you swap in that code instead, you would be able to run it with something like:

python toilwdl_compiled.py ./toilWorkflowRun --batchSystem slurm

I think we need to change the code generator to include code that parses command line arguments, and also to pass along the job store (./toilWorkflowRun in this case, where the data describing in-flight jobs is stored) as a command-line argument rather than embedding it in the generated script.
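Put together, a minimal sketch of what such a generated entry point could look like (the names and the placeholder step are illustrative, not the real generator output) is:

import time
from toil.job import Job

def sleep_step(job):
    # stand-in for the workflow's real task body
    time.sleep(10)

if __name__ == '__main__':
    # a parser that already knows all of Toil's options (--batchSystem, etc.)
    # and takes the job store location as a positional argument
    parser = Job.Runner.getDefaultArgumentParser()
    options = parser.parse_args()

    Job.Runner.startToil(Job.wrapJobFn(sleep_step), options)

A script like that would then honor an invocation such as python toilwdl_compiled.py ./toilWorkflowRun --batchSystem slurm, as above.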

azzaea commented 4 years ago

Thank you @adamnovak. I got a chance to look at this again, and this pull request does the trick :)

For completeness, here is how to do things:

$ # export the needed Slurm options
$ export TOIL_SLURM_ARGS="-p normal"
$
$ ## Manual invocation:
$ # 1. Generate a compiled python script:
$ toil-wdl-runner --dev_mode 3 <wdl script> <json inputs file>
$
$ # 2. Submit that script to Slurm:
$ python toilwdl_compiled.py --batchSystem Slurm <job store>
$
$ ## Alternatively, invoking the Toil parser directly (doing it all at once):
$ toil-wdl-runner <wdl script> <json inputs file> --batchSystem Slurm <job store>