Illumina / pyflow

A lightweight parallel task engine
http://Illumina.github.io/pyflow/

Add cwd argument to addWorkflowTask #21

Open virajbdeshpande opened 6 years ago

virajbdeshpande commented 6 years ago

Currently, I can use the "cwd" argument with "addTask", as shown in the cwd demo, but using it with "addWorkflowTask" gives me an "unexpected keyword argument" error.

ctsa commented 6 years ago

The client API docs may help clarify which arguments each method accepts:

http://illumina.github.io/pyflow/WorkflowRunner_API_html_doc/index.html

We could potentially add this for addWorkflowTask, but what are the semantics you're looking for in this case? Could the same thing be accomplished with os.chdir(path) at the top of the added workflow instance?

virajbdeshpande commented 6 years ago

Thanks.

Here is an example use case. You have a dataset of multiple samples (parent workflow) and you want to run multiple analyses for each sample in a different directory (subworkflow). Each analysis gets its own workflow (subsubworkflow) and subdirectory within the sample directory.

Let's say rootdir is the directory where we run the script/parent workflow, and the cwd for the subworkflow is path. Then the semantics for the use case above would be as follows:

1) If I set cwd=path when calling subworkflow, the working directory for the subsubworkflow should automatically be set to path, not rootdir, unless subworkflow changes it using the cwd argument. In short, any subworkflow should be oblivious to rootdir and only inherit cwd from its parent.

2) It is not clear to the user whether os.chdir(rootdir) is required at the end of the subworkflow, or whether the parent workflow will continue to run in rootdir. Encoding the cwd in an argument makes it clear that the user does not need to switch back.
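To make the inheritance rule in point (1) concrete, here is a minimal sketch of how an effective working directory could be resolved. The Node class and effective_cwd method are hypothetical names for illustration only; they are not part of pyflow.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical stand-in for a workflow; not a pyflow class."""
    name: str
    parent: Optional["Node"] = None
    cwd: Optional[str] = None  # None means "inherit from the parent"

    def effective_cwd(self, rootdir: str) -> str:
        # Only the top-level workflow ever sees rootdir directly.
        base = self.parent.effective_cwd(rootdir) if self.parent else rootdir
        if self.cwd is None:
            return base
        # A relative cwd is resolved against the parent's directory;
        # an absolute cwd replaces it (os.path.join semantics).
        return os.path.normpath(os.path.join(base, self.cwd))


rootdir = os.path.abspath("rootdir")
parent = Node("parent")
sub = Node("subworkflow", parent=parent, cwd="2015-2802")
subsub = Node("subsubworkflow", parent=sub)  # no cwd given: inherits from sub
subsub_override = Node("fastq_cat", parent=sub, cwd="fastq_cat")
```

With this rule, subsub resolves to the sample directory rather than rootdir, and subsub_override resolves to a subdirectory inside it.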

For point (1), in the current version, subsubworkflow still runs in the rootdir even if I do os.chdir(path) within subworkflow.

For point (2), I confirmed that the tasks enter a race condition when using os.chdir on a local run. For example, here are two directory structures that get created by running the pyflow scripts twice:

RUN1: Correct structure

./2015-2802/fastq_cat
./2015-2802
./2015-2799/fastq_cat
./2015-2799

RUN2: Incorrect structure

./2015-2802/fastq_cat
./2015-2802/2015-2799/fastq_cat
./2015-2802/2015-2799
./2015-2802
./2015-2799
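The nesting in RUN2 is what one would expect if the sub-workflows share a single Python process, because the working directory is a process-wide setting rather than a per-thread one. The sketch below uses only the standard library and assumes (this is not confirmed in the thread) that sub-workflows run as threads in the same process on a local run:

```python
import os
import tempfile
import threading

start = os.getcwd()
target = os.path.realpath(tempfile.mkdtemp())  # stand-in for a sample directory


def worker():
    # os.chdir changes the working directory of the whole process,
    # not just of the calling thread.
    os.chdir(target)


t = threading.Thread(target=worker)
t.start()
t.join()

# The main thread never called os.chdir, yet its cwd has moved.
seen_by_main = os.path.realpath(os.getcwd())

os.chdir(start)  # restore the original directory
```

If two such threads each call os.chdir with a relative path, the result depends on which one runs first, which would explain one sample directory ending up inside the other.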

Do you think this will be fixed any time soon? Alternatively, I can switch to using absolute paths everywhere within the script and only run shell commands for external tools through addTask(cwd). It is easier to write a bash script to deploy the Pyflow script separately for each sample, but that defeats the purpose of using Pyflow.
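For reference, the absolute-path workaround can be sketched with the standard library alone. Here subprocess.run with cwd= stands in for what a per-task cwd argument does for an external command; the temporary directory and the command itself are placeholders, and only the sample and analysis directory names come from the runs above.

```python
import os
import subprocess
import sys
import tempfile

rootdir = os.path.realpath(tempfile.mkdtemp())   # stand-in for the launch directory
sample_dir = os.path.join(rootdir, "2015-2802")  # absolute, so independent of os.getcwd()
analysis_dir = os.path.join(sample_dir, "fastq_cat")
os.makedirs(analysis_dir)

before = os.getcwd()

# cwd= applies only to the child process; the parent's working
# directory is left untouched, so nothing needs to be switched back.
result = subprocess.run(
    [sys.executable, "-c", "import os; print(os.getcwd())"],
    cwd=analysis_dir,
    capture_output=True,
    text=True,
    check=True,
)
child_cwd = os.path.realpath(result.stdout.strip())

after = os.getcwd()
```

Because every path is built from an absolute rootdir, the Python side never depends on the process-wide working directory, and only the external command sees analysis_dir as its cwd.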