Closed jirikuncar closed 5 years ago
I think users would prefer less cryptic naming here... something like renku run --input src/ --output out/ python script.py
I'm still reading more on CWL, but here's my current status on this issue:
I don't think that this secondaryFiles option will allow us to specify generic non-input dependencies. I have the impression that CWL deliberately chooses not to support this and instead encourages users to write their tools in such a way that all inputs are passed in as arguments.
The secondaryFiles option seems to be more like a backdoor to enable very frequent but very specific use cases from bioinformatics, where a bunch of secondaryFiles are piggybacking on an actual input argument. However, secondaryFiles cannot exist on their own.
Allowing users to specify secondaryFiles through Renku would certainly be possible. In this case I would suggest simply assuming that if the primary file has not changed, the secondary files won't have changed either. I think this assumption would be very much in the spirit of how secondary files are specified in CWL. But I'm not sure how useful this feature would currently be.
Ok, I think the clean way of having non-input dependencies in terms of CWL is to still define these extra dependencies as inputs but without the inputBinding. As @mohammad-sdsc has pointed out, this will prevent the inputs from being passed to the command line as arguments. In this case the input is still staged for the CWL execution and the temporary path can be accessed using $(inputs.inputFile.path) and, for example, exported as an environment variable. Then the code would have to be written such that it picks up non-input dependencies from environment variables.
However, for the example of a Python module which should be importable, it could also be enough to just add the folder containing the to-be-imported module to the PYTHONPATH.
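The script side of the environment-variable approach could look roughly like the sketch below. It is only an illustration: the variable name CODE_DIR and both helper functions are hypothetical, and it assumes the CWL tool exports the staged path under that name.

```python
import os
import sys
from pathlib import Path

# Hypothetical variable name; the CWL tool would have to export it.
DEP_ENV_VAR = "CODE_DIR"


def resolve_dependency_dir(environ=os.environ, var_name=DEP_ENV_VAR):
    """Return the staged dependency directory named by an environment variable."""
    value = environ.get(var_name)
    if not value:
        raise RuntimeError(f"{var_name} is not set; cannot locate dependencies")
    return Path(value)


def make_importable(directory, search_path=sys.path):
    """Put the directory at the front of the module search path (only once)."""
    entry = str(directory)
    if entry not in search_path:
        search_path.insert(0, entry)
    return search_path
```

A script would call make_importable(resolve_dependency_dir()) before importing its own modules, which has the same effect as adding the folder to PYTHONPATH.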
I have a better picture now of how a solution to this could be implemented, but I'm still not sure about the use case(s) and the expected behaviour.
Ok, I have another solution which comes closest to the desired general support of relative paths. It consists of defining an entire directory as an input, not binding this input to the command and finally using the InitialWorkDirRequirement on this input directory. This last requirement causes the input directory to be staged directly in the working directory. Look at this minimal working example:
tool.cwl
cwlVersion: v1.0
class: CommandLineTool
baseCommand: ["python", "code/script.py"]
inputs:
  codeDir:
    type: Directory
outputs: []
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.codeDir)
job.yml
codeDir:
  class: Directory
  path: code
content of code directory
|____code
| |____script.py
| |____myLib
| | |____helloWorld.py
helloWorld.py
def helloWorld():
    return "Hello World"
script.py
from myLib.helloWorld import helloWorld
print(helloWorld())
This could be invoked with renku run --depends-on ./code ... . When building the KG or deciding whether to re-execute a step during renku update, the code directory would just be considered a regular input. The drawback of this is that things like
--depends-on ../../some-dir
or
--depends-on ./code/subfolder/some-dir
will both not work, as the CWL runner would stage some-dir directly in the workdir.
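The drawback can be illustrated with a small model of the staging rule. This is a sketch, assuming (as described above) that a directory listed in InitialWorkDirRequirement is staged under its basename; staged_name is a hypothetical helper, not part of Renku or cwltool.

```python
import posixpath


def staged_name(dependency):
    """Name the directory would get in the workdir (assumed: its basename)."""
    return posixpath.basename(posixpath.normpath(dependency))


for dep in ["./code", "../../some-dir", "./code/subfolder/some-dir"]:
    print(f"{dep!r} -> ./{staged_name(dep)}")
```

Only ./code ends up at the relative location the script expects; the other two collapse to ./some-dir, so their original relative paths no longer resolve.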
Sounds great. But are we taking advantage of an oversight in CWL (allowing hardcoded inputs) that could be fixed in a future release to enforce best practices?
And can we have multiple --depends-on flags (these could be secondaryFiles) if we want to span several dirs or files? E.g. I imagine that requirements.txt and environment.yml would be good to have by default.
A Workflow Platform is where cwl-runner is executing. It interprets the CWL document and sets up the Runtime Environment.
A Runtime Environment is where the command line tool is executing. It is created by cwl-runner before the execution of the command line tool.
The Runtime Environment has a designated output directory into which output files produced by the tool execution are written. It also has a designated temporary directory, which can be used as a writable scratch directory. Once the tool terminates, the contents of this temporary directory are no longer considered accessible. The system temporary directory (i.e. /tmp) is also available to the tool for writing temporary files. The tool cannot write to any directory in the Runtime Environment other than these 3 directories.
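The three-directory rule can be expressed as a small check. The sketch below is only a model of the rule as stated above; is_writable is a hypothetical helper, it expects absolute, already-normalized POSIX paths, and it does not resolve symlinks or "..".

```python
from pathlib import PurePosixPath


def is_writable(path, outdir, tmpdir, system_tmp="/tmp"):
    """True if `path` lies inside one of the three writable directories."""
    target = PurePosixPath(path)
    for root in (outdir, tmpdir, system_tmp):
        root = PurePosixPath(root)
        # A path is allowed if it is the root itself or anywhere below it.
        if target == root or root in target.parents:
            return True
    return False
```

For example, with outdir="/work/out" and tmpdir="/work/tmp", a write to /work/out/result.txt is allowed while a write to /home/user/result.txt is not.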
The tool is executed in a new, empty environment where only a few environment variables are defined. Two important ones are HOME, which points to the absolute path of the designated output directory, and TMPDIR, which points to the absolute path of the designated temporary directory.
When the execution of the tool starts, the current working directory is set to the designated output directory.
The location of the tool's inputs is not defined by CWL and is implementation-dependent. Moreover, CWL does not define any mechanism (e.g. environment variables) to locate inputs. However, there are two guarantees regarding the relative location of inputs:
If an input has a secondaryFiles field, all the files and directories referred to in this field are copied to the same location as the input (TODO: what if there is a relative path?).
If inputs are listed in InitialWorkDirRequirement, a symbolic link is created in the designated output directory which points to those entries (I'm not sure what happens if it runs on another machine).
To visualize things, consider the following directory tree:
.
├── code
│   ├── file1
│   ├── script.py
│   └── submodule
│       └── file2
├── data
│   └── files
│       ├── file3
│       └── file4
├── file5
├── output
│   └── file6
└── tool.cwl
script.py is an implementation of the cat command in Python: it writes the contents of all input file arguments except the last one into the last one. It also reads and prints the "HOME" and "TMPDIR" environment variables:
import os
import sys

print("************************")
print("DESIGNATED OUTPUT DIRECTORY:", os.getenv("HOME"))
print("DESIGNATED TEMPORARY DIRECTORY:", os.getenv("TMPDIR"))
print("************************")

assert len(sys.argv) >= 3

with open(sys.argv[-1], 'w') as output:
    for name in sys.argv[1:-1]:
        with open(name) as input:
            for line in input:
                output.write(line)
The tool.cwl runs python3 code/script.py code/file1 code/submodule/file2 data/files/file3 data/files/file4 file5 output/file6, which basically concatenates the contents of file1, file2, file3, file4, and file5 and puts the result in output/file6. This is what it looks like in CWL:
arguments: []
baseCommand:
  - python3
class: CommandLineTool
cwlVersion: v1.0
hints: []
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
    inputBinding:
      position: 1
    type: File
  input_2:
    default:
      class: File
      path: code/file1
    inputBinding:
      position: 2
    type: File
  input_3:
    default:
      class: File
      path: code/submodule/file2
    inputBinding:
      position: 3
    type: File
  input_4:
    default:
      class: File
      path: data/files/file3
    inputBinding:
      position: 4
    type: File
  input_5:
    default:
      class: File
      path: data/files/file4
    inputBinding:
      position: 5
    type: File
  input_6:
    default:
      class: File
      path: file5
    inputBinding:
      position: 6
    type: File
  input_7:
    default: output/file6
    inputBinding:
      position: 7
    type: string
outputs:
  output_0:
    outputBinding:
      glob: $(inputs.input_7)
    type: File
  output_1:
    outputBinding:
      glob: output
    type: Directory
permanentFailCodes: []
requirements:
  - class: InlineJavascriptRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entry: '$({"listing": [], "class": "Directory"})'
        entryname: output
        writable: true
successCodes: []
temporaryFailCodes: []
I used the reference cwl-runner to run the tool.
$ cwl-runner tool.cwl
/usr/local/bin/cwl-runner 1.0.20181012180214
Resolved 'tool.cwl' to 'file:///home/mohammad/various/comprehensive-cwl-examaple/tool.cwl'
[job tool.cwl] /tmp/tmp3o5evbhd$ python3 \
/tmp/tmpgcpuaxj2/stg5d098d76-6f49-4a65-ab38-aca6f4aaa7c2/script.py \
/tmp/tmpgcpuaxj2/stge21109cb-71b9-4548-9933-b84284323784/file1 \
/tmp/tmpgcpuaxj2/stg1dfecd9c-a60c-430e-877d-8b9ab6cfa2c7/file2 \
/tmp/tmpgcpuaxj2/stgab85c2ab-6535-43c8-b2d2-fef559d5fbe9/file3 \
/tmp/tmpgcpuaxj2/stg878ad716-a1fc-49ef-af09-0d94484950b6/file4 \
/tmp/tmpgcpuaxj2/stgcf970a58-3b10-46a2-81b7-1dc16fa2222c/file5 \
output/file6
************************
DESIGNATED OUTPUT DIRECTORY: /tmp/tmp3o5evbhd
DESIGNATED TEMPORARY DIRECTORY: /tmp/tmpk1b_5maz
************************
[job tool.cwl] completed success
{
    "output_0": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
        "basename": "file6",
        "class": "File",
        "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
        "size": 30,
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
    },
    "output_1": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output",
        "basename": "output",
        "class": "Directory",
        "listing": [
            {
                "class": "File",
                "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
                "basename": "file6",
                "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
                "size": 30,
                "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
            }
        ],
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output"
    }
}
Final process status is success
The cwl-runner creates the runtime environment to execute the tool and prints the command which it uses to run the tool:
...
/tmp/tmp3o5evbhd$ python3 \
/tmp/tmpgcpuaxj2/stg5d098d76-6f49-4a65-ab38-aca6f4aaa7c2/script.py \
/tmp/tmpgcpuaxj2/stge21109cb-71b9-4548-9933-b84284323784/file1 \
/tmp/tmpgcpuaxj2/stg1dfecd9c-a60c-430e-877d-8b9ab6cfa2c7/file2 \
/tmp/tmpgcpuaxj2/stgab85c2ab-6535-43c8-b2d2-fef559d5fbe9/file3 \
/tmp/tmpgcpuaxj2/stg878ad716-a1fc-49ef-af09-0d94484950b6/file4 \
/tmp/tmpgcpuaxj2/stg078bb41c-d670-4e4e-a845-48e086951871/file5 \
output/file6
As you can see in this command, each input file is put in a different directory regardless of the initial input directory structure. For example, script.py and file1 are in the same directory in the Workflow Platform's filesystem, but in the Runtime Environment one is put in /tmp/tmpgcpuaxj2/stg5d098d76-6f49-4a65-ab38-aca6f4aaa7c2/ and the other is put in /tmp/tmpgcpuaxj2/stge21109cb-71b9-4548-9933-b84284323784/. This means that if we access file1 directly in script.py, we will get a FileNotFoundError when executing this tool.
In general, CWL does not specify whether runners should preserve the directory structure of input files, and we cannot assume that runner implementations will do so.
Also note that the designated output directory, /tmp/tmp3o5evbhd, is different from the directory that contains the inputs, /tmp/tmpgcpuaxj2. This means that we cannot access output files directly in script.py either.
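The staging behaviour observed above can be modelled in a few lines. This is a sketch of what the log shows, not of how cwl-runner is implemented; stage_inputs and the staging root are hypothetical.

```python
import posixpath
import uuid


def stage_inputs(paths, staging_root="/tmp/staging"):
    """Map each input path to '<staging_root>/stg<uuid>/<basename>'."""
    staged = {}
    for path in paths:
        # Every input gets its own staging directory; only the basename survives.
        stage_dir = posixpath.join(staging_root, f"stg{uuid.uuid4()}")
        staged[path] = posixpath.join(stage_dir, posixpath.basename(path))
    return staged


staged = stage_inputs(["code/script.py", "code/file1"])
script_dir = posixpath.dirname(staged["code/script.py"])
# Looking for file1 next to the staged script does not find the staged file1.
print(posixpath.join(script_dir, "file1") == staged["code/file1"])
```

The comparison is false because the two files, although siblings in the project, land in different staging directories.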
secondaryFiles Field
If we add file1 as a secondaryFiles entry of script.py (everything else remains the same):
...
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
      secondaryFiles:
        - class: File
          path: code/file1
...
Running the tool shows that script.py and file1 are put in the same directory, meaning that accessing file1 directly in script.py would work this time:
...
[job tool.cwl] /tmp/tmp93wzl5qq$ python3 \
/tmp/tmp1trt9092/stg25011c1d-092b-426c-88f2-e41c38466df8/script.py \
/tmp/tmp1trt9092/stg25011c1d-092b-426c-88f2-e41c38466df8/file1 \
/tmp/tmp1trt9092/stg8c6e7b28-2820-452b-9e1c-7d77d449eb64/file2 \
/tmp/tmp1trt9092/stge5748d29-a793-40f1-82eb-b4fef27988d9/file3 \
/tmp/tmp1trt9092/stg12046ade-d0cc-4e21-a20e-c1eac687214c/file4 \
/tmp/tmp1trt9092/stg27c348a7-7d82-43fa-958e-34095665202e/file5 \
output/file6
...
However, this approach does not preserve directory structure. For example, if we also add file2 to secondaryFiles:
...
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
      secondaryFiles:
        - class: File
          path: code/file1
        - class: File
          path: code/submodule/file2
...
The execution output shows that although file2 is put in the same directory as script.py and file1, the submodule sub-directory is not preserved.
...
[job tool.cwl] /tmp/tmpdz77g6ya$ python3 \
/tmp/tmpq5_zgx9h/stgab7023c7-4c60-4650-9b42-cb976b30d79a/script.py \
/tmp/tmpq5_zgx9h/stgab7023c7-4c60-4650-9b42-cb976b30d79a/file1 \
/tmp/tmpq5_zgx9h/stgab7023c7-4c60-4650-9b42-cb976b30d79a/file2 \
/tmp/tmpq5_zgx9h/stg6c37035c-dfff-46cd-a588-65b0a27443ec/file3 \
/tmp/tmpq5_zgx9h/stg9906bfc5-7c3a-46e6-8da2-4dde78fbc47f/file4 \
/tmp/tmpq5_zgx9h/stge5cf15b3-8a2f-4114-aba8-c604a6c5c6f7/file5 \
output/file6
...
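The flattening seen in this log can be summarized with a small model. Again, this is a sketch of the observed behaviour only; stage_with_secondary is a hypothetical helper and not the runner's implementation.

```python
import posixpath
import uuid


def stage_with_secondary(primary, secondary, staging_root="/tmp/staging"):
    """Place primary and secondary files side by side, keeping only basenames."""
    stage_dir = posixpath.join(staging_root, f"stg{uuid.uuid4()}")
    return {
        path: posixpath.join(stage_dir, posixpath.basename(path))
        for path in [primary, *secondary]
    }


layout = stage_with_secondary(
    "code/script.py", ["code/file1", "code/submodule/file2"]
)
for source, target in layout.items():
    print(source, "->", target)
```

All three files share one staging directory, but code/submodule/file2 is reachable only as file2, so a hardcoded submodule/file2 path in the script would still fail.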
This time, we define the code directory as an input to the tool with no inputBinding; we also remove all secondaryFiles.
tool.cwl
...
  input_1:
    default:
      class: File
      path: code/script.py
    inputBinding:
      position: 1
    type: File
...
  input_8:
    default:
      class: Directory
      path: code
    inputBinding:
    type: Directory
...
The execution result shows that the directory structure for the code directory is preserved:
...
[job tool.cwl] /tmp/tmpb6wp_tyg$ python3 \
/tmp/tmpgy_mugtl/stgf353b6af-5572-49c6-9e63-a8f8cf624ccf/code/script.py \
/tmp/tmpgy_mugtl/stgf353b6af-5572-49c6-9e63-a8f8cf624ccf/code/file1 \
/tmp/tmpgy_mugtl/stgf353b6af-5572-49c6-9e63-a8f8cf624ccf/code/submodule/file2 \
/tmp/tmpgy_mugtl/stg64d907fb-9f7e-49fa-ba95-d0748321c4e0/file3 \
/tmp/tmpgy_mugtl/stg16605204-d5f8-4eb8-9698-10e4e4aa07fc/file4 \
/tmp/tmpgy_mugtl/stgfef5bed7-2c74-422a-878f-ced52fb3c3f9/file5 \
output/file6
...
This is like the previous case, with the code directory additionally listed in the InitialWorkDirRequirement field. This is the same solution that was proposed by @ableuler.
tool.cwl
...
  - class: InitialWorkDirRequirement
    listing:
      - entry: '$({"listing": [], "class": "Directory"})'
        entryname: output
        writable: true
      - $(inputs.input_8)
...
Like before, the code directory hierarchy is preserved, but this time the files are put in the designated output directory:
...
[job tool.cwl] /tmp/tmpsgz5hil2$ python3 \
/tmp/tmpsgz5hil2/code/script.py \
/tmp/tmpsgz5hil2/code/file1 \
/tmp/tmpsgz5hil2/code/submodule/file2 \
/tmp/tmpu0_a62ok/stg4b4db527-0ed6-47db-b356-ff2c87bbb18b/file3 \
/tmp/tmpu0_a62ok/stgfe4cdcd4-4f69-4d77-b71c-ea09e6188418/file4 \
/tmp/tmpu0_a62ok/stg567b03e5-287d-4ec3-9591-d277ce8c43f8/file5 \
output/file6
************************
DESIGNATED OUTPUT DIRECTORY: /tmp/tmpsgz5hil2
DESIGNATED TEMPORARY DIRECTORY: /tmp/tmpsxkksb24
************************
...
Listing the designated output directory shows that a symbolic link to the code directory was created:
$ ls -l /tmp/tmpsgz5hil2
total 4
lrwxrwxrwx 1 mohammad mohammad 54 Jul 26 13:02 code -> /home/mohammad/various/comprehensive-cwl-examaple/code
drwxr-xr-x 2 mohammad mohammad 4096 Jul 26 13:02 output
This means that we can also access output files directly in the source code.
This solution works both with files and directories.
A comprehensive solution should use --input (this is the same as --depends-on in Andreas' comment) with all the files and top-level directories in the current folder. For our example, it would look like this:
renku run --input code --input data --input file5 code/script.py
Let's rewrite script.py to use hardcoded dependencies:
import os

print("************************")
print("DESIGNATED OUTPUT DIRECTORY:", os.getenv("HOME"))
print("DESIGNATED TEMPORARY DIRECTORY:", os.getenv("TMPDIR"))
print("************************")

files = [
    "code/file1",
    "code/submodule/file2",
    "data/files/file3",
    "data/files/file4",
    "file5",
    "output/file6"
]

with open(files[-1], 'w') as output:
    for name in files[0:-1]:
        with open(name) as input:
            for line in input:
                output.write(line)
tool.cwl
arguments: []
baseCommand:
  - python3
class: CommandLineTool
cwlVersion: v1.0
hints: []
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
    inputBinding:
      position: 1
    type: File
  input_2:
    default:
      class: Directory
      path: code
    inputBinding:
    type: Directory
  input_3:
    default:
      class: Directory
      path: data
    inputBinding:
    type: Directory
  input_4:
    default:
      class: File
      path: file5
    inputBinding:
    type: File
  input_5:
    default: output/file6
    type: string
outputs:
  output_0:
    outputBinding:
      glob: $(inputs.input_5)
    type: File
  output_1:
    outputBinding:
      glob: output
    type: Directory
permanentFailCodes: []
requirements:
  - class: InlineJavascriptRequirement
  - class: InitialWorkDirRequirement
    listing:
      - entry: '$({"listing": [], "class": "Directory"})'
        entryname: output
        writable: true
      - $(inputs.input_2)
      - $(inputs.input_3)
      - $(inputs.input_4)
successCodes: []
temporaryFailCodes: []
Executing the tool succeeds with no issue:
$ cwl-runner tool.cwl
/usr/local/bin/cwl-runner 1.0.20181012180214
Resolved 'tool.cwl' to 'file:///home/mohammad/various/comprehensive-cwl-examaple/tool.cwl'
[job tool.cwl] /tmp/tmpimcweimm$ python3 \
/tmp/tmpimcweimm/code/script.py
************************
DESIGNATED OUTPUT DIRECTORY: /tmp/tmpimcweimm
DESIGNATED TEMPORARY DIRECTORY: /tmp/tmp4s_7r2cp
************************
[job tool.cwl] completed success
{
    "output_0": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
        "basename": "file6",
        "class": "File",
        "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
        "size": 30,
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
    },
    "output_1": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output",
        "basename": "output",
        "class": "Directory",
        "listing": [
            {
                "class": "File",
                "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
                "basename": "file6",
                "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
                "size": 30,
                "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
            }
        ],
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output"
    }
}
Final process status is success
To solve the problem with sub-directory inclusion, we can use entryname in InitialWorkDirRequirement entries. This will create the desired directory structure in the designated output directory before creating symlinks to the files/directories. For example, if we have --input data/files/file3 in the command line, the generated CWL tool looks like this:
...
inputs:
  ...
  input_3:
    default:
      class: File
      path: data/files/file3
    inputBinding:
    type: File
...
  - class: InitialWorkDirRequirement
    listing:
      ...
      - entry: $(inputs.input_3)
        entryname: data/files/file3
...
This solution works most of the time, except when there is a write to a hardcoded output, which might not be caught by the cwl-runner (due to symbolic link resolution). This might not be a big deal, since CWL recommends passing outputs as command line arguments. Moreover, we can get around this problem by passing writable: true to each InitialWorkDirRequirement entry, but the drawback is that it always copies all input files to the designated output directory instead of creating symlinks. This is not desirable for large input files.
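Generating such entries from a list of --input paths could be sketched as follows. The function name, the input_N naming scheme and the restriction to file inputs are assumptions made for illustration; this is not Renku's actual generator.

```python
import posixpath


def build_listing(input_paths, writable=False):
    """Return (inputs, listing) for a CommandLineTool, one entry per file path.

    Paths are assumed to be files given relative to the project root.
    """
    inputs, listing = {}, []
    for index, path in enumerate(input_paths, start=1):
        name = f"input_{index}"
        inputs[name] = {
            "type": "File",
            "default": {"class": "File", "path": path},
        }
        # entryname recreates the relative location inside the output directory.
        entry = {
            "entry": f"$(inputs.{name})",
            "entryname": posixpath.normpath(path),
        }
        if writable:
            entry["writable"] = True  # the file is copied instead of linked
        listing.append(entry)
    return inputs, listing
```

Calling build_listing(["data/files/file3"]) yields one listing entry whose entryname is data/files/file3, matching the snippet above; writable=True trades the symlink for a copy, with the cost described for large inputs.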
@erbou I don't think that we're exploiting any loopholes here which will soon be closed. If anything, the secondaryFiles property comes closest to being one, and even that is used in many examples, so I don't expect it to disappear soon.
And yes, we definitely should allow for multiple --depends-on or --input flags.
Thanks @mohammad-sdsc for this comprehensive review of the options that CWL offers us. Also, I wasn't aware of the entryname option, which should be very helpful 👍 .
This was resolved by #598
Is your feature request related to a problem? Please describe.
An output is not marked as outdated if script dependencies have changed.
Describe the solution you'd like
Detect dependencies in the source script and add them to the secondaryFiles section (see https://doc.arvados.org/user/cwl/cwl-style.html).
ProcessRun)
Additional context
Q: How do we want to detect these secondary files (e.g. renku run python script.py ...)?
Q: Do we want users to be able to specify secondary files for a given input file? (renku run --secondary script.py:src/ python script.py)
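One conceivable answer to the first question, for Python scripts only, is static import analysis. The sketch below is an assumption about how detection could work, not something Renku implements; local_import_files is a hypothetical helper that only sees absolute, static imports and resolves them next to the script.

```python
import ast
from pathlib import Path


def local_import_files(script_path):
    """List files next to the script that it imports (static imports only)."""
    script = Path(script_path)
    tree = ast.parse(script.read_text(), filename=str(script))
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module)
    found = []
    for module in sorted(modules):
        relative = Path(*module.split("."))
        # Resolve relative to the script's directory, as `python script.py` does.
        for candidate in (relative.with_suffix(".py"), relative / "__init__.py"):
            path = script.parent / candidate
            if path.is_file():
                found.append(path)
    return found
```

For the earlier example, local_import_files("code/script.py") would report code/myLib/helloWorld.py; data files opened through hardcoded paths are not detected this way and would still need an explicit flag.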