SwissDataScienceCenter / renku-python

A Python library for the Renku collaborative data science platform.
https://renku-python.readthedocs.io/
Apache License 2.0

cwl: support secondary files in inputs #311

Closed jirikuncar closed 5 years ago

jirikuncar commented 5 years ago

Is your feature request related to a problem? Please describe.

An output is not marked as outdated if script dependencies have changed.

Describe the solution you'd like

Detect dependencies in source script and add them to secondaryFiles section.

(see https://doc.arvados.org/user/cwl/cwl-style.html)

Additional context

Q: How do we want to detect these secondary files (e.g. renku run python script.py ...)?

Q: Do we want users to be able to specify secondary files for a given input file? (renku run --secondary script.py:src/ python script.py)

rokroskar commented 5 years ago

I think users would prefer less cryptic naming here... something like renku run --input src/ --output out/ python script.py

ableuler commented 5 years ago

I'm still reading more on CWL, but here's my current status on this issue:

I don't think that this secondaryFile option will allow us to specify generic non-input dependencies. I have the impression that CWL deliberately chooses not to support this and instead encourages users to write their tools in such a way that all inputs are passed in as arguments.

The secondaryFile option seems to be more like a backdoor to enable very frequent but very specific use cases from bio-informatics, where a bunch of secondaryFiles are piggy-backing on an actual input argument. However, secondaryFiles cannot exist on their own.

Allowing users to specify secondaryFiles through Renku would certainly be possible. In this case I would suggest simply assuming that if the primary file has not changed, the secondary files won't have changed either. I think this assumption would be very much in the spirit of how secondary files are specified in CWL. But I'm not sure how useful this feature would currently be.

ableuler commented 5 years ago

Ok, I think the clean way of having non-input dependencies in terms of CWL is to still define these extra dependencies as inputs, but without the inputBinding. As @mohammad-sdsc has pointed out, this will prevent the inputs from being passed to the command line as arguments. In this case the input is still staged for the cwl execution, and the temporary path can be accessed using $(inputs.inputFile.path) and, for example, exported as an environment variable. The code would then have to be written such that it picks up non-input dependencies from environment variables. However, for the example of a Python module which should be importable, it could also be enough to just add the folder containing the to-be-imported module to the PYTHONPATH. I now have a better picture of how a solution to this could be implemented, but I'm still not sure about the use case(s) and the expected behaviour.
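As a rough illustration of the "pick up non-input dependencies from env variables" idea, here is a minimal Python sketch. The variable name CODE_DIR and the helper add_dependency_to_path are hypothetical and chosen only for this example; nothing in CWL or Renku defines them. The sketch assumes the tool description exports the staged directory path under that name.

```python
import os
import sys


def add_dependency_to_path(env_var="CODE_DIR"):
    """Prepend the directory named by env_var to sys.path, if it is set.

    CODE_DIR is a hypothetical variable name; the assumption is that the
    tool description exports the staged input path under this name.
    Returns the directory, or None when the variable is not set.
    """
    code_dir = os.environ.get(env_var)
    if code_dir and code_dir not in sys.path:
        # Make modules inside the staged directory importable.
        sys.path.insert(0, code_dir)
    return code_dir
```

A script would call add_dependency_to_path() before importing its local modules, which has roughly the same effect as adding the folder to PYTHONPATH.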

ableuler commented 5 years ago

Ok, I have another solution which comes closest to the desired general support of relative paths. It consists of defining an entire directory as an input, not binding this input to the command, and finally using the InitialWorkDirRequirement on this input directory. This last requirement causes the input directory to be staged directly in the working directory. Look at this minimal working example:

tool.cwl

cwlVersion: v1.0
class: CommandLineTool
baseCommand: ["python", "code/script.py"]
inputs:
  codeDir:
    type: Directory
outputs: []
requirements:
  InitialWorkDirRequirement:
    listing:
      - $(inputs.codeDir)

job.yml

codeDir:
  class: Directory
  path: code

content of code directory

|____code
| |____script.py
| |____myLib
| | |____helloWorld.py

helloWorld.py

def helloWorld():
    return "Hello World"

script.py

from myLib.helloWorld import helloWorld
print(helloWorld())

This could be invoked with renku run --depends-on ./code .... When building the KG or deciding whether to re-execute a step during renku update, the code directory would just be considered a regular input. The drawback is that neither --depends-on ../../some-dir nor --depends-on ./code/subfolder/some-dir will work, because the cwl runner would stage some-dir directly in the workdir.
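The limitation above can be stated as a small check. This is only a sketch under the assumption that a dependency is staged in the workdir under its basename; keeps_relative_path is a hypothetical helper, not part of Renku.

```python
from pathlib import PurePosixPath


def keeps_relative_path(dependency):
    """Return True if staging by basename leaves the relative path unchanged.

    Assumption: the runner places the dependency directly in the working
    directory under its basename, so only top-level entries keep their path.
    """
    # PurePosixPath drops "." components, so "./code" becomes ("code",).
    parts = PurePosixPath(dependency).parts
    return len(parts) == 1 and parts[0] != ".."
```

With this check, ./code passes, while ../../some-dir and ./code/subfolder/some-dir do not.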

erbou commented 5 years ago

Sounds great. But are we taking advantage of an oversight in CWL (allowing hardcoded inputs) that could be fixed in a future release to enforce best practices?

And can we have multiple --depends-on flags (these could be secondaryFiles) if we want to span several dirs or files? For example, I imagine that requirements.txt and environment.yml would be good to have by default.

m-alisafaee commented 5 years ago

Some CWL Basics (according to CWL Command Line Tool Description, v1.0.2)

A Workflow Platform is where cwl-runner is executing. It interprets the CWL document and sets up the Runtime Environment.

A Runtime Environment is where the command line tool is executing. It is created by cwl-runner before the execution of the command line tool.

The Runtime Environment has a designated output directory into which output files produced by the tool are written. It also has a designated temporary directory, which can be used as a writable scratch directory. When the tool terminates, the contents of this directory are no longer considered accessible. The system temporary directory (i.e. /tmp) is also available to the tool for writing temporary files. The tool cannot write to any directory in the Runtime Environment other than these 3 directories.

The tool is executed in a new, empty environment where only a few environment variables are defined. Two important ones are HOME which points to the absolute path of the designated output directory and TMPDIR which points to the absolute path of the designated temporary directory.

When the execution of the tool starts, the current working directory is set to the designated output directory.

The location of the tool's inputs is not defined by CWL and is implementation-dependent. Moreover, CWL does not define any mechanism (e.g. environment variables) to locate inputs. However, there are two guarantees regarding the relative location of inputs:

To visualize things, consider the following directory tree:

.
├── code
│   ├── file1
│   ├── script.py
│   └── submodule
│       └── file2
├── data
│   └── files
│       ├── file3
│       └── file4
├── file5
├── output
│   └── file6
└── tool.cwl

script.py is a Python implementation of the cat command: it writes the content of all input file arguments except the last one into the last one. It also reads and prints the "HOME" and "TMPDIR" environment variables:

import os
import sys

print("************************")
print("DESIGNATED OUTPUT DIRECTORY:", os.getenv("HOME"))
print("DESIGNATED TEMPORARY DIRECTORY:", os.getenv("TMPDIR"))
print("************************")
assert len(sys.argv) >= 3

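# Concatenate every input file (all arguments but the last) into the last argument.
with open(sys.argv[-1], 'w') as output:
    for name in sys.argv[1:-1]:
        # "source" avoids shadowing the built-in input().
        with open(name) as source:
            for line in source:
                output.write(line)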

The tool.cwl runs python3 code/script.py code/file1 code/submodule/file2 data/files/file3 data/files/file4 file5 output/file6, which concatenates the contents of file1 through file5 and puts the result in output/file6. This is what it looks like in CWL:

arguments: []
baseCommand:
- python3
class: CommandLineTool
cwlVersion: v1.0
hints: []
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
    inputBinding:
      position: 1
    type: File
  input_2:
    default:
      class: File
      path: code/file1
    inputBinding:
      position: 2
    type: File
  input_3:
    default:
      class: File
      path: code/submodule/file2
    inputBinding:
      position: 3
    type: File
  input_4:
    default:
      class: File
      path: data/files/file3
    inputBinding:
      position: 4
    type: File
  input_5:
    default:
      class: File
      path: data/files/file4
    inputBinding:
      position: 5
    type: File
  input_6:
    default:
      class: File
      path: file5
    inputBinding:
      position: 6
    type: File
  input_7:
    default: output/file6
    inputBinding:
      position: 7
    type: string
outputs:
  output_0:
    outputBinding:
      glob: $(inputs.input_7)
    type: File
  output_1:
    outputBinding:
      glob: output
    type: Directory
permanentFailCodes: []
requirements:
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
  listing:
  - entry: '$({"listing": [], "class": "Directory"})'
    entryname: output
    writable: true
successCodes: []
temporaryFailCodes: []

Running with CWL Runner

I used the reference cwl-runner to run the tool.

$ cwl-runner tool.cwl

/usr/local/bin/cwl-runner 1.0.20181012180214
Resolved 'tool.cwl' to 'file:///home/mohammad/various/comprehensive-cwl-examaple/tool.cwl'
[job tool.cwl] /tmp/tmp3o5evbhd$ python3 \
    /tmp/tmpgcpuaxj2/stg5d098d76-6f49-4a65-ab38-aca6f4aaa7c2/script.py \
    /tmp/tmpgcpuaxj2/stge21109cb-71b9-4548-9933-b84284323784/file1 \
    /tmp/tmpgcpuaxj2/stg1dfecd9c-a60c-430e-877d-8b9ab6cfa2c7/file2 \
    /tmp/tmpgcpuaxj2/stgab85c2ab-6535-43c8-b2d2-fef559d5fbe9/file3 \
    /tmp/tmpgcpuaxj2/stg878ad716-a1fc-49ef-af09-0d94484950b6/file4 \
    /tmp/tmpgcpuaxj2/stgcf970a58-3b10-46a2-81b7-1dc16fa2222c/file5 \
    output/file6
************************
DESIGNATED OUTPUT DIRECTORY: /tmp/tmp3o5evbhd
DESIGNATED TEMPORARY DIRECTORY: /tmp/tmpk1b_5maz
************************
[job tool.cwl] completed success
{
    "output_0": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
        "basename": "file5",
        "class": "File",
        "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
        "size": 30,
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
    },
    "output_1": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output",
        "basename": "output",
        "class": "Directory",
        "listing": [
            {
                "class": "File",
                "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
                "basename": "file5",
                "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
                "size": 24,
                "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
            }
        ],
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output"
    }
}
Final process status is success

The cwl-runner creates the runtime environment to execute the tool and prints the command which it uses to run the tool:

...
/tmp/tmp3o5evbhd$ python3 \
    /tmp/tmpgcpuaxj2/stg5d098d76-6f49-4a65-ab38-aca6f4aaa7c2/script.py \
    /tmp/tmpgcpuaxj2/stge21109cb-71b9-4548-9933-b84284323784/file1 \
    /tmp/tmpgcpuaxj2/stg1dfecd9c-a60c-430e-877d-8b9ab6cfa2c7/file2 \
    /tmp/tmpgcpuaxj2/stgab85c2ab-6535-43c8-b2d2-fef559d5fbe9/file3 \
    /tmp/tmpgcpuaxj2/stg878ad716-a1fc-49ef-af09-0d94484950b6/file4 \
    /tmp/tmpgcpuaxj2/stg078bb41c-d670-4e4e-a845-48e086951871/file5 \
    output/file6

As you can see in this command, each input file is put in a different directory regardless of the initial input directory structure. For example, script.py and file1 are in the same directory in the Workflow Platform's filesystem, but in the Runtime Environment one is put in /tmp/tmpgcpuaxj2/stg5d098d76-6f49-4a65-ab38-aca6f4aaa7c2/ and the other in /tmp/tmpgcpuaxj2/stge21109cb-71b9-4548-9933-b84284323784. This means that if script.py accesses file1 directly, we will get a FileNotFoundError when executing this tool.

In general, CWL does not specify whether runners should preserve the directory structure of input files, and we cannot assume that runner implementations will do so.

Also note that the designated output directory, /tmp/tmp3o5evbhd, is different from the directory that contains the inputs, /tmp/tmpgcpuaxj2. This means that we cannot access output files directly in script.py either.
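The effect of per-file staging can be reproduced without a CWL runner. The sketch below only simulates the behaviour seen in the log above (one staging directory per input file); stage_separately and demo are hypothetical helpers and do not reflect how cwl-runner is implemented.

```python
import os
import shutil
import tempfile
import uuid


def stage_separately(paths, staging_root):
    """Copy each file into its own sub-directory of staging_root."""
    staged = {}
    for path in paths:
        target_dir = os.path.join(staging_root, "stg" + str(uuid.uuid4()))
        os.makedirs(target_dir)
        # shutil.copy returns the path of the newly created file.
        staged[path] = shutil.copy(path, target_dir)
    return staged


def demo():
    """Return whether file1 is still found next to the staged script.py."""
    with tempfile.TemporaryDirectory() as project, \
            tempfile.TemporaryDirectory() as staging:
        code = os.path.join(project, "code")
        os.makedirs(code)
        script = os.path.join(code, "script.py")
        file1 = os.path.join(code, "file1")
        for name in (script, file1):
            with open(name, "w") as handle:
                handle.write("content\n")
        staged = stage_separately([script, file1], staging)
        sibling = os.path.join(os.path.dirname(staged[script]), "file1")
        return os.path.exists(sibling)
```

Here demo() returns False: once each file has its own staging directory, a sibling lookup relative to script.py no longer finds file1.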

Using secondaryFiles Field

If we add file1 to the secondaryFiles of script.py (everything else remains the same):

...
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
      secondaryFiles:
        - class: File
          path: code/file1
...

Running the tool shows that script.py and file1 are put in the same directory, meaning that accessing file1 in script.py directly would work this time:

...
[job tool.cwl] /tmp/tmp93wzl5qq$ python3 \
    /tmp/tmp1trt9092/stg25011c1d-092b-426c-88f2-e41c38466df8/script.py \
    /tmp/tmp1trt9092/stg25011c1d-092b-426c-88f2-e41c38466df8/file1 \
    /tmp/tmp1trt9092/stg8c6e7b28-2820-452b-9e1c-7d77d449eb64/file2 \
    /tmp/tmp1trt9092/stge5748d29-a793-40f1-82eb-b4fef27988d9/file3 \
    /tmp/tmp1trt9092/stg12046ade-d0cc-4e21-a20e-c1eac687214c/file4 \
    /tmp/tmp1trt9092/stg27c348a7-7d82-43fa-958e-34095665202e/file5 \
    output/file6
...

However, this approach does not preserve directory structure. For example, if we also add file2 to secondaryFiles:

...
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
      secondaryFiles:
        - class: File
          path: code/file1
        - class: File
          path: code/submodule/file2
...

The execution output shows that although file2 is put in the same directory as script.py and file1, the submodule sub-directory is not preserved.

...
[job tool.cwl] /tmp/tmpdz77g6ya$ python3 \
    /tmp/tmpq5_zgx9h/stgab7023c7-4c60-4650-9b42-cb976b30d79a/script.py \
    /tmp/tmpq5_zgx9h/stgab7023c7-4c60-4650-9b42-cb976b30d79a/file1 \
    /tmp/tmpq5_zgx9h/stgab7023c7-4c60-4650-9b42-cb976b30d79a/file2 \
    /tmp/tmpq5_zgx9h/stg6c37035c-dfff-46cd-a588-65b0a27443ec/file3 \
    /tmp/tmpq5_zgx9h/stg9906bfc5-7c3a-46e6-8da2-4dde78fbc47f/file4 \
    /tmp/tmpq5_zgx9h/stge5cf15b3-8a2f-4114-aba8-c604a6c5c6f7/file5 \
    output/file6
...
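The flattening seen in this log can be summarised as "secondary files land next to the primary file under their basename". A small sketch of that mapping follows; staged_locations is a hypothetical helper that merely mirrors the paths printed above and is not taken from any runner.

```python
import posixpath


def staged_locations(primary, secondary_files, stage_dir):
    """Map each path to where it lands when staged next to the primary file.

    Assumption (based on the log above): only the basename is kept, so
    intermediate directories such as "submodule" are dropped.
    """
    return {
        path: posixpath.join(stage_dir, posixpath.basename(path))
        for path in [primary, *secondary_files]
    }
```

For code/submodule/file2 the result is stage_dir/file2, so a script that opens submodule/file2 relative to its own location would still fail.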

Defining Directories as Inputs

This time, we define the code directory as an input to the tool with no inputBinding; we also remove all secondaryFiles.

tool.cwl

...
  input_1:
    default:
      class: File
      path: code/script.py
    inputBinding:
      position: 1
    type: File
...
  input_8:
    default:
      class: Directory
      path: code
    inputBinding:
    type: Directory
...

The execution result shows that the directory structure for the code directory is preserved:

...
[job tool.cwl] /tmp/tmpb6wp_tyg$ python3 \
    /tmp/tmpgy_mugtl/stgf353b6af-5572-49c6-9e63-a8f8cf624ccf/code/script.py \
    /tmp/tmpgy_mugtl/stgf353b6af-5572-49c6-9e63-a8f8cf624ccf/code/file1 \
    /tmp/tmpgy_mugtl/stgf353b6af-5572-49c6-9e63-a8f8cf624ccf/code/submodule/file2 \
    /tmp/tmpgy_mugtl/stg64d907fb-9f7e-49fa-ba95-d0748321c4e0/file3 \
    /tmp/tmpgy_mugtl/stg16605204-d5f8-4eb8-9698-10e4e4aa07fc/file4 \
    /tmp/tmpgy_mugtl/stgfef5bed7-2c74-422a-878f-ced52fb3c3f9/file5 \
    output/file6
...

Defining Directories as Inputs and Using InitialWorkDirRequirement

This is like the previous case, except that the code directory is additionally listed in the InitialWorkDirRequirement field. This is the same solution that was proposed by @ableuler.

tool.cwl

...
- class: InitialWorkDirRequirement
  listing:
  - entry: '$({"listing": [], "class": "Directory"})'
    entryname: output
    writable: true
  - $(inputs.input_8)
...

As before, the code directory hierarchy is preserved, but this time the files are put in the designated output directory:

...
[job tool.cwl] /tmp/tmpsgz5hil2$ python3 \
    /tmp/tmpsgz5hil2/code/script.py \
    /tmp/tmpsgz5hil2/code/file1 \
    /tmp/tmpsgz5hil2/code/submodule/file2 \
    /tmp/tmpu0_a62ok/stg4b4db527-0ed6-47db-b356-ff2c87bbb18b/file3 \
    /tmp/tmpu0_a62ok/stgfe4cdcd4-4f69-4d77-b71c-ea09e6188418/file4 \
    /tmp/tmpu0_a62ok/stg567b03e5-287d-4ec3-9591-d277ce8c43f8/file5 \
    output/file6
************************
DESIGNATED OUTPUT DIRECTORY: /tmp/tmpsgz5hil2
DESIGNATED TEMPORARY DIRECTORY: /tmp/tmpsxkksb24
************************
...

Listing the designated output directory shows that a soft link to the code directory was created:

$ ls -l /tmp/tmpsgz5hil2
total 4
lrwxrwxrwx 1 mohammad mohammad   54 Jul 26 13:02 code -> /home/mohammad/various/comprehensive-cwl-examaple/code
drwxr-xr-x 2 mohammad mohammad 4096 Jul 26 13:02 output

This means that we can also access output files directly in the source code.

This solution works both with files and directories.
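To see why the symlink makes relative paths work, the staging step can be imitated in plain Python. This is a sketch that assumes a POSIX system where symlinks are permitted; stage_in_workdir and demo are hypothetical names, and the code only imitates the listing shown above rather than reproducing cwl-runner.

```python
import os
import tempfile


def stage_in_workdir(source_dir, workdir):
    """Symlink source_dir into workdir under its own basename."""
    name = os.path.basename(os.path.normpath(source_dir))
    link = os.path.join(workdir, name)
    os.symlink(os.path.abspath(source_dir), link, target_is_directory=True)
    return link


def demo():
    """Read code/submodule/file2 through the link in the working directory."""
    with tempfile.TemporaryDirectory() as project, \
            tempfile.TemporaryDirectory() as workdir:
        os.makedirs(os.path.join(project, "code", "submodule"))
        target = os.path.join(project, "code", "submodule", "file2")
        with open(target, "w") as handle:
            handle.write("hello\n")
        stage_in_workdir(os.path.join(project, "code"), workdir)
        # The original relative path now resolves inside the workdir.
        staged = os.path.join(workdir, "code", "submodule", "file2")
        with open(staged) as handle:
            return handle.read()
```

Because the whole directory is linked under its original name, code/submodule/file2 resolves from the working directory exactly as it does in the project.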

m-alisafaee commented 5 years ago

A comprehensive solution should use --input (this is the same as --depends-on in Andreas' comment) with all the files and top-level directories in the current folder. For our example, it would look like this:

renku run --input code --input data --input file5 code/script.py

Let's rewrite script.py to use hardcoded dependencies:

import os

print("************************")
print("DESIGNATED OUTPUT DIRECTORY:", os.getenv("HOME"))
print("DESIGNATED TEMPORARY DIRECTORY:", os.getenv("TMPDIR"))
print("************************")

files = [
    "code/file1",
    "code/submodule/file2",
    "data/files/file3",
    "data/files/file4",
    "file5",
    "output/file6"
]

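# Concatenate every hardcoded input (all entries but the last) into the last entry.
with open(files[-1], 'w') as output:
    for name in files[0:-1]:
        # "source" avoids shadowing the built-in input().
        with open(name) as source:
            for line in source:
                output.write(line)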

tool.cwl

arguments: []
baseCommand:
- python3
class: CommandLineTool
cwlVersion: v1.0
hints: []
inputs:
  input_1:
    default:
      class: File
      path: code/script.py
    inputBinding:
      position: 1
    type: File
  input_2:
    default:
      class: Directory
      path: code
    inputBinding:
    type: Directory
  input_3:
    default:
      class: Directory
      path: data
    inputBinding:
    type: Directory
  input_4:
    default:
      class: File
      path: file5
    inputBinding:
    type: File
  input_5:
    default: output/file6
    type: string
outputs:
  output_0:
    outputBinding:
      glob: $(inputs.input_5)
    type: File
  output_1:
    outputBinding:
      glob: output
    type: Directory
permanentFailCodes: []
requirements:
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
  listing:
  - entry: '$({"listing": [], "class": "Directory"})'
    entryname: output
    writable: true
  - $(inputs.input_2)
  - $(inputs.input_3)
  - $(inputs.input_4)
successCodes: []
temporaryFailCodes: []

Executing the tool succeeds with no issues:

$ cwl-runner tool.cwl 
/usr/local/bin/cwl-runner 1.0.20181012180214
Resolved 'tool.cwl' to 'file:///home/mohammad/various/comprehensive-cwl-examaple/tool.cwl'
[job tool.cwl] /tmp/tmpimcweimm$ python3 \
    /tmp/tmpimcweimm/code/script.py
************************
DESIGNATED OUTPUT DIRECTORY: /tmp/tmpimcweimm
DESIGNATED TEMPORARY DIRECTORY: /tmp/tmp4s_7r2cp
************************
[job tool.cwl] completed success
{
    "output_0": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
        "basename": "file6",
        "class": "File",
        "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
        "size": 30,
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
    },
    "output_1": {
        "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output",
        "basename": "output",
        "class": "Directory",
        "listing": [
            {
                "class": "File",
                "location": "file:///home/mohammad/various/comprehensive-cwl-examaple/output/file6",
                "basename": "file6",
                "checksum": "sha1$9cc4fcaa003e6d487d24baf4481f2c8b3544265d",
                "size": 30,
                "path": "/home/mohammad/various/comprehensive-cwl-examaple/output/file6"
            }
        ],
        "path": "/home/mohammad/various/comprehensive-cwl-examaple/output"
    }
}
Final process status is success

m-alisafaee commented 5 years ago

To solve the problem with sub-directory inclusion, we can use entryname in InitialWorkDirRequirement entries. This will create the desired directory structure in the designated output directory before creating symlinks to the files/directories. For example, if we have --input data/files/file3 in the command line, the generated CWL tool looks like this:

...
inputs:
  ...
  input_3:
    default:
      class: File
      path: data/files/file3
    inputBinding:
    type: File
...
- class: InitialWorkDirRequirement
  listing:
  ...
  - entry: $(inputs.input_3)
    entryname: data/files/file3
...

This solution works most of the time, except when there is a write to a hardcoded output, which might not be caught by the cwl-runner (due to symbolic link resolution). This might not be a big deal, since CWL recommends passing outputs as command line arguments. Moreover, we can get around this problem by passing writable: true to each InitialWorkDirRequirement entry, but the drawback is that this always copies all input files to the designated output directory instead of creating symlinks, which is not desirable for large input files.
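Generating such entries from the --input paths could look roughly like the sketch below. initial_workdir_listing is a hypothetical helper written for this comment and is not the Renku implementation; it only assembles the entry/entryname/writable keys shown in the CWL snippet above.

```python
def initial_workdir_listing(inputs, writable=False):
    """Build InitialWorkDirRequirement entries from {input_id: relative_path}.

    Each input is staged under its original relative path via entryname.
    With writable=True the runner copies the input instead of linking it,
    which is the trade-off described above.
    """
    listing = []
    for input_id, path in inputs.items():
        entry = {
            "entry": "$(inputs.{})".format(input_id),
            "entryname": path,
        }
        if writable:
            entry["writable"] = True
        listing.append(entry)
    return listing
```

For example, initial_workdir_listing({"input_3": "data/files/file3"}) yields a single entry whose entryname is data/files/file3.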

ableuler commented 5 years ago

Sounds great. But are we taking advantage of an oversight in CWL (allowing hardcoded inputs) that could be fixed in a future release to enforce best practices?

And can we have multiple --depends-on flags (these could be secondaryFiles) if we want to span several dirs or files? For example, I imagine that requirements.txt and environment.yml would be good to have by default.

@erbou I don't think that we're exploiting any loopholes here that will soon be closed. If anything, the secondaryFiles property seems closest to being one, and even that is used in many examples, so I don't expect it to disappear soon. And yes, we should definitely allow multiple --depends-on or --input flags.

ableuler commented 5 years ago

Thanks @mohammad-sdsc for this comprehensive review of the options that CWL offers us. Also, I wasn't aware of the entryname option which should be very helpful 👍 .

rokroskar commented 5 years ago

This was resolved by #598