common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

Inconsistent location and name for renamed output of type File when also included in output of type Directory for CLT #1628

Open jotasi opened 2 years ago

jotasi commented 2 years ago

Expected Behavior

When an output of type File is defined in a CommandLineTool, it should be located directly in the specified --outdir even if it is also part of another output of type Directory (in that case, two copies should exist). I understand that this might not be desired to avoid having two copies of the file. However, even when you want to avoid the duplication, when renaming the output of type File by changing its basename in an expression, not only the basename but also the location and path should reflect the new basename as otherwise the output object provided for the file is inconsistent.

Actual Behavior

When a file is generated inside a directory and both the directory and the file are tracked as separate outputs of one CommandLineTool (one of type Directory and one of type File), the File output is no longer stored directly in the output directory but only inside the directory structure of the Directory output (i.e. only one copy of the file is tracked as output, within the directory). This happens whether or not the file is renamed. Furthermore, renaming the File output by changing its basename in an outputEval expression yields an inconsistent output object: the basename changes, but the location and path (as well as the actual physical location of the file) still correspond to the old filename.

Workflow Code

class: 'CommandLineTool'
cwlVersion: 'v1.2'

requirements:
  - class: 'ShellCommandRequirement'
  - class: 'InlineJavascriptRequirement'

baseCommand: ['mkdir', 'test']

inputs:
  - id: 'dummy'
    type: 'string'

arguments:
  - id: 'connect'
    valueFrom: '&&'
    shellQuote: false
  - 'touch'
  - 'test/test_file.txt'

outputs:
  - id: 'output_dir'
    type: 'Directory'
    outputBinding:
      glob: 'test'
  - id: 'output_file'
    type: 'File'
    outputBinding:
      glob: 'test/test_file.txt'
      outputEval: |-
        ${
            self[0].basename='renamed.txt';
            return self
        }

with input

dummy: 'foobar'

provides the following output:

...
   mkdir test && touch test/test_file.txt
INFO [job mcve_dir_and_rename_file.cwl] completed success
{
    "output_dir": {
        "location": "file://<path_to_output_dir>/test",
        "basename": "test",
        "class": "Directory",
        "listing": [
            {
                "class": "File",
                "location": "file://<path_to_output_dir>/test/test_file.txt",
                "basename": "test_file.txt",
                "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
                "size": 0,
                "path": "<path_to_output_dir>/test/test_file.txt"
            }
        ],
        "path": "<path_to_output_dir>/test"
    },
    "output_file": {
        "location": "file://<path_to_output_dir>/test/test_file.txt",
        "basename": "renamed.txt",
        "class": "File",
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "size": 0,
        "path": "<path_to_output_dir>/test/test_file.txt"
    }
}
INFO Final process status is success

When removing the output of type Directory, the file is directly in the output dir and renamed as expected (same input):

class: 'CommandLineTool'
cwlVersion: 'v1.2'

requirements:
  - class: 'ShellCommandRequirement'
  - class: 'InlineJavascriptRequirement'

baseCommand: ['mkdir', 'test']

inputs:
  - id: 'dummy'
    type: 'string'

arguments:
  - id: 'connect'
    valueFrom: '&&'
    shellQuote: false
  - 'touch'
  - 'test/test_file.txt'

outputs:
  - id: 'output_file'
    type: 'File'
    outputBinding:
      glob: 'test/test_file.txt'
      outputEval: |-
        ${
            self[0].basename='renamed.txt';
            return self
        }

produces:

...
    mkdir test && touch test/test_file.txt
INFO [job mcve_rename_file.cwl] completed success
{
    "output_file": {
        "location": "file://<path_to_output_dir>/renamed.txt",
        "basename": "renamed.txt",
        "class": "File",
        "checksum": "sha1$da39a3ee5e6b4b0d3255bfef95601890afd80709",
        "size": 0,
        "path": "<path_to_output_dir>/renamed.txt"
    }
}
INFO Final process status is success
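
The inconsistency shown in the first output object can be checked mechanically. The following is a minimal Python sketch, not part of cwltool: the helper name `find_inconsistent` and the sample paths under `/tmp/out` are made up for illustration. It assumes `file://` locations with POSIX-style paths and reports every File or Directory object whose basename differs from the last segment of its location.

```python
import json
import posixpath
from urllib.parse import unquote, urlparse


def find_inconsistent(obj, found=None):
    """Collect File/Directory objects whose basename does not match their location."""
    if found is None:
        found = []
    if isinstance(obj, dict):
        if (obj.get("class") in ("File", "Directory")
                and "location" in obj and "basename" in obj):
            # Last path segment of the location URI, percent-decoded.
            path = unquote(urlparse(obj["location"]).path).rstrip("/")
            if posixpath.basename(path) != obj["basename"]:
                found.append(obj)
        # Recurse so that entries inside a Directory "listing" are checked too.
        for value in obj.values():
            find_inconsistent(value, found)
    elif isinstance(obj, list):
        for item in obj:
            find_inconsistent(item, found)
    return found


# Hypothetical sample mirroring the shape of the output reported above.
sample = json.loads("""
{
    "output_dir": {
        "class": "Directory",
        "location": "file:///tmp/out/test",
        "basename": "test",
        "listing": [
            {"class": "File",
             "location": "file:///tmp/out/test/test_file.txt",
             "basename": "test_file.txt"}
        ]
    },
    "output_file": {
        "class": "File",
        "location": "file:///tmp/out/test/test_file.txt",
        "basename": "renamed.txt"
    }
}
""")

for item in find_inconsistent(sample):
    print(item["basename"], "does not match", item["location"])
```

Only the renamed `output_file` entry is reported for this sample; the Directory and its listing entry are consistent.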

Full Traceback

N/A

Your Environment

mr-c commented 2 years ago

Hello @jotasi and thanks for your issue

> Expected Behavior
>
> When an output of type File is defined in a CommandLineTool, it should be located directly in the specified --outdir even if it is also part of another output of type Directory (in that case, two copies should exist). I understand that this might not be desired to avoid having two copies of the file. However, even when you want to avoid the duplication, when renaming the output of type File by changing its basename in an expression, not only the basename but also the location and path should reflect the new basename as otherwise the output object provided for the file is inconsistent.

Can you give us a bit more background and context on what you are trying to achieve and why you expect/want this other behavior?

About renaming: we could add a flag to normalize the paths if there have been any last-minute renames; this could result in duplication if the same file appeared in multiple places but with different names. I guess the flag could let you specify whether you wanted copies, symbolic links, or hard links.

jotasi commented 2 years ago

Hi @mr-c and thanks for the quick reply.

My use case would be that I have a tool that should become part of a pipeline. This tool generates output in a dedicated directory. For the further pipeline, I only need some of the files that are generated in the directory. These I would thus track as outputs of type File so that I don't have to search for them in the directory via an expression before being able to provide them to the next step. However, I would also like to track all other files in the output directory (i.e. also those that are not directly necessary for the rest of the pipeline), mainly for reference and potentially debugging. That's why I also want to track the full directory as an output of type Directory.

As I use the output(s) of type File in the further pipeline, it would be handy to have them in the output directory directly, instead of within the directory to be able to find them more easily when looking at the in- and outputs of the main steps of the pipeline after execution (and as the output(s) of this step are also a secondary output of the pipeline).

I would like to rename the file because the step I'm actually performing is run on multiple samples, and I want the file name to reflect the sample the output originated from for easier attribution.

Setting an additional flag would be fine. In my particular case, the files are reasonably small so I don't really mind making copies, but soft or hard links would be fine as well.

mr-c commented 2 years ago

> Hi @mr-c and thanks for the quick reply.

:+1:

> However, I would also like to track all other files in the output directory (i.e. also those that are not directly necessary for the rest of the pipeline), mainly for reference and potentially debugging. That's why I also want to track the full directory as an output of type Directory.

For development and debugging purposes, I recommend using cwltool --cachedir followed by a path where it will cache the results of all steps (and speed up future executions when you run again).

jotasi commented 2 years ago

That sounds quite helpful for local development and debugging. Thanks.

I would probably still like to store the directory as an output, though, as I want to run the same workflow with different runners as well, e.g. online on the SBG platform, and would want to store the directory for these executions for future reference (as I might want to look at the additional outputs produced by the tool in the future).

mr-c commented 2 years ago

Thanks for the context. I agree that this would be nice to have. The fast fix would be to write a script (in the language of your choice) that takes the output JSON from the cwltool standard output (not standard error) and modifies the output directory contents to your liking. We would, of course, welcome a pull request to cwltool to add a command line flag that natively satisfies your needs.
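
A minimal sketch of such a post-processing script in Python follows. It is an illustration under stated assumptions, not cwltool functionality: the names `place_file` and `normalize_outputs` and the `mode` parameter are hypothetical, locations are assumed to be `file://` URIs with POSIX paths, and only top-level outputs of class File (or arrays of them) are handled. Each such file is copied or linked into the output directory under its basename, and location and path are updated to match.

```python
import os
import pathlib
import shutil
from urllib.parse import unquote, urlparse


def place_file(src, dest, mode):
    """Materialize src at dest as a copy, symbolic link, or hard link."""
    if mode == "copy":
        shutil.copy2(src, dest)
    elif mode == "symlink":
        os.symlink(src, dest)
    elif mode == "hardlink":
        os.link(src, dest)
    else:
        raise ValueError(f"unknown mode: {mode!r}")


def normalize_outputs(outputs, outdir, mode="copy"):
    """Put every top-level File output at outdir/<basename> and fix its paths.

    `outputs` is the parsed output object printed by cwltool on stdout.
    Files nested in a Directory listing are left untouched.
    """
    outdir = os.path.abspath(outdir)
    for value in outputs.values():
        items = value if isinstance(value, list) else [value]
        for item in items:
            if not isinstance(item, dict) or item.get("class") != "File":
                continue
            src = os.path.abspath(unquote(urlparse(item["location"]).path))
            dest = os.path.join(outdir, item["basename"])
            if src == dest:
                continue  # already where its basename says it should be
            if os.path.lexists(dest):
                # Refuse to overwrite; a real tool would need a policy here.
                raise FileExistsError(dest)
            place_file(src, dest, mode)
            item["location"] = pathlib.Path(dest).as_uri()
            item["path"] = dest
    return outputs
```

One way to use it is to pipe the cwltool standard output into a small wrapper that calls `json.load(sys.stdin)`, passes the result to `normalize_outputs` together with the directory given to `--outdir`, and prints the updated object with `json.dump`. The original file inside the tracked directory is kept, so the Directory output stays valid.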

> I would probably still like to store the directory as an output, though, as I want to run the same workflow with different runners as well, e.g. online on the SBG platform, and would want to store the directory for these executions for future reference (as I might want to look at the additional outputs produced by the tool in the future).

FYI: my memory is that SBG and other cloud-based systems store the entire output directory (for at least a while).

https://docs.sevenbridges.com/docs/about-memoization#intermediate-files says they default to 24 hours, with a maximum of 5 days. So I guess you might want to keep them longer.

For Arvados, this is configured by the arv:IntermediateOutput value, which defaults to not deleting intermediates automatically (but that can be done manually, or by setting a timeout value).