Open jotasi opened 2 years ago
Hello @jotasi and thanks for your issue.

> Expected Behavior
> When an output of type `File` is defined in a `CommandLineTool`, it should be located directly in the specified `--outdir` even if it is also part of another output of type `Directory` (in that case, two copies should exist). I understand that this might not be desired to avoid having two copies of the file. However, even when you want to avoid the duplication, when renaming the output of type `File` by changing its `basename` in an expression, not only the `basename` but also the `location` and `path` should reflect the new `basename`, as otherwise the output object provided for the file is inconsistent.

Can you give us a bit more background and context on what you are trying to achieve and why you expect/want this other behavior?
About renaming: we could add a flag to normalize the `path`s if there have been any last-minute renamings. This could result in duplication if the same file appeared in multiple places but with different names. I guess the flag could let you specify whether you wanted copies, symbolic links, or hard links.
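
To make the idea concrete, here is a minimal Python sketch of what such a normalization step might do for a single `File` object. It is not part of cwltool: the helper name `normalize_file_object`, its `mode` parameter, and the assumption of POSIX-style `file://` locations are all hypothetical and only illustrate the copy / symbolic link / hard link choice discussed above.

```python
import os
import shutil
from pathlib import Path
from urllib.parse import unquote, urlparse


def normalize_file_object(file_obj, outdir, mode="copy"):
    """Materialize the file as outdir/basename and return an updated File object.

    Hypothetical helper (not a cwltool API). Assumes POSIX paths and
    file:// locations. The original file is left in place.
    """
    # Find the current on-disk source from "path", falling back to "location".
    source = file_obj.get("path") or unquote(urlparse(file_obj["location"]).path)
    source = os.path.abspath(source)
    # abspath (not resolve) so a symlink target is not followed back to source.
    target = os.path.abspath(os.path.join(outdir, file_obj["basename"]))

    if source != target:
        if mode == "copy":
            shutil.copy2(source, target)
        elif mode == "symlink":
            os.symlink(source, target)
        elif mode == "hardlink":
            os.link(source, target)
        else:
            raise ValueError(f"unknown mode: {mode!r}")

    # Return a copy whose path and location agree with the basename.
    updated = dict(file_obj)
    updated["path"] = target
    updated["location"] = Path(target).as_uri()
    return updated
```

With `mode="copy"` the file ends up duplicated (once inside the tracked directory, once directly in the output directory); the link modes avoid the extra storage at the cost of depending on the original file staying in place.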
Hi @mr-c and thanks for the quick reply.
My use case would be that I have a tool that should become part of a pipeline. This tool generates output in a dedicated directory. For the further pipeline, I only need some of the files that are generated in the directory. These I would thus track as outputs of type `File` so that I don't have to search for them in the directory via an expression before being able to provide them to the next step. However, I would also like to track all other files in the output directory (i.e. also those that are not directly necessary for the rest of the pipeline), mainly for reference and potentially debugging. That's why I also want to track the full directory as an output of type `Directory`.
As I use the output(s) of type `File` in the further pipeline, it would be handy to have them directly in the output directory instead of only inside the nested directory, so that they are easier to find when looking at the inputs and outputs of the main steps of the pipeline after execution (the output(s) of this step are also a secondary output of the pipeline).
I would like to do the renaming because the step I'm actually performing is run on multiple samples, and I want the file name to reflect the sample the output originated from for easier attribution.
Setting an additional flag would be fine. In my particular case, the files are reasonably small so I don't really mind making copies, but soft or hard links would be fine as well.
> Hi @mr-c and thanks for the quick reply.

:+1:

> However, I would also like to track all other files in the output directory (i.e. also those that are not directly necessary for the rest of the pipeline), mainly for reference and potentially debugging. That's why I also want to track the full directory as an output of type `Directory`.

For development and debugging purposes, I recommend using `cwltool --cachedir` followed by a path where it will cache the results of all steps (and speed up future executions when you run again).
That sounds quite helpful for local development and debugging. Thanks.
I would probably still like to store the directory as an output though, as I want to run the same workflow with different runners as well, e.g. online on the SBG platform, and would want to store the directory for these executions for future reference (as I might want to look at the additional outputs produced by the tool in future).
Thanks for the context. I agree that this would be nice to have. The fast fix would be to write a script (in the language of your choice) that takes the output JSON that `cwltool` writes to standard output (not standard error) and modifies the output directory contents to your own liking. We would, of course, welcome a pull request to `cwltool` to add the command line flag to natively satisfy your needs.
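
As a rough starting point for such a script, the Python sketch below walks the output JSON and reports every `File` object whose `basename` no longer matches the last segment of its `location`, which is the inconsistency described in this issue. The function names are made up for illustration, and the sketch assumes the usual output object layout (`class`, `basename`, `location`, and `listing` for directories); what to do with the reported files (copy, link, rename) is left open.

```python
import json
from urllib.parse import unquote, urlparse


def iter_files(node):
    """Yield every File object found anywhere in the output structure."""
    if isinstance(node, dict):
        if node.get("class") == "File":
            yield node
        # Recurse into all values; this covers Directory "listing" entries
        # as well as "secondaryFiles" on File objects.
        for value in node.values():
            yield from iter_files(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_files(item)


def find_renamed(outputs):
    """Return File objects whose basename differs from the name in location."""
    mismatched = []
    for file_obj in iter_files(outputs):
        if "location" not in file_obj or "basename" not in file_obj:
            continue
        location_path = unquote(urlparse(file_obj["location"]).path)
        name_in_location = location_path.rsplit("/", 1)[-1]
        if file_obj["basename"] != name_in_location:
            mismatched.append(file_obj)
    return mismatched


def find_renamed_in_stream(stream):
    """Parse output JSON from a text stream (e.g. sys.stdin) and check it."""
    return find_renamed(json.load(stream))
```

One way to use it would be to pipe the standard output of a `cwltool` run into a small wrapper that calls `find_renamed_in_stream(sys.stdin)` and then fixes up each reported file.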
> I would probably still like to store the directory as an output though, as I want to run the same workflow with different runners as well, e.g. online on the SBG platform, and would want to store the directory for these executions for future reference (as I might want to look at the additional outputs produced by the tool in future).

FYI: my memory is that SBG and other cloud based systems store the entire output directory (for at least a while)
https://docs.sevenbridges.com/docs/about-memoization#intermediate-files says they default to 24 hours, with a max of 5 days. So I guess you might want to keep them longer.
For Arvados, this is configured by the `arv:IntermediateOutput` value, which defaults to not deleting intermediates automatically (but that can be done manually, or by setting a timeout value).
Expected Behavior
When an output of type `File` is defined in a `CommandLineTool`, it should be located directly in the specified `--outdir` even if it is also part of another output of type `Directory` (in that case, two copies should exist). I understand that this might not be desired to avoid having two copies of the file. However, even when you want to avoid the duplication, when renaming the output of type `File` by changing its `basename` in an expression, not only the `basename` but also the `location` and `path` should reflect the new `basename`, as otherwise the output object provided for the file is inconsistent.
Actual Behavior
When generating a file within a directory and then tracking both the directory and the file as two separate outputs of one `CommandLineTool` (one of type `Directory` and one of type `File`), the output of type `File` is no longer stored directly within the output directory but only within the directory structure provided by the output of type `Directory` (i.e. only one copy of the file is tracked as output within the directory). This is independent of whether the file is renamed or not. Furthermore, renaming the output of type `File` by changing the `basename` in an `outputEval` expression causes an inconsistent output object, with the `basename` being changed but the `location` and `path` (as well as the actual physical location of the file) still corresponding to the old filename.
Workflow Code
with input
provides the following output:
When removing the output of type `Directory`, the file is directly in the output dir and renamed as expected (same input):
produces:
Full Traceback
N/A
Your Environment