Open chapmanb opened 6 years ago
Hi Brad, thanks for such a detailed report and investigation! Based on just looking at your analysis, I think the problem might be that cwltool moves outputs to the cwloutputs directory above, and if they happen to have the same filename they are overwritten (I may be wrong about this, but that is my assumption here :) ). Based on this, I think the --leave-outputs option could do the trick.
I have created this PR with the mod:
https://github.com/dnanexus/dx-cwl/pull/21/files
Could you give it a go and see if it helps? We could iterate from there to see what other issues this may spawn :)
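For readers following along, here is a minimal sketch of how a cwltool command line with these options might be assembled. It is not the change made in the PR: the function name build_cwltool_command and the file names workflow.cwl and job.json are hypothetical, and only the --leave-outputs and --outdir flags come from this discussion.

```python
def build_cwltool_command(workflow, job_inputs, leave_outputs=True, outdir=None):
    """Assemble an argument list for cwltool (hypothetical helper, not dx-cwl code)."""
    cmd = ["cwltool"]
    if leave_outputs:
        # Ask cwltool to leave outputs where each step wrote them
        # instead of moving them into one shared output directory.
        cmd.append("--leave-outputs")
    if outdir is not None:
        # Only pass --outdir when a location is requested; per the thread,
        # the default is the current directory.
        cmd.extend(["--outdir", outdir])
    cmd.extend([workflow, job_inputs])
    return cmd


if __name__ == "__main__":
    # workflow.cwl and job.json are placeholder names for illustration.
    print(" ".join(build_cwltool_command("workflow.cwl", "job.json", outdir=".")))
```

Building the command as a list keeps each flag a separate argument, so it can be handed to subprocess without shell quoting concerns.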
Geet;
Thanks much for this, this looks like it would do the trick. It sounds like we should pull this one and then I can re-run my failing workflow and test. Do you think we should leave --outdir as well? I wanted to make sure these got staged to a location that made sense space-wise, and it sounds from the docs like these two are complementary. The default is to output to the current directory, so maybe we're good if it's already being run from the right place. Beyond this, I'd suggest pulling it in and we can test away. Thanks again.
Hi Brad, thanks! I went ahead and added --outdir back in. I ran a couple of tests and I'm actually not sure what --outdir does if --leave-outputs is provided. Specifically, the behavior seems identical whether or not I pass --outdir on this example:
Thanks for investigating that. I can't quite tell the full command line in the first call as the screenshot only gives me half of it, but it looks like you have --outdir . which is also the default if you don't specify --outdir, so that makes good sense and I'm glad they don't clash. +1 for merging this and I'll be happy to run the problematic samples through to test. Thanks again.
Geet; I'm running into an issue with running workflows using files with identical final path names (in unique folders). I know we'd tackled this before for some cases. It happens within a step that doesn't do file downloads (just rearranging of inputs) in case that gives any clues. Here is the job that causes issues:
The inputs start off with 3 validation files with the same final path name, truth_small_variants.vcf.gz, all of which have unique references to the original files in bcbio_resources under unique folders. (The 4th sample has a unique validation file name so is not relevant):
Here's what the input files look like for a couple of them to give you an idea of the conflict with Name:
At the next stage we have the non-unique file name prefixed by the unique file reference:
Then things start to go bad. All of the files get written over the top of each other in cwloutputs since they are not prefixed by a path:
Then the output files reference the same internal file due to this overwriting, and this gets passed on to subsequent steps:
Thanks much for any pointers about what might be going on here and the best way to avoid it. I've reached a dead end in reading the code at the point where these files appear to get staged through cwltool:
https://github.com/dnanexus/dx-cwl/blob/37290496127ad1b5346d8f630dc8361a146d9eb9/dx-cwl-applet-code.py#L119
I'm not totally sure I understand what happens here and if we can introduce some way to uniquify the outputs.
Thanks for any pointers or ideas to fix, Brad
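The overwrite described above can be reproduced outside of dx-cwl with a small sketch. This is not the dx-cwl staging code: stage_flat, stage_unique, the sampleA/B/C folders and the input_N subfolder naming are all hypothetical, and the placeholder files are plain text rather than real gzipped VCFs. It only illustrates why copying same-named files into one flat directory leaves a single survivor, and how a unique per-file prefix avoids that.

```python
import os
import shutil
import tempfile


def stage_flat(sources, outdir):
    """Copy every file straight into outdir; identical basenames overwrite each other."""
    staged = []
    for src in sources:
        dest = os.path.join(outdir, os.path.basename(src))
        shutil.copy(src, dest)
        staged.append(dest)
    return staged


def stage_unique(sources, outdir):
    """Copy each file into its own numbered subfolder so basenames cannot collide."""
    staged = []
    for index, src in enumerate(sources):
        subdir = os.path.join(outdir, "input_%d" % index)  # hypothetical naming scheme
        os.makedirs(subdir)
        dest = os.path.join(subdir, os.path.basename(src))
        shutil.copy(src, dest)
        staged.append(dest)
    return staged


def demo():
    """Return (distinct flat outputs, distinct uniquified outputs) for three same-named inputs."""
    root = tempfile.mkdtemp()
    try:
        sources = []
        for sample in ["sampleA", "sampleB", "sampleC"]:
            folder = os.path.join(root, "inputs", sample)
            os.makedirs(folder)
            path = os.path.join(folder, "truth_small_variants.vcf.gz")
            with open(path, "w") as handle:
                handle.write(sample)  # placeholder content, not a real VCF
            sources.append(path)
        flat_dir = os.path.join(root, "flat")
        unique_dir = os.path.join(root, "unique")
        os.makedirs(flat_dir)
        os.makedirs(unique_dir)
        flat = stage_flat(sources, flat_dir)
        unique = stage_unique(sources, unique_dir)
        return len(set(flat)), len(set(unique))
    finally:
        shutil.rmtree(root)


if __name__ == "__main__":
    print(demo())
```

With flat staging all three inputs resolve to the same destination path, so later steps see one file; with a per-file subfolder each input keeps its own path.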