dnanexus-archive / dx-cwl

Import and run CWL workflows on DNAnexus (alpha)
Apache License 2.0

Identical final path names overwriting each other during staging to cwloutputs #20

Open chapmanb opened 6 years ago

chapmanb commented 6 years ago

Geet; I'm running into an issue when running workflows that use files with identical final path names (in unique folders). I know we'd tackled this before for some cases. It happens within a step that doesn't do file downloads (it just rearranges inputs), in case that gives any clues. Here is the job that causes issues:

dx watch job-FGQ432j01VKXfjkf7bvFk338

The inputs start off with 3 validation files with the same final path name, truth_small_variants.vcf.gz, all of which have unique references to the original files in bcbio_resources under unique folders (the 4th sample has a unique validation file name, so it is not relevant):

2018-06-12 19:03:34 batch_for_variantcall STDOUT  u'config__algorithm__validate': [{u'primaryFile': {u'$dnanexus_link': u'file-F8b76100f5vJVGbB6Y9PfYBJ'},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                    u'secondaryFiles': [{u'$dnanexus_link': u'file-F8b76280f5v8g5PgP5fq5yxV'}]},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                   {u'primaryFile': {u'$dnanexus_link': u'file-FGPBZk801VKpfp333zgV03P9'},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                    u'secondaryFiles': [{u'$dnanexus_link': u'file-FGPBf1Q01VKZ4xgZ9x91K11p'}]},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                   {u'primaryFile': {u'$dnanexus_link': u'file-F8b75yj0f5vG869V6X06Y20Y'},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                    u'secondaryFiles': [{u'$dnanexus_link': u'file-F8b76000f5v6J42G6X8V2VgG'}]},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                   {u'primaryFile': {u'$dnanexus_link': u'file-F8b75y00f5v7v6Z56Kkv7XVP'},
2018-06-12 19:03:34 batch_for_variantcall STDOUT                                    u'secondaryFiles': [{u'$dnanexus_link': u'file-F8b75y80f5v81Y3Y6Xq02Xf8'}]}],

Here's what the input files look like for a couple of them to give you an idea of the conflict with Name:

ID                  file-F8b76100f5vJVGbB6Y9PfYBJ
Class               file
Project             project-F541fX00f5v9vKJjJ34gvgbv
Folder              /reference_genomes/hg38/validation/giab-NA24631
Name                truth_small_variants.vcf.gz

ID                  file-F8b75yj0f5vG869V6X06Y20Y
Class               file
Project             project-F541fX00f5v9vKJjJ34gvgbv
Folder              /reference_genomes/hg38/validation/giab-NA24385
Name                truth_small_variants.vcf.gz
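
As a rough illustration of the conflict, the sketch below groups file records by Name and reports any name shared by more than one file ID. The record layout and the `find_name_conflicts` helper are assumptions made for this example, not part of dx-cwl or the DNAnexus API; the IDs, folders and names are copied from the listings above.

```python
from collections import defaultdict

# Hypothetical records mirroring the ID, Folder and Name fields shown above.
files = [
    {"id": "file-F8b76100f5vJVGbB6Y9PfYBJ",
     "folder": "/reference_genomes/hg38/validation/giab-NA24631",
     "name": "truth_small_variants.vcf.gz"},
    {"id": "file-F8b75yj0f5vG869V6X06Y20Y",
     "folder": "/reference_genomes/hg38/validation/giab-NA24385",
     "name": "truth_small_variants.vcf.gz"},
]


def find_name_conflicts(records):
    """Return {name: [file IDs]} for every name used by more than one file."""
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec["name"]].append(rec["id"])
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}


conflicts = find_name_conflicts(files)
for name, ids in conflicts.items():
    print(name, "is shared by", ", ".join(ids))
```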

At the next stage we have the non-unique file name prefixed by the unique file reference:

2018-06-12 19:03:42 batch_for_variantcall STDOUT  u'config__algorithm__validate': [{'class': 'File',
2018-06-12 19:03:42 batch_for_variantcall STDOUT                                    'path': u'file-F8b76100f5vJVGbB6Y9PfYBJ/truth_small_variants.vcf.gz',
2018-06-12 19:03:42 batch_for_variantcall STDOUT                                    'secondaryFiles': [{'class': 'File',
2018-06-12 19:03:42 batch_for_variantcall STDOUT                                                        'path': u'file-F8b76280f5v8g5PgP5fq5yxV/truth_small_variants.vcf.gz.tbi'}]},
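
The path layout in this log can be mimicked with a small sketch: joining the file ID and the file name keeps two files with the same name apart. `staged_path` is a hypothetical helper written for illustration only; how dx-cwl actually builds these paths is not shown in this thread.

```python
import posixpath


def staged_path(file_id, name):
    """Hypothetical helper: place a file under a directory named after its ID."""
    return posixpath.join(file_id, name)


primary = staged_path("file-F8b76100f5vJVGbB6Y9PfYBJ", "truth_small_variants.vcf.gz")
other = staged_path("file-F8b75yj0f5vG869V6X06Y20Y", "truth_small_variants.vcf.gz")

# The basenames are identical, but the full relative paths differ.
print(primary)
print(other)
```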

Then things start to go bad. All of the files get written over the top of each other in cwloutputs since they are not prefixed by a path:

2018-06-12 19:04:54 batch_for_variantcall STDOUT                   u'config__algorithm__validate': {u'basename': u'truth_small_variants.vcf.gz',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'checksum': u'sha1$c65170cdd32abf2ed3108ae09f6166ac4b983116',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'class': u'File',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'location': u'file:///home/dnanexus/cwloutputs/truth_small_variants.vcf.gz',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'path': u'/home/dnanexus/cwloutputs/truth_small_variants.vcf.gz',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'secondaryFiles': [{u'basename': u'truth_small_variants.vcf.gz.tbi',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'checksum': u'sha1$a95db6323b49363dbbe6046059761eeb9c7944c3',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'class': u'File',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'location': u'file:///home/dnanexus/cwloutputs/truth_small_variants.vcf.gz.tbi',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'path': u'/home/dnanexus/cwloutputs/truth_small_variants.vcf.gz.tbi',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'size': 30}],
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'size': 30},

2018-06-12 19:04:54 batch_for_variantcall STDOUT                   u'config__algorithm__validate': {u'basename': u'truth_small_variants.vcf.gz',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'checksum': u'sha1$fb224799e3c4de175ae5988e1be6b12ef2335122',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'class': u'File',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'location': u'file:///home/dnanexus/cwloutputs/truth_small_variants.vcf.gz',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'path': u'/home/dnanexus/cwloutputs/truth_small_variants.vcf.gz',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'secondaryFiles': [{u'basename': u'truth_small_variants.vcf.gz.tbi',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'checksum': u'sha1$77553d21664263b28f19a9c82fe5a46010794a42',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'class': u'File',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'location': u'file:///home/dnanexus/cwloutputs/truth_small_variants.vcf.gz.tbi',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'path': u'/home/dnanexus/cwloutputs/truth_small_variants.vcf.gz.tbi',
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                                         u'size': 30}],
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    u'size': 30},
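
A minimal sketch of the suspected failure mode, assuming outputs are placed in a single directory by basename alone: two files staged under different ID directories are copied into one flat folder, and only the last one survives. The directory names, file contents and the `flatten_into` helper are made up for this example; this is not cwltool's actual staging code.

```python
import os
import shutil
import tempfile


def flatten_into(outdir, paths):
    """Copy each path into outdir by basename only; later copies replace earlier ones."""
    for path in paths:
        shutil.copy(path, os.path.join(outdir, os.path.basename(path)))
    return sorted(os.listdir(outdir))


work = tempfile.mkdtemp()
staged = []
# Two hypothetical inputs with the same file name under unique ID directories.
for file_id, content in [("file-A", "first"), ("file-B", "second")]:
    id_dir = os.path.join(work, file_id)
    os.makedirs(id_dir)
    path = os.path.join(id_dir, "truth_small_variants.vcf.gz")
    with open(path, "w") as handle:
        handle.write(content)
    staged.append(path)

outdir = os.path.join(work, "cwloutputs")
os.makedirs(outdir)
survivors = flatten_into(outdir, staged)
with open(os.path.join(outdir, survivors[0])) as handle:
    surviving_content = handle.read()

print(survivors, surviving_content)
shutil.rmtree(work)
```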

Then the output files reference the same internal file due to this over-writing, and this gets passed on to subsequent steps:

2018-06-12 19:04:54 batch_for_variantcall STDOUT                   u'config__algorithm__validate': {'primaryFile': {u'$dnanexus_link': 'file-F8b75yj0f5vG869V6X06Y20Y'},
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    'secondaryFiles': [{u'$dnanexus_link': 'file-F8b75y80f5v81Y3Y6Xq02Xf8'}]},
[...]
2018-06-12 19:04:54 batch_for_variantcall STDOUT                   u'config__algorithm__validate': {'primaryFile': {u'$dnanexus_link': 'file-F8b75yj0f5vG869V6X06Y20Y'},
2018-06-12 19:04:54 batch_for_variantcall STDOUT                                                    'secondaryFiles': [{u'$dnanexus_link': 'file-F8b75y80f5v81Y3Y6Xq02Xf8'}]},

Thanks much for any pointers about what might be going on here and the best way to avoid it. I've reached a dead end in reading the code at the point where these files appear to get staged through cwltool:

https://github.com/dnanexus/dx-cwl/blob/37290496127ad1b5346d8f630dc8361a146d9eb9/dx-cwl-applet-code.py#L119

I'm not totally sure I understand what happens here and if we can introduce some way to uniquify the outputs.
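
One possible way to uniquify the outputs, sketched under the assumption that each staged file already sits in a directory named after its file ID: keep that parent directory when computing the destination under the output folder. `unique_destinations` is a hypothetical helper for illustration, not code from dx-cwl or cwltool.

```python
import posixpath


def unique_destinations(outdir, staged_paths):
    """Map each staged path to outdir/<parent dir>/<basename> instead of outdir/<basename>."""
    destinations = {}
    for src in staged_paths:
        parent = posixpath.basename(posixpath.dirname(src))
        destinations[src] = posixpath.join(outdir, parent, posixpath.basename(src))
    return destinations


staged = [
    "file-F8b76100f5vJVGbB6Y9PfYBJ/truth_small_variants.vcf.gz",
    "file-F8b75yj0f5vG869V6X06Y20Y/truth_small_variants.vcf.gz",
]
dests = unique_destinations("/home/dnanexus/cwloutputs", staged)
for src, dest in dests.items():
    print(src, "->", dest)
```

Keeping the ID directory means the destinations stay distinct even when the basenames match, at the cost of a deeper output layout.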

Thanks for any pointers or ideas to fix, Brad

geetduggal commented 6 years ago

Hi Brad, thanks for such a detailed report and investigation! Based on your analysis, I think the problem might be that cwltool moves outputs to the cwloutputs directory above, and if they happen to have the same filename they are overwritten (I may be wrong about this, but that is my assumption here :) ). If so, I think --leave-outputs could do the trick, e.g.:

[image not preserved]

I have created this PR with the mod:

https://github.com/dnanexus/dx-cwl/pull/21/files

Could you give it a go and see if it helps? We could iterate from there to see what other issues this may spawn :)

chapmanb commented 6 years ago

Geet; Thanks much for this, it looks like it would do the trick. It sounds like we should pull this one and then I can re-run my failing workflow and test. Do you think we should leave --outdir as well? I wanted to make sure these got staged to a location that made sense space-wise, and it sounds from the docs like these two are complementary. The default is to output to the current directory, so maybe we're good if it's already being run from the right place. Beyond this, I'd suggest pulling it in and we can test away. Thanks again.

geetduggal commented 6 years ago

Hi Brad, thanks! I went ahead and added --outdir back in. I ran a couple of tests and I'm actually not sure what --outdir does if --leave-outputs is provided. Specifically, the behavior seems identical whether or not I place --outdir on this example:

[screenshot of the command lines not preserved]
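
Since the screenshot is not available, here is a hedged sketch of how the two invocations being compared might be assembled. `--leave-outputs` and `--outdir` are the cwltool options discussed in this thread; the `build_cwltool_args` helper and the `workflow.cwl` and `inputs.json` file names are placeholders assumed for this example, and the sketch does not reproduce the actual change in the PR.

```python
def build_cwltool_args(workflow, job_inputs, outdir=None):
    """Assemble a cwltool argument list with --leave-outputs and an optional --outdir."""
    args = ["cwltool", "--leave-outputs"]
    if outdir is not None:
        args += ["--outdir", outdir]
    return args + [workflow, job_inputs]


# Placeholder file names; the real workflow and inputs are not shown in the thread.
without_outdir = build_cwltool_args("workflow.cwl", "inputs.json")
with_outdir = build_cwltool_args("workflow.cwl", "inputs.json", outdir=".")

print(" ".join(without_outdir))
print(" ".join(with_outdir))
```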

chapmanb commented 6 years ago

Thanks for investigating that. I can't quite tell the full command line in the first call as the screenshot only gives me half of it, but it looks like you have --outdir . which is also the default if you don't specify --outdir so that makes good sense and glad they don't clash. +1 for merging this and I'll be happy to run the problematic samples through to test. Thanks again.