Shovill-Sendsketch skipping

emilio-r commented 3 years ago

When running the pipe (Master branch) on a large dataset of 476 samples we found that roughly 8% (40/476) of the samples went through shovill correctly (all produced results from the sendsketch-parallel statswrapper) but were not passed correctly over to sendsketch, who have empty work-directories, or anything else downstream.

They produce the same error as reported in #141:

xf-scratch-dir s81:/scratch/3596560/nxf.Oy84kQyxDT Adding /crex/proj/uppstore2017270/common_genomes/Achtman_genomes/subset_4/work/conda/env-3b12f71134feb842065aa76592fe3f68/opt/bbmap-38.87-0/resources/blacklist_refseq_merged.sketch to blacklist. 0.110 seconds. Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files. 0, 0, 0, 10000, true at sketch.SketchObject.toSketchSize(SketchObject.java:911) at sketch.SketchHeap.maxLen(SketchHeap.java:152) at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161) at sketch.Sketch.(Sketch.java:56) at sketch.Sketch.(Sketch.java:52) at sketch.SketchMakerMini.processInner(SketchMakerMini.java:175) at sketch.SketchMakerMini.toSketches(SketchMakerMini.java:116) at sketch.SketchTool.loadSketchesFromSequenceFile(SketchTool.java:376) at sketch.SketchTool.loadSketchesFromFile(SketchTool.java:343) at sketch.SketchTool.loadSketches_MT(SketchTool.java:291) at sketch.SketchTool.loadSketches_MT(SketchTool.java:270) at sketch.SketchTool.loadSketches_MT(SketchTool.java:260) at sketch.SendSketch.processRemote(SendSketch.java:313) at sketch.SendSketch.process(SendSketch.java:306) at sketch.SendSketch.main(SendSketch.java:54)

Another 8% (38/476) of the samples had no shovill output staged in their work-directory, but passed thought the entire pipe just fine regardless. Could there be some issue in how shovill is passed to sendsketch that results in all the skipping we've seen?

abhi18av commented 3 years ago

Hi @emilio-r ,

Just to confirm, when you run the sendsketch scripts manually on these samples, they produce the desired results without failure?

It's quite curious why we are observing edge cases intermittently and would like to understand whether it's related to the design of the NF script or an issue with one of the tools we have used in the pipeline.

emilio-r commented 3 years ago

Hello @abhi18av,

Now we've used the pipe to rerun all those that failed the pass. This time 10/40 failed to make it through to sendsketch, but again had good shovill-results. Again they display this error:

nxf-scratch-dir s23:/scratch/3626937/nxf.fksV2koO4R Adding /crex/proj/uppstore2017270/common_genomes/Achtman_genomes/rerun/work/conda/env-3b12f71134feb842065aa76592fe3f68/opt/bbmap-38.87-0/resources/blacklist_refseq_merged.sketch to blacklist. 0.059 seconds. Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files. 0, 0, 0, 10000, true at sketch.SketchObject.toSketchSize(SketchObject.java:911) at sketch.SketchHeap.maxLen(SketchHeap.java:152) at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161) at sketch.Sketch.(Sketch.java:56) at sketch.Sketch.(Sketch.java:52) at sketch.SketchMakerMini.processInner(SketchMakerMini.java:175) at sketch.SketchMakerMini.toSketches(SketchMakerMini.java:116) at sketch.SketchTool.loadSketchesFromSequenceFile(SketchTool.java:376) at sketch.SketchTool.loadSketchesFromFile(SketchTool.java:343) at sketch.SketchTool.loadSketches_MT(SketchTool.java:291) at sketch.SketchTool.loadSketches_MT(SketchTool.java:270) at sketch.SketchTool.loadSketches_MT(SketchTool.java:260) at sketch.SendSketch.processRemote(SendSketch.java:313) at sketch.SendSketch.process(SendSketch.java:306) at sketch.SendSketch.main(SendSketch.java:54)

This run only 1/40 failed to stage the shovill.log in the work-directory.

While this was now done with the pipe, I do remember that we have previously been able to pass failed samples through manual sendsketch (outside of the pipe), flawlessly. This we did when assessing another large dataset that produced failed sendsketches in BACTpipe.

abhi18av commented 3 years ago

While this was now done with the pipe, I do remember that we have previously been able to pass failed samples through manual sendsketch (outside of the pipe), flawlessly

So, I gather that it's an issue with the pipeline or perhaps with the file system latency since you mentioned that in the second run of the pipeline with the batch of 40 (which failed earlier), you were able to get the results for most of them.

Would it be possible for you to share the nextflow.log file for this?

abhi18av commented 3 years ago

Are these issue with the pipeline only reproducible with rackham profile or with other profiles as well?

emilio-r commented 3 years ago

@abhi18av, I have sent you the log for the smaller sample now. As for if the error persists when using other profiles, perhaps @thorellk or @boulund could provide some insight there, as I'm still having issues with my conda-setup and am unable to run it off from the HPC, with the rackham-profile.

abhi18av commented 3 years ago

@emilio-r , I have gone through the nextflow.log and wasn't able to find any oddities there.

Please let me know if it'd be possible for you to isolate some samples which always fail (if any) using a local machine - this would greatly help in pin-piont and fix the bug.

In case, this is still intermittent, before picking up this issue I think it's better to containarize the pipeline https://github.com/ctmrbio/BACTpipe/issues/126 to make it reproducible. I'm no conda expert here, but I think containers do make it more reproducible.

And then, we could have a somewhat common ground in terms of infrastructure and explore if this skipping issue arises due to some specificities related to the HPC system or not.

thorellk commented 3 years ago

I will run it locally on my laptop over night and see if I can replicate it or if it has to do with some read/write/forwarding issue on the HPC or something else.

boulund commented 3 years ago

Do we have a good test dataset somewhere that is easily accessible? I could also try to run this on a couple of different systems to see if I can replicate the issue

thorellk commented 3 years ago

I now ran it on my laptop and 4/48 samples fails sendsketch with the log file similar to what @emilio-r gets on out HPC.

nxf-scratch-dir Kaisas-MBP-2.lan:/var/folders/5m/z0cl1bfx1xjdt_17wlr4r2j80000gn/T//nxf.GMq1yBPVmu
/Users/kaisa/BACTpipe_test/work/conda/env-a640826f9d786459503e77925c215de0/opt/bbmap-38.88-0//calcmem.sh: line 75: [: -v: unary operator expected
Max memory cannot be determined.  Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, 
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
Adding /Users/kaisa/BACTpipe_test/work/conda/env-a640826f9d786459503e77925c215de0/opt/bbmap-38.88-0/resources/blacklist_refseq_merged.sketch to blacklist.
0.009 seconds.
Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files.
0, 0, 0, 10000, true
    at sketch.SketchObject.toSketchSize(SketchObject.java:912)
    at sketch.SketchHeap.maxLen(SketchHeap.java:152)
    at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161)
    at sketch.Sketch.<init>(Sketch.java:56)
    at sketch.Sketch.<init>(Sketch.java:52)
    at sketch.SketchMakerMini.processInner(SketchMakerMini.java:176)
    at sketch.SketchMakerMini.toSketches(SketchMakerMini.java:117)
    at sketch.SketchTool.loadSketchesFromSequenceFile(SketchTool.java:377)
    at sketch.SketchTool.loadSketchesFromFile(SketchTool.java:344)
    at sketch.SketchTool.loadSketches_MT(SketchTool.java:292)
    at sketch.SketchTool.loadSketches_MT(SketchTool.java:271)
    at sketch.SketchTool.loadSketches_MT(SketchTool.java:261)
    at sketch.SendSketch.processRemote(SendSketch.java:314)
    at sketch.SendSketch.process(SendSketch.java:307)
    at sketch.SendSketch.main(SendSketch.java:55)

abhi18av commented 3 years ago

Perfect!

@thorellk would it be possible for you share these samples with me? I'd like to have a sample set of 10, with these 4 included in the set.

Also, please do share the nextflow command line with parameters you're using to run the workflow.

boulund commented 3 years ago

What do you think of trying this idea:

If there is some kind of issue with the output file from the Python script rename_fasta.py being created inside the same process where sendsketch.sh is consuming it. Couldn't we move the renaming script to the shovill step? Logically, I think it actually makes more sense to run that as part of the shovill process, rather than in the contamination screening process. We already have a line in the shovill process that just copies the contigs file to give it a new name, why not use rename_fasta.py here instead?

For example, if we change this line (https://github.com/ctmrbio/BACTpipe/blob/master/modules/shovill/shovill.nf#L24) to:

    rename_fasta.py \
        --input ${pair_id}_shovill/contigs.fa \
        --output ${pair_id}_contigs.fa \
        --pre ${pair_id}_contig

I guess this will need some changes to the environment for the shovill process, as it now needs python.

boulund commented 3 years ago

I implemented my idea from above and tested it, works well! I accidentally pushed my changes directly to the develop branch without making a pull request, sorry about that!

I'm starting to become pretty sure the issues we've observed with some samples failing are related to scratch staging. Disabling scratch removed all issues for me completely.

boulund commented 3 years ago

Here's the commit if you want to review it https://github.com/ctmrbio/BACTpipe/commit/5135d0fa270a1ad993405590ec0423e009bbf5ff Note that this implies that python needs to be present in the execution environment of the shovill process. Perhaps @emilio-r or @abhi18av can ensure that this works also for the docker containers that have been prepared?

Perhaps @emilio-r or @thorellk could try to run once or twice on Rackham with scratch = false in the rackham profile just to see if it behaves the same there?

thorellk commented 3 years ago

Great that you have managed to narrow down a probable cause for this. I am not sure what the difference between different scratch settings are though, could you explain that to me?

boulund commented 3 years ago

Setting scratch = false makes the nextflow process read the input files directly from the work dir on the shared network disk, and write output files directly into the work dir folder on the shared network disk.

With scratch = true it instead creates a copy of the workdir on the $TMPDIR on the compute node and copies (or symlinks, depends on the stageIn and stageOut settings) the input files there, and then copies the output files back to the work dir on the shared network disk. The idea is to reduce the strain on a shared network file system by not reading and writing directly to that.

In practice, reading and writing directly to the network file system is mostly a problem if the processes do a lot of random access and random writes. I think for the processes in BACTpipe, they all pretty much do linear reads of files into memory, do some processing, and the perform a linear write to the output file.

More reading in the Nextflow docs https://www.nextflow.io/docs/latest/process.html#scratch

abhi18av commented 3 years ago

Ah, now I recall that last year I was facing some issues like that and it disappeared after I updated the scratch in local config.

This time, I was a bit confounded about why it only happens with this particular step - is this the most I/O intensive one?🤔

emilio-r commented 3 years ago

I now ran 10 samples three times on the HPC, with the rackham profile set to scratch = false. In two of the runs all 10 samples passed. In the third run, however, two samples still failed with the usual error:

Adding /crex/proj/uppstore2017270/Emilio/bactpipe-testing/testing_Fs/test/no_scratch_2/work/conda/env-3b12f71134feb842065aa76592fe3f68/opt/bbmap-38.89-0/resources/blacklist_refseq_merged.sketch to blacklist. 0.064 seconds. Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files. 0, 0, 0, 10000, true at sketch.SketchObject.toSketchSize(SketchObject.java:912) at sketch.SketchHeap.maxLen(SketchHeap.java:152) at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161) at sketch.Sketch.(Sketch.java:56) at sketch.Sketch.(Sketch.java:52) at sketch.SketchMakerMini.processInner(SketchMakerMini.java:176) at sketch.SketchMakerMini.toSketches(SketchMakerMini.java:117) at sketch.SketchTool.loadSketchesFromSequenceFile(SketchTool.java:377) at sketch.SketchTool.loadSketchesFromFile(SketchTool.java:344) at sketch.SketchTool.loadSketches_MT(SketchTool.java:292) at sketch.SketchTool.loadSketches_MT(SketchTool.java:271) at sketch.SketchTool.loadSketches_MT(SketchTool.java:261) at sketch.SendSketch.processRemote(SendSketch.java:314) at sketch.SendSketch.process(SendSketch.java:307) at sketch.SendSketch.main(SendSketch.java:55)

Note that i ran the exact same 10 samples, with the exact same settings, in three parallel instances of BACTpipe. This is as random as it gets...

abhi18av commented 3 years ago

Just to confirm, @boulund and @emilio-r you are running this on different HPC systems right?

Could it be something peculiar to the execution env setup?

emilio-r commented 3 years ago

While I cannot answer your last question @abhi18av, I can report some new findings on the matter: After having now truly stress tested sendsketch on its own by running it on over 3000 samples, on our HPC, I am seeing that roughly 9% fail being sketched and report the same error as stated here. Also, about 1% pass sendsketch, but result in strange output. So, whether the issue only affects our HPC or not, I do believe I can now say that it is not due to Shovill, Nextflow or either of the python-scripts but is due to the sendsketch program.

boulund commented 3 years ago

Can we add a retry directive to that process just to see if that reduces the overall fail-rate when processing many samples? It sounds like it is most often solved by doing a second run?

I guess this should be considered an expected issue when depending on a tool that requires communications with an external server.

abhi18av commented 3 years ago

Hmm, this is quite interesting 🤔

Anyways, it makes sense to add the following snippet directly to the base SCREEN... process instead of any particular profile to avoid duplication of code.

errorStrategy "retry"
maxRetries 3

OR the following snippet to keep the entire pipeline from failing


errorStrategy { task.attempt<=3 ? 'retry' : 'finish' }

This needs to be added here

https://github.com/ctmrbio/BACTpipe/blob/36d0bad23d4e29d28e64c90b24e3aa117eae5682/modules/screen_for_contaminants/screen_for_contaminants.nf#L5

Also, @thorellk do you think it's worth mentioning in the docs?

As an aside (only for curiosity) , @boulund you mentioned that sendsketch communicates to an external server and I imagine that it adds some pressure to the network load when multiple sekdsketch are operating in parallel. Do you think that the networking infra makes a difference between the two HPC systems where the pipeline is being run?

boulund commented 3 years ago

The network requirements for sendsketch are minute. It communicates a sketch (a couple of kb at most) to a remote server, then the results is also very small (the text file out get as output).

abhi18av commented 3 years ago

Ah, okay understood. Thanks 👍

abhi18av commented 3 years ago

Hi everyone,

Just wanted to check in and see if you've tried out this suggestion here https://github.com/ctmrbio/BACTpipe/issues/144#issuecomment-775911315 on the HPC cluster.

I wouldn't be able to test this one due to the lack of infra and sample size on my end. Would be happy to assist in the debugging of any issues.

boulund commented 3 years ago

I have been off work the past two weeks, and will be working a very limited amount of time going forward because we just had a baby. Just want you to know not to expect me to have much time for anything in the near future 😊

abhi18av commented 3 years ago

That's great news @boulund ! Congratulations 🎉

emilio-r commented 3 years ago

Hi @abhi18av, sorry for the delayed answer. I have tried the "errorStrategy "retry" maxRetries 3" approach on the HPC now and, although I am seeing some other issues (time outs) I cannot rule out that they are due to the HPC itself. I will try running it again a little later but for now it seems that this may be a good workaround for the sendsketch-issues.

emilio-r commented 3 years ago

I have now tested this some more and find that with this strategy, all 45/45 tested samples pass and produce the desired result in MultiQC. Taken with the tests I ran yesterday, where all failures are believed to have come from HPC issues, I feel comfortable to say that this approach works well. While I can see that the odd sample may still fail all three retries, this should be just a negligible portion of samples, when compared to before.

abhi18av commented 3 years ago

That's great new @emilio-r !

I guess we can now move forward with the inclusion of the changes (via PR) and then close this issue. What do you think?

thorellk commented 3 years ago

Great @emilio-r! Yes, @abhi18av I agree, let's do that.

abhi18av commented 3 years ago

The workaround has been merged to develop and the issue can be closed.

KThorellGroup / BACTpipe

Shovill-Sendsketch skipping #144