Hi @emilio-r ,
Just to confirm: when you run the sendsketch scripts manually on these samples, do they produce the desired results without failure?
It's quite curious that we're observing these edge cases intermittently, and I'd like to understand whether it's related to the design of the NF script or an issue with one of the tools we use in the pipeline.
Hello @abhi18av,
We've now used the pipe to rerun all of the samples that failed the first pass. This time, 10/40 failed to make it through to sendsketch but again had good shovill results. They display the same error:
nxf-scratch-dir s23:/scratch/3626937/nxf.fksV2koO4R
Adding /crex/proj/uppstore2017270/common_genomes/Achtman_genomes/rerun/work/conda/env-3b12f71134feb842065aa76592fe3f68/opt/bbmap-38.87-0/resources/blacklist_refseq_merged.sketch to blacklist.
0.059 seconds.
Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files.
0, 0, 0, 10000, true
at sketch.SketchObject.toSketchSize(SketchObject.java:911)
at sketch.SketchHeap.maxLen(SketchHeap.java:152)
at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161)
at sketch.Sketch.<init>(Sketch.java:56)
In this run, only 1/40 samples failed to stage shovill.log in the work directory.
While this was now done with the pipe, I do remember that we have previously been able to run failed samples through sendsketch manually (outside of the pipe) flawlessly. We did this when assessing another large dataset that produced failed sendsketches in BACTpipe.
So I gather that it's an issue with the pipeline, or perhaps with file system latency, since you mentioned that in the second run with the batch of 40 (which failed earlier) you were able to get results for most of them.
Would it be possible for you to share the nextflow.log file for this run?
Are these issues with the pipeline only reproducible with the rackham profile, or with other profiles as well?
@abhi18av, I have sent you the log for the smaller sample now. As for whether the error persists when using other profiles, perhaps @thorellk or @boulund could provide some insight there, as I'm still having issues with my conda setup and am unable to run it anywhere other than the HPC with the rackham profile.
@emilio-r, I have gone through the nextflow.log and wasn't able to find any oddities there.
Please let me know if it'd be possible for you to isolate some samples that always fail (if any) using a local machine - this would greatly help in pinpointing and fixing the bug.
In case this is still intermittent, I think it's better to containerize the pipeline (https://github.com/ctmrbio/BACTpipe/issues/126) before picking up this issue, to make it reproducible. I'm no conda expert, but I think containers do make things more reproducible.
That way, we'd have common ground in terms of infrastructure and could explore whether this skipping issue arises from specificities of the HPC system or not.
I will run it locally on my laptop overnight and see if I can replicate it, or whether it has to do with some read/write/forwarding issue on the HPC or something else.
Do we have a good test dataset somewhere that is easily accessible? I could also try to run this on a couple of different systems to see if I can replicate the issue.
I now ran it on my laptop and 4/48 samples fail sendsketch, with a log file similar to what @emilio-r gets on our HPC:
nxf-scratch-dir Kaisas-MBP-2.lan:/var/folders/5m/z0cl1bfx1xjdt_17wlr4r2j80000gn/T//nxf.GMq1yBPVmu
/Users/kaisa/BACTpipe_test/work/conda/env-a640826f9d786459503e77925c215de0/opt/bbmap-38.88-0//calcmem.sh: line 75: [: -v: unary operator expected
Max memory cannot be determined. Attempting to use 3200 MB.
If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,
or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.
Adding /Users/kaisa/BACTpipe_test/work/conda/env-a640826f9d786459503e77925c215de0/opt/bbmap-38.88-0/resources/blacklist_refseq_merged.sketch to blacklist.
0.009 seconds.
Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files.
0, 0, 0, 10000, true
at sketch.SketchObject.toSketchSize(SketchObject.java:912)
at sketch.SketchHeap.maxLen(SketchHeap.java:152)
at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161)
at sketch.Sketch.<init>(Sketch.java:56)
at sketch.Sketch.<init>(Sketch.java:52)
at sketch.SketchMakerMini.processInner(SketchMakerMini.java:176)
at sketch.SketchMakerMini.toSketches(SketchMakerMini.java:117)
at sketch.SketchTool.loadSketchesFromSequenceFile(SketchTool.java:377)
at sketch.SketchTool.loadSketchesFromFile(SketchTool.java:344)
at sketch.SketchTool.loadSketches_MT(SketchTool.java:292)
at sketch.SketchTool.loadSketches_MT(SketchTool.java:271)
at sketch.SketchTool.loadSketches_MT(SketchTool.java:261)
at sketch.SendSketch.processRemote(SendSketch.java:314)
at sketch.SendSketch.process(SendSketch.java:307)
at sketch.SendSketch.main(SendSketch.java:55)
Perfect!
@thorellk, would it be possible for you to share these samples with me? I'd like to have a sample set of 10, with these 4 included in the set.
Also, please do share the nextflow command line with the parameters you're using to run the workflow.
What do you think of trying this idea: if there is some kind of issue with the output file from the Python script rename_fasta.py being created inside the same process where sendsketch.sh consumes it, couldn't we move the renaming script to the shovill step? Logically, I think it actually makes more sense to run it as part of the shovill process rather than in the contamination screening process. We already have a line in the shovill process that just copies the contigs file to give it a new name - why not use rename_fasta.py here instead?
For example, if we change this line (https://github.com/ctmrbio/BACTpipe/blob/master/modules/shovill/shovill.nf#L24) to:
rename_fasta.py \
--input ${pair_id}_shovill/contigs.fa \
--output ${pair_id}_contigs.fa \
--pre ${pair_id}_contig
I guess this will need some changes to the environment for the shovill process, as it now needs Python.
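To make the proposal concrete, here is a minimal sketch of how the shovill process script block might look with the rename step folded in. This is illustrative only - the input/output declarations and the exact shovill flags are my assumptions, not the actual BACTpipe source:

```nextflow
process shovill {
    // Sketch: the real process has more inputs, outputs, and directives.
    input:
    tuple val(pair_id), path(reads)

    output:
    tuple val(pair_id), path("${pair_id}_contigs.fa")

    script:
    """
    shovill --R1 ${reads[0]} --R2 ${reads[1]} --outdir ${pair_id}_shovill

    # Rename contigs here instead of copying them in the screening
    # process; note this requires Python in the process environment.
    rename_fasta.py \\
        --input ${pair_id}_shovill/contigs.fa \\
        --output ${pair_id}_contigs.fa \\
        --pre ${pair_id}_contig
    """
}
```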
I implemented my idea from above and tested it; it works well! I accidentally pushed my changes directly to the develop branch without making a pull request, sorry about that!
I'm starting to become pretty sure the issues we've observed with some samples failing are related to scratch staging. Disabling scratch removed all issues for me completely.
Here's the commit if you want to review it https://github.com/ctmrbio/BACTpipe/commit/5135d0fa270a1ad993405590ec0423e009bbf5ff Note that this implies that python needs to be present in the execution environment of the shovill process. Perhaps @emilio-r or @abhi18av can ensure that this works also for the docker containers that have been prepared?
Perhaps @emilio-r or @thorellk could try to run once or twice on Rackham with scratch = false in the rackham profile, just to see if it behaves the same there?
Great that you have managed to narrow down a probable cause for this. I am not sure what the difference between the scratch settings is, though - could you explain that to me?
Setting scratch = false makes the Nextflow process read the input files directly from the work dir on the shared network disk, and write output files directly into the work dir on the shared network disk.
With scratch = true, it instead creates a copy of the work dir in $TMPDIR on the compute node and copies (or symlinks, depending on the stageInMode and stageOutMode settings) the input files there, then copies the output files back to the work dir on the shared network disk. The idea is to reduce the strain on a shared network file system by not reading and writing directly to it.
In practice, reading and writing directly to the network file system is mostly a problem if the processes do a lot of random-access reads and writes. I think the processes in BACTpipe all pretty much do linear reads of files into memory, do some processing, and then perform a linear write to the output file.
More reading in the Nextflow docs https://www.nextflow.io/docs/latest/process.html#scratch
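For reference, a minimal sketch of how the two modes could be expressed in a profile - the config layout is illustrative, only the scratch directive itself is the point:

```nextflow
// nextflow.config (sketch)
profiles {
    rackham {
        // false: processes read/write directly in the shared work dir
        process.scratch = false

        // true: stage files via the node-local scratch area instead
        // process.scratch = true

        // scratch also accepts a directory, e.g. the node's $TMPDIR:
        // process.scratch = '$TMPDIR'
    }
}
```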
Ah, now I recall that last year I was facing issues like that, and they disappeared after I updated the scratch setting in the local config.
This time, I was a bit confused about why it only happens with this particular step - is it the most I/O-intensive one? 🤔
I now ran 10 samples three times on the HPC, with the rackham profile set to scratch = false. In two of the runs all 10 samples passed. In the third run, however, two samples still failed with the usual error:
Adding /crex/proj/uppstore2017270/Emilio/bactpipe-testing/testing_Fs/test/no_scratch_2/work/conda/env-3b12f71134feb842065aa76592fe3f68/opt/bbmap-38.89-0/resources/blacklist_refseq_merged.sketch to blacklist.
0.064 seconds.
Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files.
0, 0, 0, 10000, true
at sketch.SketchObject.toSketchSize(SketchObject.java:912)
at sketch.SketchHeap.maxLen(SketchHeap.java:152)
at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161)
at sketch.Sketch.<init>(Sketch.java:56)
Note that I ran the exact same 10 samples, with the exact same settings, in three parallel instances of BACTpipe. This is as random as it gets...
Just to confirm, @boulund and @emilio-r: you are running this on different HPC systems, right?
Could it be something peculiar to the execution env setup?
While I cannot answer your last question, @abhi18av, I can report some new findings on the matter: after truly stress-testing sendsketch on its own by running it on over 3000 samples on our HPC, I am seeing that roughly 9% fail to be sketched and report the same error as stated here. Also, about 1% pass sendsketch but produce strange output. So, whether the issue only affects our HPC or not, I believe I can now say that it is not due to Shovill, Nextflow, or either of the Python scripts, but to the sendsketch program itself.
Can we add a retry directive to that process just to see if that reduces the overall fail-rate when processing many samples? It sounds like it is most often solved by doing a second run?
I guess this should be considered an expected issue when depending on a tool that requires communications with an external server.
Hmm, this is quite interesting 🤔
Anyway, it makes sense to add the following snippet directly to the base SCREEN... process, instead of any particular profile, to avoid duplicating code:
errorStrategy "retry"
maxRetries 3
Or the following snippet, to keep the entire pipeline from failing:
errorStrategy { task.attempt<=3 ? 'retry' : 'finish' }
This needs to be added here
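To make the placement concrete, here's a sketch of how these directives could sit in the screening process. The process name, inputs, and outputs below are hypothetical placeholders, not the actual BACTpipe source - only the retry directives are the point:

```nextflow
// Sketch only: SCREEN_FOR_CONTAMINANTS and its I/O are illustrative names.
process SCREEN_FOR_CONTAMINANTS {
    // Retry transient sendsketch failures up to 3 times, then let
    // the rest of the pipeline finish instead of aborting everything.
    errorStrategy { task.attempt <= 3 ? 'retry' : 'finish' }
    maxRetries 3

    input:
    tuple val(pair_id), path(contigs)

    output:
    path("${pair_id}.screening_results.tsv")

    script:
    """
    sendsketch.sh in=${contigs} out=${pair_id}.screening_results.tsv
    """
}
```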
Also, @thorellk do you think it's worth mentioning in the docs?
As an aside (purely out of curiosity), @boulund, you mentioned that sendsketch communicates with an external server, and I imagine that adds some pressure to the network when multiple sendsketch jobs operate in parallel. Do you think the networking infrastructure makes a difference between the two HPC systems where the pipeline is being run?
The network requirements for sendsketch are minute. It communicates a sketch (a couple of kB at most) to a remote server, and the result is also very small (the text file we get as output).
Ah, okay understood. Thanks 👍
Hi everyone,
Just wanted to check in and see if you've tried out this suggestion here https://github.com/ctmrbio/BACTpipe/issues/144#issuecomment-775911315 on the HPC cluster.
I wouldn't be able to test this one myself due to the lack of infra and sample size on my end, but I'd be happy to assist in debugging any issues.
I have been off work the past two weeks, and will be working a very limited amount of time going forward because we just had a baby. Just want you to know not to expect me to have much time for anything in the near future 😊
That's great news @boulund ! Congratulations 🎉
Hi @abhi18av, sorry for the delayed answer. I have tried the errorStrategy "retry" / maxRetries 3 approach on the HPC now and, although I am seeing some other issues (timeouts), I cannot rule out that they are due to the HPC itself. I will try running it again a little later, but for now it seems that this may be a good workaround for the sendsketch issues.
I have now tested this some more and find that with this strategy, all 45/45 tested samples pass and produce the desired result in MultiQC. Taken together with the tests I ran yesterday, where all failures are believed to have come from HPC issues, I feel comfortable saying that this approach works well. While the odd sample may still fail all three retries, that should be a negligible portion of samples compared to before.
That's great news, @emilio-r!
I guess we can now move forward with the inclusion of the changes (via PR) and then close this issue. What do you think?
Great @emilio-r! Yes, @abhi18av I agree, let's do that.
The workaround has been merged to develop and the issue can be closed.
When running the pipe (master branch) on a large dataset of 476 samples, we found that roughly 8% (40/476) of the samples went through shovill correctly (all produced results from statswrapper, which runs in parallel with sendsketch) but were not passed correctly over to sendsketch - their work directories are empty - or to anything else downstream.
They produce the same error as reported in #141:
nxf-scratch-dir s81:/scratch/3596560/nxf.Oy84kQyxDT
Adding /crex/proj/uppstore2017270/common_genomes/Achtman_genomes/subset_4/work/conda/env-3b12f71134feb842065aa76592fe3f68/opt/bbmap-38.87-0/resources/blacklist_refseq_merged.sketch to blacklist.
0.110 seconds.
Exception in thread "main" java.lang.AssertionError: BBSketch does not currently support empty files.
0, 0, 0, 10000, true
at sketch.SketchObject.toSketchSize(SketchObject.java:911)
at sketch.SketchHeap.maxLen(SketchHeap.java:152)
at sketch.SketchHeap.toSketchArray_minCount(SketchHeap.java:161)
at sketch.Sketch.<init>(Sketch.java:56)
at sketch.Sketch.<init>(Sketch.java:52)
at sketch.SketchMakerMini.processInner(SketchMakerMini.java:175)
at sketch.SketchMakerMini.toSketches(SketchMakerMini.java:116)
at sketch.SketchTool.loadSketchesFromSequenceFile(SketchTool.java:376)
at sketch.SketchTool.loadSketchesFromFile(SketchTool.java:343)
at sketch.SketchTool.loadSketches_MT(SketchTool.java:291)
at sketch.SketchTool.loadSketches_MT(SketchTool.java:270)
at sketch.SketchTool.loadSketches_MT(SketchTool.java:260)
at sketch.SendSketch.processRemote(SendSketch.java:313)
at sketch.SendSketch.process(SendSketch.java:306)
at sketch.SendSketch.main(SendSketch.java:54)
Another 8% (38/476) of the samples had no shovill output staged in their work directory, but passed through the entire pipe just fine regardless. Could there be some issue in how shovill output is passed to sendsketch that results in all the skipping we've seen?