Illumina / Pisces

Somatic and germline variant caller for amplicon data. Recommended caller for tumor-only workflows.
GNU General Public License v3.0

Gemini 5.2.11 does not complete - errors for different chromosomes #55

Open kotliary opened 3 years ago

kotliary commented 3 years ago

Gemini (5.2.11 release) does not finish when run on a WES BAM file with multiple chromosomes selected via the --chromosomes argument. Each run fails on a different set of chromosomes; the error code is sometimes 1 and sometimes 134, 137, or 139. What do those codes mean?

If I increase the memory request (I run it on cluster nodes) from 20GB to 64GB, fewer chromosomes fail: about half of them with 20GB, and just 2 with 64GB. With 128GB, 2 chromosomes still fail, the same as with 64GB. The failing chromosomes are different every time, and this does not seem to depend on the order in which they are processed.

BTW, when I run without specifying a memory requirement, all chromosomes give error code 137.
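For context on the codes listed above: by common shell convention on Linux, an exit status above 128 means the process was terminated by a signal whose number is the status minus 128. So 134, 137, and 139 would correspond to SIGABRT (6), SIGKILL (9), and SIGSEGV (11). A SIGKILL on a cluster node is often the scheduler or the kernel's out-of-memory killer, which would be consistent with the memory dependence described here, though that is an inference and not something the logs confirm. A minimal sketch of the decoding, where `describe_exit_status` is a hypothetical helper and not part of Pisces:

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Pisces: interpret an exit status using the
# common shell convention that a signal death is reported as 128 + signal number.
describe_exit_status() {
  local rc=$1 sig name
  if [ "$rc" -eq 0 ]; then
    echo "success"
  elif [ "$rc" -gt 128 ] && [ "$rc" -le 192 ]; then
    # Linux signals are numbered 1..64, so only 129..192 map to a signal.
    sig=$((rc - 128))
    case "$sig" in
      6)  name=SIGABRT ;;   # abort(), e.g. a runtime aborting itself
      9)  name=SIGKILL ;;   # forced kill, e.g. by an out-of-memory killer
      11) name=SIGSEGV ;;   # invalid memory access
      *)  name=unknown ;;
    esac
    echo "killed by $name (signal $sig)"
  else
    echo "exited with error status $rc"
  fi
}

# The four codes reported above.
for code in 1 134 137 139; do
  printf '%s: %s\n' "$code" "$(describe_exit_status "$code")"
done
```

An exit code of 1 falls outside this convention; it is an ordinary application-level failure and needs the task's own log to explain it.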

My command line:

dotnet $PISCES_DIR/bin/GeminiMulti/GeminiMulti.dll \
  --bam $BAM_FILE \
  --genome hg19/WholeGenomeFasta/ \
  --samtools $SAMTOOLS_PATH \
  --exePath $PISCES_DIR/bin/Gemini/Gemini.dll \
  --outFolder $OUT_DIR \
  --numProcesses 24 \
  --chromosomes chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY

The error:

9/15/20 4:23 PM 1  WARNING:   Processing failed for Gemini_chr2. See error log for details.
9/15/20 4:23 PM 1  WARNING:   Processing failed for Gemini_chr19. See error log for details.
9/15/20 4:23 PM 1  Exception reported:  
System.Exception: Application failed: 2 tasks failed.
   at GeminiMulti.GeminiMultiProcessor.ExecuteChromosomeJobs(ICliTaskManager cliTaskManager, Dictionary`2 chromRefIds, List`1 cmdLineList, String outMultiPath, String taskLogDir, String exePath, List`1 taskDirectories) in C:\Users\gberry\Downloads\Pisces5-release-Pisces_5_2_11_open\Pisces5-release-Pisces_5_2_11_open\src\exe\GeminiMulti\GeminiMultiProcessor.cs:line 228
   at GeminiMulti.GeminiMultiProcessor.Execute(ICliTaskManager cliTaskManager, Dictionary`2 chromRefIds, List`1 cmdLineList, ISamtoolsWrapper samtoolsWrapper) in C:\Users\gberry\Downloads\Pisces5-release-Pisces_5_2_11_open\Pisces5-release-Pisces_5_2_11_open\src\exe\GeminiMulti\GeminiMultiProcessor.cs:line 44
   at GeminiMulti.Program.ProgramExecution() in C:\Users\gberry\Downloads\Pisces5-release-Pisces_5_2_11_open\Pisces5-release-Pisces_5_2_11_open\src\exe\GeminiMulti\Program.cs:line 66
   at CommandLine.Application.BaseApplication`1.Execute() in C:\Users\gberry\Downloads\Pisces5-release-Pisces_5_2_11_open\Pisces5-release-Pisces_5_2_11_open\src\lib\CommandLine.Options\BaseApplication.cs:line 126.
9/15/20 4:23 PM 1  ******************** Ending ********************* 

There are no errors in the logs for the individual chromosomes.

The minimum set of chromosomes I could successfully run is chr20,chr21,chr22,chrX,chrY. If I add one more chromosome it fails.

I have also tried the previous version, but had the same problem.

If you need an example BAM file, I can provide one, but the files are big, over 20GB.

tamsen commented 3 years ago

Hi. I can take a look at this. Does it work OK when you don't specify chromosomes?

tamsen commented 3 years ago

Just confirming: I'm finding some issues with the multiprocessor CLI in 5.2.11. I'm trying to fix them, and I'll send an update when I have a patch.

tamsen commented 3 years ago

Hi!

I have a new release here with a few fixes (https://github.com/tamsen/Pisces/releases/tag/v5.3.0.0). It should help with your issue (or at the very least, expose enough logging to help). Can you please download it and give it a try? When you run GeminiMulti, it should create two folders, GeminiChromosomeLogs and GeminiMultiLogs. If the error still occurs, we can hopefully see the issue in those logs. At that point, we can drill down to the problem specific to your BAM.

Note, the new command line is as below, and does NOT require dotnet or the ".dll" extension.

/new/pisces/path/pisces_all/GeminiMulti \
  --bam my.bam \
  --genome /Genomes/HomoSapiens/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta \
  --samtools samtools \
  --exePath /new/pisces/path/pisces_all/Gemini \
  --outFolder out \
  --numProcesses 24 \
  --chromosomes chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY
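The long --chromosomes value is easy to mistype, so it may help to build it in the shell. This is only a sketch: `CHROMS` is an illustrative variable name, and it assumes the chr1..chr22, chrX, chrY naming used in this thread.

```shell
#!/usr/bin/env bash
# Build "chr1,chr2,...,chr22,chrX,chrY" instead of typing it by hand.
# CHROMS is an illustrative variable name; adjust the names if your BAM
# header uses a different convention (e.g. "1" rather than "chr1").
CHROMS=$(printf 'chr%s,' $(seq 1 22) X Y)   # printf repeats the format per argument
CHROMS=${CHROMS%,}                          # drop the trailing comma
echo "$CHROMS"
```

The result can then be passed as `--chromosomes "$CHROMS"` in place of the literal list.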

Hope this helps, Tamsen

Manuel-DominguezCBG commented 7 months ago

I have the same problem here. I am using v5.3.0.0.

This works:


module load dotnet/2.0.3

pisces_all/pisces_all/GeminiMulti \
                                    -bam p5p7_FINAL_mapped.bam   \
                                    --exePath pisces_all/pisces_all/Gemini \
                                    -genome v0_masked_new_15092023_4_RACP-PISCES \
                                    -samtools  /local/software/samtools/1.16.1/bin/samtools \
                                    --outFolder Results \
                                    --numProcesses  5 \

but when I add this

--chromosomes chr1,chr2,chr3,chr4,chr5,chr6,chr7,chr8,chr9,chr10,chr11,chr12,chr13,chr14,chr15,chr16,chr17,chr18,chr19,chr20,chr21,chr22,chrX,chrY

I don't get any error message (unlike kotliary), but the job ends as FAILED (exit code 201). What does this mean?

These are the last lines of the log file:


Time: 00:00:58.9
2/20/24 8:39 AM 4  PROCESS Gemini_chrX: ExitCode: 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr1 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr2 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr3 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr4 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr5 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr6 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr7 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr8 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr9 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr10 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr11 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr12 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr13 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr14 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr15 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr16 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr17 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr18 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr19 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr20 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr21 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chr22 with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chrX with exit code 0.
2/20/24 8:39 AM 1  Completed task Gemini_chrY with exit code 0.
2/20/24 8:39 AM 1  Completed 24 tasks.
2/20/24 8:39 AM 1  Calling samtools cat on 24 files to create p5p7_FINAL_mapped.PairRealigned.bam.
2/20/24 8:39 AM 1  Calling final samtools cat on 24 bams with output at p5p7_FINAL_mapped.PairRealigned.bam.
2/20/24 8:39 AM 1  Calling samtools index on p5p7_FINAL_mapped.PairRealigned.bam.
2/20/24 8:39 AM 1  Done finalizing bam.
2/20/24 8:39 AM 1  Consolidating log files.
2/20/24 8:39 AM 1  Deleting intermediate files.
2/20/24 8:39 AM 1  Done cleaning up.
2/20/24 8:39 AM 1  ******************** Ending *********************

Time: 00:02:31.0
---------------------------------------------------------------------------
Pisces Software                                  GNU GENERAL PUBLIC LICENSE
https://github.com/tamsen/Pisces                                    5.3.0.0
---------------------------------------------------------------------------

Please reference 'Tamsen Dunn, Gwenn Berry, Dorothea Emig-Agius, Yu Jiang, Serena Lei, Anita Iyer, Nitin Udar, Han-Yu Chuang, Jeff Hegarty, Michael Dickover, Brandy Klotzle, Justin Robbins, Marina Bibikova, Marc Peeters, Michael Strömberg, Pisces: an accurate and versatile variant caller for somatic and germline next-generation sequencing data, Bioinformatics, Volume 35, Issue 9, 1 May 2019, Pages 1579–1581, https://doi.org/10.1093/bioinformatics/bty849'

---------------------------------------------------------------------------

USAGE: dotnet GeminiMulti.dll  --bam <bam path> --genome <genome path> --samtools <samtools path> --outFolder <output path> --numProcesses 20 --exePath <path to gemini subprocess>
GeminiMulti: pair-aware indel realigner and read stitcher

REQUIRED:
      --bam <PATH>           PATH to the original bam file. (Required).
      --genome <PATH>        PATH to the genome directory. (Required).
      --samtools <PATH>      PATH to the samtools executable. (Required).
      --numprocesses <INT>   INT indicating the number of Gemini subprocesses
                               to run. (Required).
      --exepath <PATH>       PATH to the executable file for the Gemini
                               subprocess. (Required).
      --outfolder <PATH>     PATH of directory in which to create the new bam
                               file. (Required).

COMMON:
      --samtoolsoldstyle <BOOL>
                             BOOL Whether the provided samtools executable is
                               the old version that uses an output prefix
                               rather than an explicit '-o' output option
                               (http://www.htslib.org/doc/samtools-1.1.htm).
                               Default: false.
      --keepbothsidesoftclips <BOOL>
                             BOOL Whether to trust that both-side softclips
                               are probe and should stay softclipped. Default:
                               false.
      --trustsoftclips <BOOL>
                             BOOL Whether to trust softclips. If true, having
                               softclips doesn't automatically trigger indel
                               realignment. Also, won't try to stitch the
                               softclips. Default: false.
      --keepprobe <BOOL>     BOOL Whether to trust that probe-side softclips
                               are probe and should stay softclipped. Default:
                               false.
      --remaskmessysoftclips <BOOL>
                             BOOL If true, read-ends that were originally
                               softclipped and are still highly mismatching to
                               reference after realignment are re-softclipped,
                               even if not configured to keep probe softclips.
                               If false, only N-softclips are remasked when not
                               keeping probe softclips.  Default value is false.
      --stitchonly <BOOL>    BOOL Whether to only perform stitching, skipping
                               realignment.
      --realignonly <BOOL>   BOOL Whether to only perform realignment,
                               skipping stitching.
      --help, -h             displays the help menu
      --version, -v          displays the version

GEMINI_MULTI:
      --multiprocess <BOOL>  BOOLWhether to use multi-process, as opposed to
                               multi-thread, processing for each chromosome.
                               Default: true.
      --chromosomes <LIST>   LISTComma-separated list of chromosomes to
                               process, if only processing particular
                               chromosomes. Default: empty (all chromosomes
                               will be processed).

STITCHING:
      --minbasecallquality <INT>
                             INT Cutoff for which, in case of a stitching
                               conflict, bases with qscore less than this value
                               will automatically be disregarded in favor of
                               the mate's bases.
      --nifydisagreement <BOOL>
                             BOOL Whether or not to turn high-quality
                               disagreeing overlap bases to Ns. Default: false.
      --maxreadlength <INT>  INT Maximum expected length of individual reads,
                               used to determine the maximum expected stitched
                               read length (2*len - 1). For optimal performanc-
                               e, set as low as appropriate (i.e. the actual
                               single-read length + max deletion length you
                               expect to stitch) for your data. Default: 1024.
      --dontstitchrepeatoverlap <BOOL>
                             BOOL Whether to not stitch read pairs whose only
                               overlap is a repeating sequence. Default: true.
      --ignorereadsabovemaxlength <BOOL>
                             BOOL Whether to passively ignore read pairs that
                               would be above the max stitched length (e.g.
                               extremely long deletions). Default: false.
      --countnstowarddisagreeingbases <BOOL>
                             BOOL Whether to count overlapping-base
                               disagreements where one of the mates reports an
                               'N' as a full-force disagreement (ie Nify the
                               base if configured to do so, and count toward
                               the number of disagreements in determining
                               whether the stitching result should be rejected-
                               ). Default: false.
      --maxnumdisagreeingstitchedbases <INT>
                             INT Maximum number of stitched bases that can
                               disagree between the two reads before a stitched
                               read is rejected. Default: int.MaxValue
      --stringtagstokeepfromr1 <LIST>
                             LIST Comma-delimited list of string tags to
                               retain from read 1 when stitching. Default: none.

READ_FILTERING:
      --skipandremovedups <BOOL>
                             BOOL Whether to skip and remove duplicates.
                               Default: True.
      --minmapquality <INT>  INT Reads pairs with map quality less than this
                               value should be filtered. If only one mate in a
                               pair has a low map quality, it is treated as
                               Split (or derivations thereof). Should not be
                               negative. Default: 1.
      --filterforproperpairs <BOOL>
                             BOOL Whether reads marked as not proper pairs
                               shall be filtered. Default: false.
      --treatabnormalorientationasimproper <BOOL>
                             BOOL Whether to treat non-F1R2/F2R1 read pairs
                               as improper even if flagged as properly paired.
                               Default: False.

REALIGNMENT:
      --maskpartialinsertion <BOOL>
                             BOOL Option to softclip a partial insertion at
                               the end of a realigned read (a complete but un-
                               anchored insertion is allowed).  Default: false.
      --minimumunanchoredinsertionlength <INT>
                             INT Minimum length of an unanchored insertion (-
                               i.e. no flanking reference base on one side)
                               allowed in a realigned read. Insertions shorter
                               than the specified length will be softclipped.
                               Default value is 0, i.e. allowing unanchored
                               insertions of any length.
      --softclipunknownindels <BOOL>
                             BOOL Whether to softclip out unknown indels.
                               Default: false.
      --checksoftclipsformismatches <BOOL>
                             BOOL Whether to count mismatches in softclips
                               toward total mismatches. Default: false.
      --trackmismatches <BOOL>
                             BOOL Whether to track and compare mismatches
                               when realigning. Default: false.
      --categoriestorealign <LIST>
                             LIST Category names that should be attempted to
                               realign. Default: ImperfectStitched,FailStitc-
                               h,UnstitchIndel,Unstitchable,Disagre-
                               e,MessyStitched,MessySplit,UnstitchImperfec-
                               t,LongFragment,UnstitchMess-
                               y,UnstitchForwardMessy,UnstitchReverseMess-
                               y,UnstitchForwardMessyInde-
                               l,UnstitchReverseMessyInde-
                               l,UnstitchMessySuspiciousRea-
                               d,UnstitchMessyIndelSuspiciousRea-
                               d,UnstitchMessySuspiciousMd
      --categoriestosnowball <LIST>
                             LIST Category names that should be attempted to
                               snowball. Default: none.
      --pairawareeverything <BOOL>
                             BOOL Whether to pass everything through pair
                               aware realignment, or just the expected
                               categories (Disagree, FailStitch, UnstitchIndel-
                               ). Default: false.
      --forcehighlikelihoodrealigners <BOOL>
                             BOOL Whether to force realignment in high-
                               likelihood categories even if the neighborhood
                               would not have been eligible for realignment.
                               Default: false.

INDEL_FILTERING:
      --minpreferredsupport <INT>
                             INT Instances of a found variant before it can
                               be considered to realign around. Default: 3.
      --minpreferredanchor <INT>
                             INT Minimum anchor around indel to count an
                               observation toward good evidence. Default: 1.
      --minrequiredindelsupport <INT>
                             INT Don't even allow otherwise strong indels
                               that we attempt to rescue in if they have num
                               observations below this. Default: 0.
      --minrequiredanchor <INT>
                             INT Don't even allow otherwise strong indels
                               that we attempt to rescue in if they have min
                               anchor below this. Default: 0.
      --maxmessthreshold <INT>
                             INT Don't allow indels with average mess above
                               this value. Default: 20.
      --binsize <INT>        INT Size of bin within which to consider indels
                               overlapping and eligible for pruning. Default: 0
                               (do not clean up).
      --requirepositiveoutcomeforsnowball <BOOL>
                             BOOL Whether to filter out indels that did not
                               have any realignment attempts at all during
                               snowballing (stricter than base level of
                               filtering indels that had failed realignment
                               attempts). Default: True.

REALIGNMENT_BINS:
      --messysitethreshold <INT>
                             INT Minimum (raw) number of messy-type reads
                               that must be present in a neighborhood for it to
                               be considered messy and a potential realignable
                               neighborhood. Must also meet the frequency
                               thresholds. Default: 1.
      --messysitewidth <INT> INT Neighborhood width to use when binning
                               realignment eligibility signals. Default: 500.
      --collectdepth <BOOL>  BOOL When collecting realignment eligibility
                               signals, whether to collect depth to gauge
                               frequency information. Default: True.
      --imperfectfreqthreshold <FLOAT>
                             FLOAT Proportion of imperfect reads in bin below
                               which we should not bother to realign. Should be
                               proportional to detection limit and bin width.
                               Default: 0.03.
      --indelregionfreqthreshold <FLOAT>
                             FLOAT Proportion of imperfect reads in bin below
                               which we should not bother to realign. Should be
                               proportional to detection limit and bin width.
                               Default: 0.01.
      --regiondepththreshold <INT>
                             INT When collecting realignment eligibility
                               signals and depth, minimum total number of reads
                               in a neighborhood below which the neighborhood
                               would be ineligible for realignment. Default: 5.
      --recalculateusablesitesaftersnowball <BOOL>
                             BOOL Whether to recalculate site usability after
                               snowballing. Default: True.

PROCESSING:
      --readcachesize <INT>  INT Batch size. Default: 1000.
      --regionsize <INT>     INT Size of genomic region to process at one
                               time. Appropriate setting depends upon read
                               depth, density and available memory. Default:
                               10000000.
      --numconcurrentregions <INT>
                             INT Number of concurrent regions to hold in
                               memory/process at once. Default: 1.
      --maxnumthreads <INT>  INT Maximum number of threads per process.
                               Default: 1.

READ_SILENCING:
      --directionalmessthreshold <FLOAT>
                             FLOAT Proportion of directionally messy
                               (ForwardMessy or ReverseMessy, etc) reads in
                               neighborhood above which we should silence the
                               affected mates. Default: 0.2.
      --messymapq <INT>      INT Mapping quality of reads below which, when
                               combined with high mismatch/softclips, a read is
                               considered a suspicious/multi-mapping messy rea-
                               d. Default: 30.
      --silencesuspiciousmdreads <BOOL>
                             BOOL Whether to silence read pairs whose MD tags
                               indicate suspicion. Default: False.
      --silencedirectionalmessreads <BOOL>
                             BOOL Whether to silence read mates which are
                               very messy and have clean mates, given that the
                               proportion of such reads in the neighborhood
                               exceeds DirectionalMessThreshold. Default: False.
      --silencemessymapmessreads <BOOL>
                             BOOL Whether to silence read pairs that are
                               messy and have one or both mates with mapping
                               quality below MessyMapq, given that the
                               proportion of such reads in the neighborhood
                               exceeds DirectionalMessThreshold. Default: False.

DEBUG:
      --logregionsandrealignments <BOOL>
                             BOOL Debug option to write region stats to the
                               log. Default: False.
      --lightdebug <BOOL>    BOOL Whether to log minimal debug logging.
                               Default: false.
      --debug <BOOL>         BOOL Whether we should run in debug (verbose)
                               mode. Default: false.
      --keepunmerged <BOOL>  BOOL Whether to keep unmerged bams, for
                               debugging. Default: false.

READ_CLASSIFICATION:
      --numsoftclipstobeconsideredmessy <INT>
                             INT When classifying reads (eg imperfect, messy,
                               directional messy), the min number of softclips
                               that will trigger one of the messy
                               classifications, given that softclips are not to
                               be trusted. Default: 8.
      --nummismatchestobeconsideredmessy <INT>
                             INT When classifying reads (eg imperfect, messy,
                               directional messy), the min number of mismatches
                               that will trigger one of the messy
                               classifications. Default: 3.

5.3.0.0

Some problems were encountered when parsing the command line options:

For a complete list of command line options, type "GeminiMulti -h"
==============================================================================
Running epilogue script on gold51.

Submit time  : 2024-02-20T08:36:35
Start time   : 2024-02-20T08:36:49
End time     : 2024-02-20T08:39:20
Elapsed time : 00:02:31 (Timelimit=20:00:00)

Job ID: 5469291
Cluster: i5
User/Group: mdb1c20/af
State: FAILED (exit code 201)
Cores: 1
CPU Utilized: 00:02:30
CPU Efficiency: 99.34% of 00:02:31 core-walltime
Job Wall-clock time: 00:02:31
Memory Utilized: 2.20 GB
Memory Efficiency: 6.67% of 33.00 GB
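With 24 task lines per run, it can be quicker to check the reported exit codes mechanically than by eye. The sketch below assumes the line format seen in the log above ("Completed task <name> with exit code <code>."); whether failed tasks are reported in that same format is an assumption. `failed_tasks` is a hypothetical helper, not part of Pisces.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a GeminiMulti log on stdin and print every task
# whose reported exit code is non-zero, as "<task name> <exit code>".
# Assumes lines of the form:
#   "<date> <time> <AM/PM> <n>  Completed task <name> with exit code <code>."
failed_tasks() {
  awk '/Completed task .* with exit code/ {
         code = $NF
         sub(/\.$/, "", code)                       # strip the trailing period
         if (code + 0 != 0) print $(NF - 4), code   # field NF-4 is the task name
       }'
}

# Usage (the log path is a placeholder, not a known file name):
#   failed_tasks < path/to/your/GeminiMulti.log
```

For the log shown above, every task reports exit code 0, so this would print nothing; the FAILED state would then have to come from something after the per-chromosome tasks.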

The output looks good to me. I mean, the new BAM is created (I have not checked it in detail):

GeminiChromosomeLogs GeminiMultiLogs p5p7_FINAL_mapped.PairRealigned.bam p5p7_FINAL_mapped.PairRealigned.bam.bai

But I would like to know where the list of indels is. I cannot find it in GeminiChromosomeLogs or GeminiMultiLogs.

Regards,