fritzsedlazeck / Sniffles

Structural variation caller using third generation sequencing

Updated --help #495

Open ACEnglish opened 4 months ago

ACEnglish commented 4 months ago

Hello,

I found the output of sniffles --help difficult to read. I've refactored it so that the text is narrower, and I've split it into --help and --example (both detailed below). This is technically a breaking change: every parameter that used tobool as its type now uses action="store_false" instead. Any pipeline with hard-coded calls such as sniffles --qc-stdev false would therefore need to be updated to sniffles --qc-stdev.
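
To make the breaking change concrete, here is a minimal argparse sketch of the two styles. The `tobool` helper below is a hypothetical stand-in (the real converter in Sniffles may accept different spellings), and the old default of `True` for `--qc-stdev` is an assumption inferred from the switch to `store_false`.

```python
import argparse


def tobool(value: str) -> bool:
    # Hypothetical stand-in for the old converter; the real one may differ.
    text = value.strip().lower()
    if text in ("true", "t", "yes", "y", "1"):
        return True
    if text in ("false", "f", "no", "n", "0"):
        return False
    raise argparse.ArgumentTypeError(f"not a boolean: {value!r}")


# Old style: the option takes an explicit value (default assumed to be True).
old = argparse.ArgumentParser(prog="sniffles")
old.add_argument("--qc-stdev", type=tobool, default=True)

# New style: the option is a bare switch that turns the check off.
new = argparse.ArgumentParser(prog="sniffles")
new.add_argument("--qc-stdev", action="store_false")

old_args = old.parse_args(["--qc-stdev", "false"])
new_args = new.parse_args(["--qc-stdev"])
new_default = new.parse_args([])
print(old_args.qc_stdev, new_args.qc_stdev, new_default.qc_stdev)

# The old spelling is rejected by the new parser ("unrecognized arguments").
try:
    new.parse_args(["--qc-stdev", "false"])
    old_call_still_works = True
except SystemExit:
    old_call_still_works = False
```

With `store_false`, the option defaults to `True` and the bare flag sets it to `False`, so a leftover `false` value is treated as an unrecognized argument and argparse exits with an error.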

Note that the formatting of the examples below differs from what would be seen in a terminal, because GitHub applies its own formatting.

default `sniffles` output

usage: sniffles --input SORTED_INPUT.bam [--vcf OUTPUT.vcf] [--snf MERGEABLE_OUTPUT.snf] [--threads 4] [--mosaic]

Sniffles2: A fast structural variant (SV) caller for long-read sequencing data
 Version 2.4
 Contact: sniffles@romanek.at

 Use --help for full parameter information
 Use --example for detailed usage information

sniffles: error: the following arguments are required: -i/--input
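
For readers unfamiliar with how a separate `--example` flag can sit next to a required `--input`, the sketch below shows one way to do it with a custom argparse action. It is not the code from this PR: `ExampleAction`, `EXAMPLES`, `build_parser`, and `exit_code` are hypothetical names, and the usage and example strings are shortened.

```python
import argparse

USAGE = "sniffles --input SORTED_INPUT.bam [--vcf OUTPUT.vcf]"
EXAMPLES = (
    "sniffles example commands:\n"
    "\n"
    "Call SVs for a single sample\n"
    "     -> sniffles --input sorted_indexed_alignments.bam --vcf output.vcf\n"
)


class ExampleAction(argparse.Action):
    """Print the example text and exit, similar to the built-in --help."""

    def __init__(self, option_strings, dest=argparse.SUPPRESS,
                 default=argparse.SUPPRESS, help=None):
        # nargs=0 makes this a bare flag that consumes no value.
        super().__init__(option_strings=option_strings, dest=dest,
                         default=default, nargs=0, help=help)

    def __call__(self, parser, namespace, values, option_string=None):
        print(EXAMPLES)
        parser.exit()  # exits with status 0 before required args are checked


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="sniffles", usage=USAGE)
    parser.add_argument("--example", action=ExampleAction,
                        help="Show example usage and exit")
    parser.add_argument("-i", "--input", metavar="IN", nargs="+", required=True)
    return parser


def exit_code(argv):
    """Return the exit status argparse would use, or None if parsing succeeds."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as exc:
        return exc.code
    return None
```

Because the action exits during parsing, `--example` works without `--input`, while calling the parser with no arguments still produces the "required arguments" error shown above.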

`sniffles --help` output

usage: sniffles --input SORTED_INPUT.bam [--vcf OUTPUT.vcf] [--snf MERGEABLE_OUTPUT.snf] [--threads 4] [--mosaic]

Sniffles2: A fast structural variant (SV) caller for long-read sequencing data
 Version 2.4
 Contact: sniffles@romanek.at

 Use --help for full parameter information
 Use --example for detailed usage information

options:
  -h, --help            show this help message and exit
  --example             Show example usage and exit
  --version             show program's version number and exit

Common parameters:
  -i IN [IN ...], --input IN [IN ...]
                        For single-sample calling: A coordinate-sorted and indexed .bam/.cram
                        (BAM/CRAM format) file containing aligned reads. - OR - For multi-sample
                        calling: Multiple .snf files (generated before by running Sniffles2 for
                        individual samples with --snf)
  -v OUT.vcf, --vcf OUT.vcf
                        VCF output filename to write the called and refined SVs to. If the given
                        filename ends with .gz, the VCF file will be automatically bgzipped and a
                        .tbi index built for it.
  --snf OUT.snf         Sniffles2 file (.snf) output filename to store candidates for later multi-
                        sample calling
  --reference REF.fa    (Optional) Reference sequence the reads were aligned against. To enable
                        output of deletion SV sequences, this parameter must be set.
  --tandem-repeats IN.bed
                        (Optional) Input .bed file containing tandem repeat annotations for the
                        reference genome.
  --regions REG.bed     (Optional) Only process the specified regions.
  -c, --contig          (Optional) Only process the specified contigs. May be given more than once.
  --phase               Determine phase for SV calls (requires the input alignments to be phased)
  -t, --threads         Number of parallel threads to use (4)

SV Filtering parameters:
  --minsupport          Min number of supporting reads for a SV to be reported (auto)
  --minsupport-auto-mult
                        Coverage based auto-minsupport multiplier for germline mode (0.1/0.025)
  --minsvlen            Min SV length in bp (50)
  --minsvlen-screen-ratio
                        Min length for SV candidates as fraction of --minsvlen (0.9)
  --mapq                Alignments with mapping quality lower than this value will be ignored
  --no-qc, --qc-output-all
                        Output all SV candidates, disregarding quality control steps
  --qc-stdev            Apply filtering based on SV start position and length standard deviation
  --qc-stdev-abs-max    Max standard deviation for SV length and size in bp (500)
  --qc-strand           Apply filtering based on strand support of SV calls
  --qc-coverage         Min surrounding region coverage of SV calls (1)
  --long-ins-length     Insertion SVs longer than this are subjected to more sensitive filtering
                        (2500)
  --long-del-length     Deletion SVs longer than this are subjected to central coverage drop-based
                        filtering. Not applicable for --mosaic (50000)
  --long-inv-length     Inversion SVs longer than this value are not subjected to central coverage
                        drop-based filtering (10000)
  --long-del-coverage   Long deletions with central coverage higher than this value will be
                        filtered. Not applicable for --mosaic (0.66)
  --long-dup-length     Duplication SVs longer than this value are subjected to central coverage
                        increase-based filtering. Not applicable for --mosaic (50000)
  --qc-bnd-filter-strand
                        Filter breakends that do not have support for both strands
  --bnd-min-split-length
                        Min length of read splits to be considered for breakends (1000)
  --long-dup-coverage   Long duplications with central coverage lower than this value will be
                        filtered. Not applicable for --mosaic (1.33)
  --max-splits-kb       Additional number of splits per kilobase read sequence allowed before reads
                        are ignored (0.1)
  --max-splits-base N   Base number of splits allowed before reads are ignored (3)
  --min-alignment-length
                        Reads with alignments shorter than this length in bp will be ignored
  --phase-conflict-threshold
                        Max fraction of conflicting reads permitted for SV phase information to be
                        labelled as PASS. Only for --phase (0.1)
  --detect-large-ins    Infer insertions that are longer than most reads and therefore are spanned
                        by few alignments only.

SV Clustering parameters:
  --cluster-binsize     Initial screening bin size in bp (100)
  --cluster-r           Multiplier for SV start position standard deviation criterion in cluster
                        merging (2.5)
  --cluster-repeat-h    Multiplier for mean SV length criterion for tandem repeat cluster merging
                        (1.5)
  --cluster-repeat-h-max
                        Max. merging distance based on SV length criterion for tandem repeat cluster
                        merging (1000)
  --cluster-merge-pos   Max. merging distance for insertions and deletions on the same read and
                        cluster in non-repeat regions (150)
  --cluster-merge-len   Max. size difference for merging SVs as fraction of SV length (0.33)
  --cluster-merge-bnd   Max. merging distance for breakend SV candidates (1000)

SV Genotyping parameters:
  --genotype-ploidy     Sample ploidy (2)
  --genotype-error      Estimated false positive rate for leads (0.05)
  --sample-id           Custom ID for this sample (SAMPLE)
  --genotype-vcf IN.vcf
                        Forced calling input.vcf

Multi-Sample Calling / Combine parameters:
  --combine-high-confidence
                        Min fraction of passed QC samples an SV needs (0.0)
  --combine-low-confidence
                        Min fraction of present samples an SV needs (0.2)
  --combine-low-confidence-abs
                        Min number of present samples an SV needs (2)
  --combine-null-min-coverage
                        Min coverage for a genotype to be reported as 0/0 instead of ./. (5)
  --combine-match       Multiplier for maximum deviation of multiple SV's start/end position for
                        them to be combined across samples. Given by
                        max_dev=M*sqrt(min(SV_length_a,SV_length_b)), where M is this parameter
                        (250)
  --combine-match-max   Upper limit for the max deviation computed for --combine-match, in bp (1000)
  --combine-separate-intra
                        Disable combination of SVs within the same sample
  --combine-output-filtered
                        Include low-confidence / mosaic SVs in multi-calling
  --combine-pair-relabel
                        Override low-quality genotypes when combining paired samples
  --combine-pair-relabel-threshold
                        Genotype quality minimum before relabeling (20)
  --combine-close-handles
                        Close .SNF file handles after each use to avoid opened files ulimit when
                        merging many samples.
  --combine-pctseq      Min alignment distance as percent of SV length to be merged. 0=off (0.7)

Output formatting parameters:
  --output-rnames       Output names of supporting reads in INFO/RNAME
  --no-consensus        Disable consensus sequence generation for insertion SV calls
  --no-sort             Do not sort output VCF
  --no-progress         Disable progress display
  --quiet               Disable any non-error logging
  --max-del-seq-len     Max deletion sequence length in output before writing as symbolic
                        (50000)
  --symbolic            Output all SVs as symbolic
  --allow-overwrite     Allow overwriting existing output files

Mosaic/somatic calling mode parameters:
  --mosaic              Turn on mosaic calling
  --mosaic-af-max       Max allele frequency for which SVs are considered mosaic (0.2)
  --mosaic-af-min       Min allele frequency for mosaic SVs to be output (0.05)
  --mosaic-qc-invdup-min-length
                        Min SV length for mosaic inversion and duplication SVs (500)
  --mosaic-qc-coverage-max-change-frac
                        Max relative coverage change across breakpoints (0.1)
  --mosaic-qc-strand    Apply filtering based on strand support of calls
  --mosaic-include-germline
                        Report germline SVs as well in mosaic mode

Developer parameters:
  --combine-consensus   Output the consensus genotype of all samples
  --qc-coverage-max-change-frac F
                        Max relative coverage change across SV breakpoints
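
The formula in the --combine-match help text can be read as follows. This is a sketch of the help text only, not of the Sniffles implementation: the function name `combine_max_deviation` is hypothetical, lengths are assumed to be non-negative, and the cap is assumed to be a simple minimum with --combine-match-max.

```python
import math


def combine_max_deviation(sv_length_a: int, sv_length_b: int,
                          match: float = 250.0,
                          match_max: float = 1000.0) -> float:
    """Max start/end deviation in bp, as described by the help text.

    max_dev = M * sqrt(min(SV_length_a, SV_length_b)), where M is
    --combine-match, limited from above by --combine-match-max.
    """
    max_dev = match * math.sqrt(min(sv_length_a, sv_length_b))
    return min(max_dev, match_max)


print(combine_max_deviation(4, 100))    # below the cap
print(combine_max_deviation(100, 400))  # limited by match_max
```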

`sniffles --example` output

sniffles example commands:

Call SVs for a single sample
     -> sniffles --input sorted_indexed_alignments.bam --vcf output.vcf

   ... OR, with CRAM input and bgzipped+tabix indexed VCF output:
     -> sniffles --input sample.cram --vcf output.vcf.gz

   ... OR, producing only a SNF file with SV candidates:
     -> sniffles --input sample1.bam --snf sample1.snf

   ... OR, simultaneously produce a single-sample VCF and SNF file:
     -> sniffles --input sample1.bam --vcf sample1.vcf.gz --snf sample1.snf

   ... OR, with tandem repeat annotations, reference (for DEL sequences) and mosaic mode for detecting rare SVs:
     -> sniffles --input sample1.bam --vcf sample1.vcf.gz --tandem-repeats tandem_repeats.bed --reference genome.fa --mosaic

Multi-sample calling
   Step 1. Create .snf for each sample:
     -> sniffles --input sample1.bam --snf sample1.snf
   Step 2. Combined calling:
     -> sniffles --input sample1.snf sample2.snf ... sampleN.snf --vcf multisample.vcf

   ... OR, using a .tsv file containing a list of .snf files and sample ids (one sample per line):
   Step 2. Combined calling:
     -> sniffles --input snf_files_list.tsv --vcf multisample.vcf

Determine genotypes for a set of known SVs (force calling)
     -> sniffles --input sample.bam --genotype-vcf input_known_svs.vcf --vcf output_genotypes.vcf
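
The two-step multi-sample workflow above can also be scripted. The sketch below only builds the command lines, using flags that appear in the example output. The helper names `snf_command` and `combine_command` are hypothetical, and deriving each .snf name from the BAM name is an assumption made for illustration.

```python
import shlex
from pathlib import Path


def snf_command(bam: str) -> list[str]:
    """Step 1: per-sample run that stores SV candidates in a .snf file."""
    snf = str(Path(bam).with_suffix(".snf"))
    return ["sniffles", "--input", bam, "--snf", snf]


def combine_command(snf_files: list[str], vcf: str) -> list[str]:
    """Step 2: combined calling across all .snf files."""
    return ["sniffles", "--input", *snf_files, "--vcf", vcf]


bams = ["sample1.bam", "sample2.bam"]
step1 = [snf_command(bam) for bam in bams]
# The last element of each step 1 command is the .snf path it writes.
step2 = combine_command([cmd[-1] for cmd in step1], "multisample.vcf")

for cmd in [*step1, step2]:
    print(shlex.join(cmd))
```

To actually run the workflow, each list could be passed to `subprocess.run(cmd, check=True)` on a machine where sniffles is installed.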