broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

organize tools list better #1669

Closed akiezun closed 6 years ago

akiezun commented 8 years ago

Right now you get the listing below, which is bogus on many levels (duplicated and confusing categories, confusing tool names, etc.). We need to put more order into this. @vdauwera, can you help come up with a better scheme for organizing the tools? Compare to the ADAM project (much, much smaller scope, of course, but a very clean UI): https://github.com/bigdatagenomics/adam

/gatk-launch --list
Running:
    /Users/akiezun/IdeaProjects/gatk/build/install/gatk/bin/gatk --help
USAGE:  <program name> [-h]

Available Programs:
--------------------------------------------------------------------------------------
Copy Number Analysis:                            Tools to analyze copy number data.
    CalculateTargetCoverage                      Count overlapping reads target by target

--------------------------------------------------------------------------------------
Fasta:                                           Tools for analysis and manipulation of files in fasta format
    CreateSequenceDictionary                     Creates a dict file from reference sequence in fasta format
    NormalizeFasta                               Normalizes lines of sequence in a fasta file to be of the same length

--------------------------------------------------------------------------------------
Intervals:                                       Tools for processing intervals and associated overlapping records
    BedToIntervalList                            Converts a BED file to an Picard Interval List
    ExampleIntervalWalker                        Print intervals with optional contextual data
    IntervalListTools                            General tool for manipulating interval lists
    LiftOverIntervalList                         Lifts over an interval list between genome builds

--------------------------------------------------------------------------------------
QC:                                              Tools for Diagnostics and Quality Control
    AnalyzeCovariates                            Tool to analyze and evaluate base recalibration tables for BQSR
    CalculateHsMetrics                           Produces Hybrid Selection-specific metrics for a SAM/BAM file
    CollectAlignmentSummaryMetrics               Produces from a SAM/BAM/CRAM file containing summary alignment metrics
    CollectBaseDistributionByCycle               Produces metrics about nucleotide distribution per cycle in a SAM/BAM/CRAM file
    CollectGcBiasMetrics                         Produces metrics about GC bias in the reads in the provided SAM/BAM file
    CollectInsertSizeMetrics                     Produces metrics for insert size distribution for a SAM/BAM/CRAM file
    CollectJumpingLibraryMetrics                 Produces jumping library metrics for the provided SAM/BAMs
    CollectMultipleMetrics                       A "meta-metrics" calculating program that produces multiple metrics for the provided SAM/BAM file
    CollectOxoGMetrics                           Produces metrics quantifying the CpCG -> CpCA error rate from the provided SAM/BAM file
    CollectQualityYieldMetrics                   Produces metrics that quantify the quality and yield of sequence data from the provided SAM/BAM/CRAM file
    CollectRnaSeqMetrics                         Produces RNA alignment metrics for a SAM/BAM file
    CollectRrbsMetrics                           Produces metrics about bisulfite conversion for RRBS data
    CollectSequencingArtifactMetrics             Produces metrics to quantify single-base sequencing artifacts from a SAM/BAM file
    CollectTargetedPcrMetrics                    Produces Targeted PCR-related metrics given the provided SAM/BAM
    CollectWgsMetrics                            Produces metrics related to whole genome sequencing for a SAM/BAM file
    MeanQualityByCycle                           Produces metrics for mean quality by cycle for a SAM/BAM/CRAM file
    QualityScoreDistribution                     Produces metrics for quality score distributions for a SAM/BAM/CRAM file

--------------------------------------------------------------------------------------
SAM/BAM/CRAM:                                    Tools for manipulating read-level data (SAM/BAM/CRAM)
    AddCommentsToBam                             Adds comments to the header of a BAM file
    AddOrReplaceReadGroups                       Replaces read groups in a SAM/BAM/CRAM file with a single new read group
    ApplyBQSR                                    Applies the BQSR table to the input SAM/BAM/CRAM
    BaseRecalibrator                             Generates recalibration table for BQSR
    BuildBamIndex                                Generates a BAM index (.bai) file
    CalculateReadGroupChecksum                   Creates a hash code based on the read groups (RG) in the SAM/BAM/CRAM header
    CleanSam                                     Cleans the provided SAM/BAM/CRAM, soft-clipping beyond-end-of-reference alignments and setting MAPQ to 0 for unmapped reads
    ClipReads                                    Clip reads in a SAM/BAM/CRAM file
    CompareBaseQualities                         Compares base qualities of two input SAM/BAM/CRAM files
    CompareSAMs                                  Compares two input SAM/BAM/CRAM files
    CountBases                                   Count bases in a SAM/BAM/CRAM file
    CountReads                                   Count reads in a SAM/BAM/CRAM file
    DownsampleSam                                Down-sample a SAM/BAM file to retain a random subset of the reads
    EstimateLibraryComplexity                    Estimates library complexity from the sequence of read pairs
    ExampleReadWalkerWithReference               Print reads with reference context
    ExampleReadWalkerWithVariants                Print reads with overlapping variants
    FastqToSam                                   Converts a fastq file to an unaligned SAM/BAM file
    FilterReads                                  Creates a new SAM/BAM/CRAM file by including or excluding aligned reads
    FixMateInformation                           Ensure that all mate-pair information is in sync between each read and its mate pair
    FixMisencodedBaseQualityReads                Fix Illumina base quality scores in a SAM/BAM/CRAM file
    FlagStat                                     A reimplementation of the 'samtools flagstat' subcommand
    GatherBQSRReports                            Gathers scattered BQSR recalibration reports into a single file
    GatherBamFiles                               Concatenates one or more BAM files together as efficiently as possible
    LeftAlignIndels                              Left-aligns indels from reads in a SAM/BAM/CRAM file
    MarkDuplicates                               Examines aligned records in the supplied SAM/BAM/CRAM file to locate duplicate molecules.
    MergeBamAlignment                            Merges alignment data from a SAM/BAM with data in an unmapped SAM/BAM/CRAM file
    MergeSamFiles                                Merges multiple SAM/BAM files into one file
    PrintReads                                   Print reads in the SAM/BAM/CRAM file
    ReorderSam                                   Reorders reads in a SAM/BAM file to match ordering in reference
    ReplaceSamHeader                             Replace the SAMFileHeader in a SAM/BAM file with the given header
    RevertBaseQualityScores                      Revert Quality Scores in a SAM/BAM/CRAM file
    RevertOriginalBaseQualitiesAndAddMateCigar   Reverts the original base qualities and adds the mate cigar tag to read-group BAMs
    RevertSam                                    Reverts SAM/BAM files to a previous state
    SamFormatConverter                           Convert a SAM/BAM/CRAM file to a SAM/BAM/CRAM file
    SamToFastq                                   Converts a SAM/BAM file into a FASTQ
    SortSam                                      Sorts a SAM/BAM/CRAM file
    SplitNCigarReads                             Split Reads with N in Cigar
    SplitReads                                   Outputs reads from a SAM/BAM/CRAM by read group, sample and library name
    UnmarkDuplicates                             Unmark duplicates in a SAM/BAM/CRAM file
    ValidateSamFile                              Validates a SAM/BAM/CRAM file

--------------------------------------------------------------------------------------
Spark Validation tools:                          Tools written in Spark to compare aspects of two different files
    CompareBaseQualitiesSpark                    Diff qs of the BAMs
    CompareDuplicatesSpark                       Compares two BAMs for duplicates

--------------------------------------------------------------------------------------
Spark pipelines:                                 Pipelines that combine tools and use Apache Spark for scaling out (experimental)
    BQSRPipelineSpark                            Both steps of BQSR (BaseRecalibrator and ApplyBQSR) on Spark
    ReadsPipelineSpark                           Takes aligned reads (likely from BWA) and runs MarkDuplicates and BQSR. The final result is analysis-ready reads

--------------------------------------------------------------------------------------
Spark tools:                                     Tools that use Apache Spark for scaling out (experimental)
    ApplyBQSRSpark                               ApplyBQSR on Spark
    BaseRecalibratorSpark                        BaseRecalibrator on Spark
    BaseRecalibratorSparkSharded                 BaseRecalibrator on Spark (experimental sharded implementation)
    CollectBaseDistributionByCycleSpark          CollectBaseDistributionByCycle on Spark
    CollectQualityYieldMetricsSpark              CollectQualityYieldMetrics on Spark
    CountBasesSpark                              CountBases on Spark
    CountReadsSpark                              CountReads on Spark
    CountVariantsSpark                           CountVariants on Spark
    CreateHadoopBamSplittingIndex                create a hadoop-bam splitting index
    FindBadGenomicKmersSpark                     find ref kmers with high copy number
    FindSVBreakpointsSpark                       Produce small FASTQs of reads sharing kmers with putative SV breakpoints for local assembly
    FlagStatSpark                                FlagStat on Spark
    MarkDuplicatesSpark                          MarkDuplicates on Spark
    MeanQualityByCycleSpark                      MeanQualityByCycle on Spark
    PrintReadsSpark                              PrintReads on Spark
    QualityScoreDistributionSpark                QualityScoreDistribution on Spark
    SortReadFileSpark                            SortSam on Spark (works on SAM/BAM/CRAM)

--------------------------------------------------------------------------------------
Spark tools for structural variation analysis:   Structural variation analysis tools that use Apache Spark for scaling out (experimental)
    CollectInsertSizeMetricsSpark                Collect Insert Size Distribution on Spark

--------------------------------------------------------------------------------------
VCF:                                             Tools for manipulating variants and associated metadata
    CountVariants                                Count variants in a VCF file
    ExampleVariantWalker                         Example tool that prints variants with optional contextual data
    FilterVcf                                    Hard-filters a VCF file
    GatherVcfs                                   Gathers multiple VCF files from a scatter operation into a single VCF file
    GenotypeConcordance                          Calculates the concordance between genotype data for two samples in two different VCFs
    IndexFeatureFile                             Creates indices for Feature-containing files (eg VCF and BED files)
    LiftOverVcf                                  Lifts a VCF between genome builds
    MakeSitesOnlyVcf                             Creates a VCF bereft of genotype information from an input VCF
    MergeVcfs                                    Merges multiple VCF files into one VCF file
    RenameSampleInVcf                            Rename a sample within a VCF
    SelectVariants                               Select a subset of variants from a larger callset in a VCF file
    SortVcf                                      Sorts one or more VCF files
    SplitVcfs                                    Splits an input VCF file into two VCF files
    ValidateVariants                             Validate VCF
    VariantFiltration                            Hard-filter variants VCF (mark them as FILTER)
    VariantsToTable                              Extract specific fields from a VCF file to a tab-delimited table
    VcfToIntervalList                            Converts a VCF file to a Picard Interval List

--------------------------------------------------------------------------------------
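
To check a listing like the one above for duplicated or overlapping categories, it can help to load it into a simple category-to-tools mapping first. The following is a minimal sketch in Python, not part of GATK: the function name `parse_tool_listing` and the `SAMPLE` text are hypothetical, and the parser assumes the layout shown above (a dashed separator, then an unindented `Category:` line, then indented tool lines).

```python
import re

# A separator is a line made only of dashes, as in the listing above.
SEPARATOR = re.compile(r"^-{10,}\s*$")

# Hypothetical sample, abbreviated from the listing above.
SAMPLE = """\
Available Programs:
--------------------------------------------------------------------------------------
Fasta:                                           Tools for analysis and manipulation of files in fasta format
    CreateSequenceDictionary                     Creates a dict file from reference sequence in fasta format
    NormalizeFasta                               Normalizes lines of sequence in a fasta file to be of the same length

--------------------------------------------------------------------------------------
Spark Validation tools:                          Tools written in Spark to compare aspects of two different files
    CompareBaseQualitiesSpark                    Diff qs of the BAMs
    CompareDuplicatesSpark                       Compares two BAMs for duplicates

--------------------------------------------------------------------------------------
"""


def parse_tool_listing(text):
    """Return {category name: [tool names]} in the order they appear."""
    categories = {}
    current = None            # category whose tools are being read
    after_separator = False   # True only for the line right after a separator
    for line in text.splitlines():
        if SEPARATOR.match(line):
            after_separator = True
            current = None
            continue
        if not line.strip():
            continue  # blank lines carry no information
        if after_separator and not line.startswith(" ") and ":" in line:
            # Category header: the name is everything before the first colon.
            current = line.split(":", 1)[0].strip()
            categories.setdefault(current, [])
        elif current is not None and line.startswith(" "):
            # Tool line: the first whitespace-delimited token is the tool name.
            categories[current].append(line.split()[0])
        after_separator = False
    return categories


if __name__ == "__main__":
    for category, tools in parse_tool_listing(SAMPLE).items():
        print(f"{category}: {len(tools)} tools")
```

Requiring a separator immediately before a category header keeps lines such as `Running:` and `USAGE:` from being mistaken for categories.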

droazen commented 7 years ago

For @cmnbroad and the 4.0 release milestone.

sooheelee commented 6 years ago

Let me know if and how we can help with this, @cmnbroad. Just to keep you updated, Sheila and I are spending the next three weeks updating GATK4 documentation. We aim for all documentation to be complete by December 14. It's possible we will organize the forum tool documentation to mirror this categorization. There was talk today in a meeting with the engine team that we should propose multiple categorizations, to allow for the possibility (much later, not in the upcoming release) of listing tools based on user preference, e.g. by functional category or by input file type. So any work you do towards categorization could be useful down the road.

Here are some questions that were raised during the meeting:

Because we also need to ensure every tool (excepting Spark and BWA) has a summary description and example command, it would be useful, for documentation purposes at least, to have functional categories whose tools take the same or related inputs, e.g. any type of interval list (BED, Picard-style, or GATK), so that we can group testing of the tool commands with minimal context switching.
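
To make the "multiple categorizations" idea concrete, here is a minimal sketch in Python of listing the same tools by functional category or by input file type. It is illustration only, not how GATK stores tool metadata: the `Tool` record, the `group_tools` helper, and the category labels ("QC", "Sorting", "Validation") are all hypothetical. Only the tool names are taken from the listing earlier in this issue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    function: str    # hypothetical functional category label
    input_type: str  # hypothetical input file type label


# Tool names come from the listing in this issue; the labels are assumptions.
TOOLS = [
    Tool("CollectInsertSizeMetrics", "QC", "SAM/BAM/CRAM"),
    Tool("SortSam", "Sorting", "SAM/BAM/CRAM"),
    Tool("SortVcf", "Sorting", "VCF"),
    Tool("ValidateSamFile", "Validation", "SAM/BAM/CRAM"),
    Tool("ValidateVariants", "Validation", "VCF"),
]


def group_tools(tools, key):
    """Group tool names by the attribute named `key`, sorted for stable output."""
    groups = defaultdict(list)
    for tool in tools:
        groups[getattr(tool, key)].append(tool.name)
    return {label: sorted(names) for label, names in sorted(groups.items())}


if __name__ == "__main__":
    for key in ("function", "input_type"):
        print(f"Listing by {key}:")
        for label, names in group_tools(TOOLS, key).items():
            print(f"  {label}: {', '.join(names)}")
```

Because each tool carries both labels, switching the listing from one scheme to the other is just a change of grouping key. Grouping by `input_type` also yields the batches of tools with related inputs that would let example commands be tested together.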

sooheelee commented 6 years ago

Turns out categorization is on the Comms team--I misunderstood. @cmnbroad you are only on the hook for labels and other engineering feats.

sooheelee commented 6 years ago

This is being finalized via efforts in https://github.com/broadinstitute/gatk/issues/3853 and https://github.com/broadinstitute/dsde-docs/issues/2639. Whether the new categorization schema is better, well, no one has complained so far.

droazen commented 6 years ago

This is done -- closing.