bioinform / somaticseq

An ensemble approach to accurately detect somatic mutations using SomaticSeq
http://bioinform.github.io/somaticseq/
BSD 2-Clause "Simplified" License
189 stars 53 forks source link
cancer-genomics somatic-variants

SomaticSeq

SomaticSeq is an ensemble somatic SNV/indel caller that has the ability to use machine learning to filter out false positives from other callers. The detailed documentation is located in docs/Manual.pdf.

Training data for benchmarking and/or model building

In 2021, the FDA-led MAQC-IV/SEQC2 Consortium has produced multi-center multi-platform whole-genome and whole-exome sequencing data sets for a pair of tumor-normal reference samples (HCC1395 and HCC1395BL), along with the high-confidence somatic mutation call set. This work was published in Fang, L.T., Zhu, B., Zhao, Y. et al. Establishing community reference samples, data and call sets for benchmarking cancer mutation detection using whole-genome sequencing. Nat Biotechnol 39, 1151-1160 (2021) / PMID:34504347 / Free Read-Only Link. The following are some of the use cases for these resources:

Click for more details of the SEQC2's somatic mutation project.

Recommendation of how to use SEQC2 data to create SomaticSeq classifiers.


Briefly explaining SomaticSeq v1.0 SEQC2 somatic mutation reference data and call sets How to run SomaticSeq v3.6.3 on precisionFDA
Run in train or prediction mode

Installation

Dependencies

This dockerfile reveals the dependencies

To install using pip

Make sure to install bedtools separately.

pip install somaticseq

To install the bioconda version

SomaticSeq can also be found on Anaconda-Server Badge. To install with bioconda, which also automatically installs a bunch of 3rd-party somatic mutation callers:

conda install -c bioconda somaticseq

To install from github source with conda

conda create --name my_env -c bioconda python bedtools
conda activate my_env
git clone git@github.com:bioinform/somaticseq.git
cd somaticseq
pip install -e .

Test your installation

There are some toy data sets and test scripts in example that should finish in <1 minute if installed properly.

Run SomaticSeq with an example command

# Merge caller results and extract SomaticSeq features
somaticseq_parallel.py \
  --output-directory  $OUTPUT_DIR \
  --genome-reference  GRCh38.fa \
  --inclusion-region  genome.bed \
  --exclusion-region  blacklist.bed \
  --threads           24 \
paired \
  --tumor-bam-file    tumor.bam \
  --normal-bam-file   matched_normal.bam \
  --mutect2-vcf       MuTect2/variants.vcf \
  --varscan-snv       VarScan2/variants.snp.vcf \
  --varscan-indel     VarScan2/variants.indel.vcf \
  --jsm-vcf           JointSNVMix2/variants.snp.vcf \
  --somaticsniper-vcf SomaticSniper/variants.snp.vcf \
  --vardict-vcf       VarDict/variants.vcf \
  --muse-vcf          MuSE/variants.snp.vcf \
  --lofreq-snv        LoFreq/variants.snp.vcf \
  --lofreq-indel      LoFreq/variants.indel.vcf \
  --scalpel-vcf       Scalpel/variants.indel.vcf \
  --strelka-snv       Strelka/variants.snv.vcf \
  --strelka-indel     Strelka/variants.indel.vcf \
  --arbitrary-snvs    additional_snv_calls_1.vcf.gz additional_snv_calls_2.vcf.gz ... \
  --arbitrary-indels  additional_indel_calls_1.vcf.gz additional_indel_calls_2.vcf.gz ...

Additional parameters to be specified before paired option to invoke training mode. In addition to the four files specified above, two classifiers (SNV and indel) will be created..

Additional input files to be specified before paired option invoke prediction mode (to use classifiers to score variants). Four additional files will be created, i.e., SSeq.Classified.sSNV.vcf, SSeq.Classified.sSNV.tsv, SSeq.Classified.sINDEL.vcf, and SSeq.Classified.sINDEL.tsv.

Without those paramters above to invoking training or prediction mode, SomaticSeq will default to majority-vote consensus mode.

Do not worry if Python throws the following warning. This occurs when SciPy attempts a statistical test with empty data, e.g., z-scores between reference- and variant-supporting reads will be nan if there is no reference read at a position.

  RuntimeWarning: invalid value encountered in double_scalars
  z = (s - expected) / np.sqrt(n1*n2*(n1+n2+1)/12.0)

To train for SomaticSeq classifiers with multiple data sets

Run somatic_xgboost.py train --help to see the options, e.g.,

somatic_xgboost.py train \
  -tsvs SAMPLE_1/Ensemble.sSNV.tsv SAMPLE_2/Ensemble.sSNV.tsv ... SAMPLE_N/Ensemble.sSNV.tsv \
  -out multiSample.SNV.classifier \
  -threads 8 -depth 12 -seed 42 -method hist -iter 250 \
  --extra-params scale_pos_weight:0.1 grow_policy:lossguide max_leaves:12

Run SomaticSeq modules seperately

Most SomaticSeq modules can be run on their own. They may be useful in debugging context, or be run for your own purposes. See this page for your options.

Dockerized workflows and pipelines

To run somatic mutation callers and then SomaticSeq

We have created a module (i.e., makeSomaticScripts.py) that can run all the dockerized somatic mutation callers and then SomaticSeq, described at somaticseq/utilities/dockered_pipelines. There is also an alignment workflow described there. You need docker to run these workflows. Singularity is also supported, but is not optimized. Let me know if you find bugs.

To create training data to create SomaticSeq classifiers

Dockerized alignment pipeline based on GATK's best practices

Described at somaticseq/utilities/dockered_pipelines. The module is makeAlignmentScripts.py.

Utilities

We have some generally useful scripts in utilities. Some of the more useful tools, e.g.,