asoltis / MutEnricher

Somatic coding and non-coding mutation enrichment analysis for tumor WGS data

MutEnricher


Author: Anthony R. Soltis (anthony.soltis.ctr@usuhs.edu, anthonyrsoltis@gmail.com)

Institution: Uniformed Services University of the Health Sciences, Bethesda, MD

License: MIT License, see License

Version: 1.3.3

Introduction:

MutEnricher is a flexible toolset that performs somatic mutation enrichment analysis of both protein-coding and non-coding genomic loci from whole genome sequencing (WGS) data. It is implemented in Python and runs under both Python 2 and Python 3.

MutEnricher is now also available as a Docker image.

MutEnricher contains two distinct modules:

  1. coding - for performing somatic enrichment analysis of non-silent variation in protein-coding genes
  2. noncoding - for performing enrichment analysis of non-coding regions

The main driver script is mutEnricher.py, and each module can be invoked from it, i.e.:

  1. python mutEnricher.py coding ...
  2. python mutEnricher.py noncoding ...

See help pages and associated documentation for methodological and run details.

Citation:

A MutEnricher manuscript is now published in BMC Bioinformatics. Please cite if using this software:

Soltis, A.R., Dalgard, C.L., Pollard, H.B., & Wilkerson, M.D. MutEnricher: a flexible toolset for somatic mutation enrichment analysis of tumor whole genomes. BMC Bioinformatics (2020). 20(1).

Info and User Guides:

Wiki

Quickstart guide

Tutorial

Output file descriptions

Installation:

See Installation Guide section on Wiki.

Additional utilities

In the "utilities" sub-directory, we include two helper scripts for generating covariate files for use with MutEnricher's covariate clustering functions:

1. get_gene_covariates.py  
2. get_region_covariates.py

See the help pages for example usage. Script (1) above requires an input GTF (as for the coding module) and script (2) requires an input BED (as for the noncoding module). Both also require an indexed genome FASTA file (e.g. for the hg19 or hg38 human genome) as input.

Example data

We include various example files for testing MutEnricher on synthetic somatic data. See the "example_data" sub-folder.

Several quickstart commands are provided in the example_data/quickstart_commands.txt file. A sample quickstart command for coding analysis:

cd example_data
python ../mutEnricher.py coding annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz vcf_files.txt --anno-type nonsilent_terms.txt -o test_out_coding --prefix test_global
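For scripted runs, the same quickstart command can be assembled from Python. The following is a minimal sketch rather than part of MutEnricher itself: the helper name build_coding_command is hypothetical, and only the positional arguments and options shown in the command above (--anno-type, -o, --prefix) are used. It assumes the script is executed from inside the example_data directory.

```python
import sys


def build_coding_command(gtf, vcf_list, anno_type, out_dir, prefix,
                         driver="../mutEnricher.py"):
    """Return the argument list for a coding-module run (nothing is executed).

    build_coding_command is a hypothetical helper for illustration only;
    the arguments mirror the quickstart command shown above.
    """
    return [
        sys.executable, driver, "coding",
        gtf, vcf_list,
        "--anno-type", anno_type,
        "-o", out_dir,
        "--prefix", prefix,
    ]


cmd = build_coding_command(
    gtf="annotation_files/ucsc.refFlat.20170829.no_chrMY.gtf.gz",
    vcf_list="vcf_files.txt",
    anno_type="nonsilent_terms.txt",
    out_dir="test_out_coding",
    prefix="test_global",
)

# Show the command without the interpreter path.
print(" ".join(cmd[1:]))
```

To actually launch the analysis, the list can be passed to subprocess.run(cmd, check=True). Passing the arguments as a list avoids shell quoting problems with file names.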

Files/folders contained in example_data:

  1. example_data/annotation_files

    Contains example GTF and BED files for running MutEnricher's coding and noncoding modules.

    • ucsc.refFlat.20170829.no_chrMY.gtf.gz
    • ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed

    NOTE: Input GTF (coding analysis) and BED files (noncoding analysis) can be gzip compressed or not.
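    When inspecting these annotation files in your own scripts, it can be convenient to read compressed and uncompressed inputs the same way. The sketch below is one way to do that in Python 3 and is not taken from MutEnricher's code: open_text and count_bed_regions are hypothetical helper names, and the choice of which header lines to skip is an assumption.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed text file for reading.

    Compression is detected from the two-byte gzip magic number rather
    than from the file extension. (Hypothetical helper, Python 3.)
    """
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "r")


def count_bed_regions(path):
    """Count non-empty, non-header lines in a BED file (plain or gzipped)."""
    n = 0
    with open_text(path) as fh:
        for line in fh:
            # Assumption: skip comments and UCSC-style track/browser lines.
            if line.strip() and not line.startswith(("#", "track", "browser")):
                n += 1
    return n
```

    For example, count_bed_regions("annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed") would report the number of regions in the example BED file when run from the example_data directory.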

  2. example_data/covariates

    Contains example covariate and covariate weights files for running the covariate clustering background method:

    For coding:

    • ucsc.refFlat.20170829.no_chrMY.covariates.txt
    • ucsc.refFlat.20170829.no_chrMY.covariate_weights.txt

    For noncoding:

    • ucsc.refFlat.20170829.promoters_up1kb_down200.no_chrMY.covariates.txt
    • ucsc.refFlat.20170829.promoters_up1kb_down200.no_chrMY.covariate_weights.txt

  3. nonsilent_terms.txt

    Example non-silent terms file for use with coding module. This example is applicable to VCFs annotated with ANNOVAR refGene models (the sample VCFs are annotated in this way). Use with the --anno-type option in the coding module.

    NOTE: These same terms will be used if "annovar" is passed to the --anno-type option.

  4. precomputed_apcluster

    This folder provides pre-computed affinity propagation results for the datasets in (1) and (2) above. These directories can be supplied to MutEnricher via the --precomputed-covars option.

    For coding (all genes):

    • coding.ucsc.refFlat.20170829.no_chrMY/all_genes

    For noncoding:

    • noncoding.ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY/apcluster_regions

  5. quickstart_commands.txt

    Sample execution commands (associated with quickstart guide).

  6. vcf_files.txt

    Sample list of input VCF files. The paths in this file are local (relative) and assume that the working directory is the "example_data" sub-directory.
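    Because the listed paths are relative, a quick pre-flight check can catch a wrong working directory before a run. The following is a rough sketch under stated assumptions: the helper read_vcf_list is hypothetical, it assumes one entry per line with the VCF path as the first whitespace-delimited field, and it resolves relative paths against the directory holding the list file (which coincides with "example_data" for the bundled file).

```python
import os


def read_vcf_list(list_path):
    """Return (vcf_path, vcf_exists, index_exists) for each listed VCF.

    Hypothetical helper. Assumes the VCF path is the first
    whitespace-delimited field on each line; any further columns are ignored.
    """
    base = os.path.dirname(os.path.abspath(list_path))
    entries = []
    with open(list_path) as fh:
        for line in fh:
            fields = line.split()
            if not fields or fields[0].startswith("#"):
                continue  # skip blank lines and comments
            vcf = fields[0]
            if not os.path.isabs(vcf):
                vcf = os.path.join(base, vcf)
            entries.append(
                (vcf, os.path.exists(vcf), os.path.exists(vcf + ".tbi"))
            )
    return entries
```

    Any entry whose second or third element is False points to a missing VCF or a missing .tbi index, respectively.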

  7. vcfs

    Sub-directory containing 100 synthetic somatic VCF files (compressed, with accompanying .tbi index files). These files were generated by randomly inserting "somatic mutations" at positions in the hg19 genome at a target rate of ~2 mutations/Mb. Three true positive cases are included, two coding and one non-coding: non-silent mutations were inserted into the TP53 and KRAS genes, and somatic mutations were inserted into the TERT gene promoter region.
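    To give a sense of scale for the ~2 mutations/Mb target rate, the sketch below computes the expected number of mutations for a region and draws that many distinct positions uniformly at random. This only illustrates the idea and is not the generator used to build the example VCFs: the function names, the uniform sampling scheme, and the round genome size of 3,000,000,000 bp are assumptions.

```python
import random


def expected_mutations(region_length_bp, rate_per_mb=2.0):
    """Expected mutation count for a region at a given rate per megabase."""
    return rate_per_mb * region_length_bp / 1e6


def sample_positions(region_length_bp, rate_per_mb=2.0, seed=0):
    """Draw distinct 1-based positions uniformly at the target rate.

    Illustrative only; a fixed seed makes the draw reproducible.
    """
    rng = random.Random(seed)
    n = int(round(expected_mutations(region_length_bp, rate_per_mb)))
    return sorted(rng.sample(range(1, region_length_bp + 1), n))


# About 6000 mutations per sample for a genome of roughly 3 Gb (assumed size).
print(expected_mutations(3000000000))

# A toy 50 Mb chromosome yields 100 positions.
positions = sample_positions(50000000)
print(len(positions))
```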

Change log


06-15-2021

05-11-2021

10-01-2020

06-10-2020

10-23-2019

10-10-2019

09-13-2019

03-25-2019

02-12-2019

01-15-2019

06-15-2018