gagneurlab / MMSplice_MTSplice

Tissue-specific variant effect predictions on splicing
MIT License
machine-learning splicing variant-effect-prediction vep-plugin

MMSplice & MTSplice


Predict (tissue-specific) splicing variant effect from VCF. MTSplice is integrated into MMSplice with the same API.

Paper: Cheng et al. https://doi.org/10.1101/438986, https://www.biorxiv.org/content/10.1101/2020.06.07.138453v1


Installation


External dependencies:

pip install cyvcf2 cython

Conda installation is recommended:

conda install cyvcf2 cython -y
pip install mmsplice

Run MMSplice Online

You can run MMSplice online with the following Google Colab notebooks:

Preparation


1. Prepare annotation (gtf) file

A standard human gene annotation file in GTF format can be downloaded from Ensembl or GENCODE. MMSplice can work directly with those files; however, some filtering is highly recommended.
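As an illustration of the kind of filtering meant here, the sketch below keeps only records annotated as protein coding. This specific filter is an assumption, as are the attribute names gene_type (GENCODE style) and gene_biotype (Ensembl style); check them against the annotation release you use.

```python
import re

# Assumption: the biotype is stored as gene_type (GENCODE) or gene_biotype (Ensembl).
_BIOTYPE = re.compile(r'gene_(?:bio)?type "([^"]+)"')


def filter_protein_coding(lines):
    """Yield GTF comment lines and records annotated as protein_coding."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep header/comment lines untouched
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip malformed records
        match = _BIOTYPE.search(fields[8])
        if match and match.group(1) == "protein_coding":
            yield line


# Tiny in-memory example; real use would iterate over an open GTF file.
example = [
    "#!genome-build GRCh37\n",
    '17\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n',
    '17\tensembl\texon\t300\t400\t.\t+\t.\tgene_id "G2"; gene_biotype "lincRNA";\n',
]
kept = list(filter_protein_coding(example))
```

To filter a real file, pass an open file handle as `lines` and write the yielded lines to a new GTF.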

2. Prepare variant (VCF) file

A correctly formatted VCF file will work with MMSplice; however, the following steps will make it less prone to false positives:

3. Prepare reference genome (fasta) file

A human reference FASTA file can be downloaded from Ensembl or GENCODE. Make sure the chromosome names match those in the GTF annotation file you use.
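A mismatch such as `chr17` versus `17` is easy to miss, so a quick consistency check can help before running predictions. The sketch below is not part of MMSplice; the helper names are hypothetical and it only relies on the plain-text layout of the three formats (first column for GTF and VCF, `>` header lines for FASTA).

```python
def first_column_names(lines):
    """Sequence names from tab-separated records (GTF or VCF), skipping '#' lines."""
    return {
        line.split("\t", 1)[0]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def fasta_names(lines):
    """Sequence names from FASTA headers ('>' up to the first whitespace)."""
    return {
        line[1:].split()[0]
        for line in lines
        if line.startswith(">") and line[1:].strip()
    }


def missing_from_fasta(fasta_lines, *record_files):
    """Names used in the GTF/VCF records that have no sequence in the FASTA."""
    available = fasta_names(fasta_lines)
    missing = set()
    for lines in record_files:
        missing |= first_column_names(lines) - available
    return missing


# Tiny in-memory example: the GTF uses a 'chr' prefix, the FASTA and VCF do not.
fasta_lines = [">17 dna:chromosome\n", "ACGT\n"]
gtf_lines = ['chr17\tensembl\texon\t100\t200\t.\t+\t.\tgene_id "G1";\n']
vcf_lines = ["##fileformat=VCFv4.2\n", "#CHROM\tPOS\tID\tREF\tALT\n", "17\t150\t.\tA\tG\n"]
missing = missing_from_fasta(fasta_lines, gtf_lines, vcf_lines)
```

An empty result means every name in the GTF and VCF has a matching FASTA sequence. For a gzipped VCF, open it with `gzip.open(path, "rt")` before passing it in.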

Example code


Check notebooks/example.ipynb

To score variants (including indels), we suggest primarily using the deltaLogitPSI predictions, which are the default output. The differential splicing efficiency (dse) model was trained from MMSplice modules and exonic variants from MaPSy, so only its predictions for exonic variants are calibrated.

MTSplice: to predict tissue-specific variant effects, specify tissue_specific=True in SplicingVCFDataloader.

# Import
from mmsplice.vcf_dataloader import SplicingVCFDataloader
from mmsplice import MMSplice, predict_save, predict_all_table
from mmsplice.utils import max_varEff

# example files
gtf = 'tests/data/test.gtf'
vcf = 'tests/data/test.vcf.gz'
fasta = 'tests/data/hg19.nochr.chr17.fa'
csv = 'pred.csv'

Dataloader to load variants from vcf

dl = SplicingVCFDataloader(gtf, fasta, vcf, tissue_specific=False)

To predict tissue-specific effects, use tissue_specific=True in the dataloader instead

dl = SplicingVCFDataloader(gtf, fasta, vcf, tissue_specific=True)

Run prediction with default MMSplice parameters

# Specify model
model = MMSplice()

# Predict and return the results as a DataFrame
predictions = predict_all_table(model, dl, pathogenicity=True, splicing_efficiency=True)

To predict variant effects on the natural PSI scale instead of the logit(PSI) scale, set natural_scale=True. This option only works with tissue-specific predictions, i.e. dl = SplicingVCFDataloader(..., tissue_specific=True):

# Predict on the natural scale and return the results as a DataFrame
predictions = predict_all_table(model, dl, natural_scale=True)
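The difference between the two scales can be illustrated with the standard logit/sigmoid relationship: a logit-scale effect only becomes a change in PSI once a reference PSI is known, which is presumably why a reference value is needed for natural-scale output. The sketch below is a worked illustration with hypothetical function names, not the package's internal code.

```python
import math


def logit(p):
    """Log-odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def delta_psi(ref_psi, delta_logit_psi):
    """Change in PSI implied by a logit-scale effect at a given reference PSI."""
    return sigmoid(logit(ref_psi) + delta_logit_psi) - ref_psi


# The same logit-scale effect at two different reference inclusion levels.
same_effect = -1.0
mid = delta_psi(0.5, same_effect)    # exon included about half of the time
high = delta_psi(0.99, same_effect)  # exon almost always included
```

Here `mid` is much larger in magnitude than `high`: the same logit-scale effect changes PSI a lot for an exon with intermediate inclusion and very little for an exon that is almost always included.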

One variant might map to multiple exons. In the end, we summarize the effect of a variant as the maximum across all exons.

# Summarize with maximum effect size
predictionsMax = max_varEff(predictions)
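To make the summarization step concrete, here is a plain-Python sketch of the idea, assuming "maximum" means the largest absolute effect. It does not reproduce max_varEff itself, and the column names (ID, exons, delta_logit_psi) and values are made up for illustration.

```python
def max_effect_per_variant(rows, key="ID", score="delta_logit_psi"):
    """Keep, for each variant, the row with the largest absolute score."""
    best = {}
    for row in rows:
        current = best.get(row[key])
        if current is None or abs(row[score]) > abs(current[score]):
            best[row[key]] = row
    return list(best.values())


# Hypothetical predictions: the first variant overlaps two exons.
rows = [
    {"ID": "17:100:A>G", "exons": "exon_1", "delta_logit_psi": -0.3},
    {"ID": "17:100:A>G", "exons": "exon_2", "delta_logit_psi": -2.1},
    {"ID": "17:500:C>T", "exons": "exon_3", "delta_logit_psi": 0.4},
]
summary = max_effect_per_variant(rows)
```

Each variant ends up with a single row, so the summarized table can be joined back to the VCF by variant ID.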

Output

The output of MMSplice is tabular data containing the columns described below:

VEP Plugin

The VEP plugin wraps the prediction function from the mmsplice Python package. See the VEP plugin documentation in VEP_plugin/README.md.