SVFX is a machine learning-based tool for assigning pathogenicity scores to large deletions and duplications.
Install conda using these directions. On macOS, you can create and activate the provided environment with

`conda env create -f SVFXenv.yml && conda activate SVFXenv`

otherwise, run

`conda env create -f SVFXenv_all_platforms.yml && conda activate SVFXenv`
The SVFX pipeline requires Python 3.6+. The pipeline is built with Snakemake; to configure the pipeline, modify the `config.yaml` file in the root directory. The configuration options are described below; a sketch of an example `config.yaml` follows the list.
- `run_id`: A string to identify this run; this string will be prepended to the files generated by the run so that multiple runs can be executed without cleanup.
- `coordinates`: `control` and `disease` should have the file paths for the control and disease coordinates, respectively.
- `bigwig_files`: A hyphenated list of paths to BigWig files to be used in feature computation.
- `overlap_coordinate_files`: A hyphenated list of paths to coordinate files for computing overlap as a feature (e.g. coordinates of certain genes).
- `normalized`: Boolean flag for whether to length-normalize the feature values.
- `variant_type`: The kind of variants in the input files (either `DEL` or `DUP`).
- `ref_genome`: Index file (`*.fai`) for a reference genome.
- `randomized_num`: If `normalized` is `True`, the number of pseudo-matrices of variants to normalize with.
- `output_matrix`: `control` and `disease` should have the root output names for the control and disease feature matrices, respectively (these will be combined into a file with the root name `combined_matrix`).
- `columns_to_remove`: File with the indices of feature-matrix columns that should not be considered by the model (e.g. the indices of the feature coordinates). The default file should be fine for matrices generated from the scripts we provide.
- `class_label_index`: Column with the class label. Again, the default value should be fine for matrices generated from our scripts.
- `num_trees`: Number of trees in the random forest.
- `max_depth`: The maximum depth of a tree in the random forest.
- `min_samples_split`: The minimum number of samples required to split an internal node in a tree of the random forest.
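For orientation, here is a minimal sketch of how these options might fit together in `config.yaml`. The exact key layout is defined by the `config.yaml` shipped with the repository; all paths and values below are placeholders, not recommended settings.

```yaml
# Illustrative sketch only; check the config.yaml in the repository for the
# exact key names and structure. All paths/values here are placeholders.
run_id: "example_run"
coordinates:
  control: "data/control_SVs.txt"
  disease: "data/disease_SVs.txt"
bigwig_files: "data/gcPercentage.bw"
overlap_coordinate_files: "data/gc19_pc.prom.nr.bed-data/gc19_pc.cds.nr.bed"
normalized: True
variant_type: "DEL"
ref_genome: "data/hg19.fa.fai"
randomized_num: 1
output_matrix:
  control: "control_matrix"
  disease: "disease_matrix"
columns_to_remove: "remove_indices.txt"
class_label_index: 0
num_trees: 150
max_depth: 10
min_samples_split: 2
```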
To run the full pipeline, simply run `snakemake --cores=8` from the root directory.

To run only part of the pipeline, modify `config.yaml` and `Snakefile` with the configuration and input/output you want, then run `snakemake --cores=8 -R --until (rule name)`.

For example, if you have a list of coordinates and just want to generate the feature matrix for it (useful if you want to use a preexisting model to score additional variants), you can modify `Snakefile` to set the control or disease input to your file (for the other input, you can put an arbitrary coordinates file, then delete the output matrix corresponding to that input). You can then run `snakemake --cores=8 -R --until generate_feature_matrix`, and the matrix will be generated.
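As a concrete sketch of that workflow (the matrix file name below is a placeholder; the real name depends on your `run_id` and `output_matrix` settings):

```bash
# Hypothetical partial run after editing Snakefile to point the disease input
# at your coordinates file.
snakemake --cores=8 -R --until generate_feature_matrix
# Discard the matrix generated for the placeholder "other" input, if any
# (placeholder file name).
rm -f example_run_control_matrix.tsv
```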
Usage of the individual programs in the pipeline is explained below. Their dependencies can be installed with pip3 if they are not already available:

`pip3 install scikit-allel`

`pip3 install pyBigWig`

`pip3 install scikit-learn`

(`argparse` is part of the Python 3 standard library and does not need to be installed separately.)
`generate_feature_matrix.py`

Usage: `python3 src/generate_feature_matrix.py -c [SV files in tab-separated coordinate format (each line should start with chr)] -b [Feature files in BigWig format] -g [Coordinates for interval overlap-based features] -o [File root for output] -f [Present if the class label is 1; omit otherwise] -t [SV type (e.g. DEL, DUP, INV)] -z [Present if the matrix should be Z-score normalized (the length feature will not be normalized)] -r [Number of shuffled coordinates to generate for the randomization process] -rg [Reference genome index file; required with the -r flag] -l [If present, length will be included as a feature (and ignored during normalization)]`

Sample command (run from the root directory): `python3 src/generate_feature_matrix.py -c data/sample_input_SV.txt -b data/gcPercentage.bw -g data/gc19_pc.prom.nr.bed data/gc19_pc.3utr.nr.bed data/gc19_pc.5utr.nr.bed data/gc19_pc.cds.nr.bed data/gc19_pc.ss.nr.bed data/sensitive.nc.bed data/ultra.conserved.hg19.bed data/wgEncodeBroadHmmGm12878HMM.Heterochrom.bed data/H1-ESC_Dixon2015-raw_TADs.bed -o sample_output_matrix -t DEL -z -r 1 -rg data/hg19.fa.fai`

Given an input list of structural variants, this script generates a matrix of feature values, calculated from the given BigWig and coordinate-overlap files, for input to a model.
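As a rough illustration of the kind of feature the BigWig inputs provide (this is not the script's actual implementation; the file and interval below are placeholders), a signal track can be averaged over each SV interval with pyBigWig:

```python
# Illustrative sketch only; not the generate_feature_matrix.py implementation.
import pyBigWig

def mean_signal(bw_path, chrom, start, end):
    """Average BigWig signal over an SV interval (one feature column)."""
    bw = pyBigWig.open(bw_path)
    try:
        value = bw.stats(chrom, start, end)[0]  # mean over the interval
        return value if value is not None else 0.0
    finally:
        bw.close()

# Example: GC percentage averaged over a hypothetical deletion on chr1.
print(mean_signal("data/gcPercentage.bw", "chr1", 1_000_000, 1_050_000))
```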
`rf_model.py`

Usage: `python3 src/rf_model.py -i [input feature matrix] -d [File of indices of features to ignore (one per line)] -t [Index of the class label in the feature matrix] -n [Number of trees in the forest] -m [Maximum depth of a tree in the forest] -s [Minimum number of samples required to split an internal node] -c [Number of pathogenic SVs in the matrix] -l [Total number of SVs in the matrix] -o [Output root filename]`

Sample command: `python3 src/rf_model.py -i input_del_matrix.ZscoreNormalized.training.tsv -d remove_indices.txt -t 0 -n 150 -c 3613 -l 7225 -o output_model_`

Given an input feature matrix, this script trains a set of 10 random forest classifiers on disjoint tenths of the dataset, then predicts class labels for each member of the dataset using the models that did not train on it. The trained models are saved, as a Python list of trained sklearn models, to `[file root]_ten_models.pkl`. The indices used to split the disease data into tenths are saved in `[file root]_disease_indices.pkl`, and those used to split the control data in `[file root]_kg_indices.pkl`.

Predictions on the training data (from the models that did not train on each specific SV) are written in dictionary format to `[file root]_predictions.pkl`. Finally, AU(ROC/PRC) values on these data are calculated and written to `[file root]_ROC_PRC.txt`.
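Conceptually, the out-of-fold scoring works like the following sketch. This is not the repository's code (in particular, `rf_model.py` splits the disease and control rows separately); it only illustrates the idea of scoring each row with a forest that never saw it.

```python
# Conceptual sketch of 10-fold out-of-fold scoring with random forests;
# not the actual rf_model.py implementation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

def out_of_fold_scores(X, y, n_trees=150, max_depth=None, min_samples_split=2):
    """Train 10 forests on disjoint tenths and score each row with the
    forest that did not train on it."""
    scores = np.zeros(len(y))
    models = []
    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in kfold.split(X, y):
        clf = RandomForestClassifier(n_estimators=n_trees,
                                     max_depth=max_depth,
                                     min_samples_split=min_samples_split)
        clf.fit(X[train_idx], y[train_idx])
        scores[test_idx] = clf.predict_proba(X[test_idx])[:, 1]
        models.append(clf)
    return models, scores
```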
Note: The input matrix must be formatted such that all control-group rows follow all non-control rows (in other words, all rows with label 0 follow all rows with label 1). The easiest way to do this is to use `generate_feature_matrix.py` to generate the 0-label and 1-label matrices separately, then append the 0-label matrix (without its title row) to the 1-label matrix.
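For example, with hypothetical file names `disease_matrix.tsv` (1-label) and `control_matrix.tsv` (0-label), the append step could be done as

`cp disease_matrix.tsv combined_matrix.tsv && tail -n +2 control_matrix.tsv >> combined_matrix.tsv`

where `tail -n +2` drops the title row of the 0-label matrix.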
`src/load_model.py`

Tests the model on a matrix. Note that the matrix must have the same format as the matrices used to train the model (every feature in the same order as in the training data, with no additional features). To generate a matrix from a set of coordinates, see [here](#running-parts-of-the-pipeline).

You will need to run `pip3 install joblib` before this can be run. If any other dependencies are still missing, install them with pip3.

Usage: `python3 src/load_model.py -i [input matrix] -d [File of columns to ignore for model training; the same as columns_to_remove in the config file. remove_inds.txt should be fine for matrices generated from our scripts.] -m [The .pkl file with all 10 models; this is output by the Snakemake pipeline] -t [Index of the class label in the matrix - use 0 if the matrix was generated using our code] -o [Filename to write the predictions to]`
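If you prefer to score variants programmatically, a minimal sketch of loading the pickled model list and averaging the forests' probabilities might look like this. The file names and the averaging step are assumptions for illustration, not necessarily what `load_model.py` does.

```python
# Minimal sketch, assuming the .pkl file holds a Python list of trained
# sklearn random forests (as described above); file names are placeholders.
import joblib
import numpy as np

models = joblib.load("output_model__ten_models.pkl")  # list of 10 classifiers

# X must contain only the feature columns the models were trained on
# (drop the columns listed in columns_to_remove first).
X = np.loadtxt("new_variants_matrix.tsv", delimiter="\t", skiprows=1)

# Average the pathogenicity probability across all 10 forests.
scores = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
print(scores[:10])
```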
`roc_prc_gen.py`

Usage: `python3 roc_prc_gen.py [input file] [output file]`

Given a tab-separated two-column input file of predicted scores and true labels (with a header, as in the following example):

    Predicted	True
    0.755	1
    ...
    0.211	0

Outputs AU(ROC/PRC) information, along with a plot of the ROC, saved to the specified output file.
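If you want to compute the same summary numbers yourself, a minimal sketch with scikit-learn (assuming an input file in the format above; this is not necessarily how `roc_prc_gen.py` computes them) is:

```python
# Minimal sketch: compute AUROC/AUPRC from a two-column "Predicted<TAB>True"
# file; not necessarily the roc_prc_gen.py implementation.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

data = np.loadtxt("predictions.tsv", delimiter="\t", skiprows=1)  # placeholder name
scores, labels = data[:, 0], data[:, 1].astype(int)

print("AUROC:", roc_auc_score(labels, scores))
print("AUPRC:", average_precision_score(labels, scores))
```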