DeepCOMBI: Explainable artificial intelligence for the analysis and discovery in genome-wide association studies

A Python framework for the analysis of GWAS data with special focus on explainable artificial intelligence.

This repository contains an implementation of the DeepCOMBI method from here. DeepCOMBI is a neural-network-based method for identifying SNP-trait associations in GWAS datasets. It extends COMBI, an SVM-based GWAS tool, which is described here.

This software package also contains methods for generating artificial GWAS data to analyze with DeepCOMBI.

Developed by Alexandre Rozier and Bettina Mieth.

Publication

The Python framework and this website are part of a publication currently under peer review at Nucleic Acids Research. The preprint is available here. Links to the published paper will be added once available.

Abstract

Deep learning has revolutionized data science in many fields by greatly improving prediction performances in comparison to conventional approaches. Recently, explainable artificial intelligence has emerged as a novel area of research that goes beyond pure prediction improvement by extracting knowledge from deep learning methodologies through the interpretation of their results. We investigate such explanations to explore the genetic architectures of phenotypes in genome-wide association studies. Instead of testing each position in the genome individually, the novel three-step algorithm, called DeepCOMBI, first trains a neural network for the classification of subjects into their respective phenotypes. Second, it explains the classifiers’ decisions by applying layerwise relevance propagation as one example from the pool of explanation techniques. The resulting importance scores are eventually used to determine a subset of most relevant locations for multiple hypothesis testing in the third step. The performance of DeepCOMBI in terms of power and precision is investigated on generated datasets and a 2007 study. Verification of the latter is achieved by validating all findings with independent studies published up until 2020. DeepCOMBI is shown to outperform ordinary raw p-value thresholding and other baseline methods. Two novel disease associations (rs10889923 for hypertension, rs4769283 for type 1 diabetes) were identified.
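The third step described above, restricting hypothesis testing to the locations with the highest importance scores, can be illustrated with a minimal sketch. It assumes the relevance scores and per-SNP p-values have already been computed. The function name and the parameters k and threshold are hypothetical and do not come from the DeepCOMBI code base. How DeepCOMBI computes the scores and chooses the threshold is not shown here.

```python
from typing import List, Sequence


def select_and_test(relevance: Sequence[float],
                    p_values: Sequence[float],
                    k: int,
                    threshold: float) -> List[int]:
    """Return indices of SNPs that are among the k highest relevance
    scores and whose p-value is below the threshold (both hypothetical)."""
    if len(relevance) != len(p_values):
        raise ValueError("relevance and p_values must have equal length")
    # Rank SNP indices by relevance score, largest first.
    ranked = sorted(range(len(relevance)),
                    key=lambda i: relevance[i], reverse=True)
    candidates = ranked[:k]
    # Only the pre-selected candidates are compared with the threshold.
    return sorted(i for i in candidates if p_values[i] < threshold)


# Illustrative values only, not real data.
relevance = [0.1, 0.9, 0.05, 0.7, 0.3]
p_values = [1e-6, 1e-5, 1e-7, 0.2, 1e-4]
significant = select_and_test(relevance, p_values, k=2, threshold=1e-3)
```

In this toy example, the SNP at index 2 has the smallest p-value but is not tested, because its relevance score does not place it among the top k candidates.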

How to run DeepCOMBI

Replicating experiments

In the course of our research (Mieth et al.), we compared the proposed method with the most important baseline methods, first in a simulation study on generated data and then on real data (Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature. 447(7145), 661–678). To fully reproduce the experiments of our study, please follow the corresponding instructions for applying DeepCOMBI to both generated and real datasets.

On generated synthetic datasets

Run ROOT_DIR=$PWD SGE_TASK_ID=1 python -m pytest -s tests/test_data_generation.py::TestDataGeneration::test_synthetic_genotypes_generation --rep 1000 to generate rep (here 1000) different genotypes, which are saved in data/synthetic/genomic.h5py.

Please note that generating datasets requires two real datasets to sample from. We use the WTCCC data and randomly select 300 subjects from the Crohn's disease dataset. We draw a random block of 20 consecutive SNPs from chromosome 1 and a random block of 10,000 consecutive SNPs from chromosome 2. The process is described in detail on page 6 of our manuscript.

Unfortunately, we are not authorized to publish this data, so you will have to save your own datasets in the corresponding .mat files. Each .mat file should be a simple array of characters in which the number of rows equals the number of subjects and the number of columns equals the number of SNPs * 3 (two letters for the genotype and one space). A small excerpt with three subjects and the genotypes of four SNPs would look like this:

AA AA CG GG
AT AA GG GG
TT AT CC GT

Converting your own Plink files should be straightforward.
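As a rough illustration of the character layout described above, the following sketch encodes per-SNP genotypes into fixed-width rows and decodes them again. The helper names are hypothetical and not part of this repository. It assumes every SNP, including the last one, is followed by a space, which matches the stated column count of SNPs * 3. Writing the rows to a .mat file (for example with scipy.io.savemat) is left out, because the variable name expected by the loader is not stated here.

```python
from typing import List

CHARS_PER_SNP = 3  # two allele letters plus one separating space


def encode_row(genotypes: List[str]) -> str:
    """Join two-letter genotypes into one fixed-width character row."""
    for g in genotypes:
        if len(g) != 2:
            raise ValueError(f"expected a two-letter genotype, got {g!r}")
    # Each SNP contributes exactly CHARS_PER_SNP characters.
    return "".join(g + " " for g in genotypes)


def decode_row(row: str) -> List[str]:
    """Split a fixed-width character row back into two-letter genotypes."""
    if len(row) % CHARS_PER_SNP != 0:
        raise ValueError("row length must be a multiple of 3")
    return [row[i:i + 2] for i in range(0, len(row), CHARS_PER_SNP)]


# The three subjects and four SNPs from the example above.
subjects = [
    ["AA", "AA", "CG", "GG"],
    ["AT", "AA", "GG", "GG"],
    ["TT", "AT", "CC", "GT"],
]
rows = [encode_row(s) for s in subjects]
```

Every row has the same length (number of SNPs * 3), so the rows can be stacked into a rectangular character array with one row per subject.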

Run ROOT_DIR=$PWD SGE_TASK_ID=1 python -m pytest -s tests/test_data_generation.py::TestDataGeneration::test_feature_map_generation to generate the feature matrices associated with the genomic datasets previously created in data/synthetic/genomic.h5py. They are saved in data/synthetic/2d_fm.h5py and data/synthetic/3d_fm.h5py.
To create Figure 2 of the paper, run ROOT_DIR=$PWD SGE_TASK_ID=1 python -m pytest -s tests/test_deepcombi.py::TestDeepCOMBI::test_lrp_svm --rep 1 to plot three exemplary runs. The figure is saved in img_dir/.
To create Figure 3 of the paper, run ROOT_DIR=$PWD SGE_TASK_ID=1 python -m pytest -s tests/test_deepcombi.py::TestDeepCOMBI::test_tpr_fwer_alex --rep 1000 to plot the performance curves of DeepCOMBI and its competitors.
To generate Table 1 of the paper, run ROOT_DIR=$PWD SGE_TASK_ID=1 python -m pytest -s tests/test_deepcombi.py::TestDeepCOMBI::test_svm_cnn_comparison_alex --rep 1000 to investigate the prediction accuracies of the SVM and the DNN on the generated datasets.
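The commands above all share the same environment variables and differ only in the test ID and the --rep value, so they can be scripted. The following is a minimal sketch of such a wrapper. The helper names (build_command, build_env, run_step) are hypothetical and not part of this repository. It uses the current interpreter (sys.executable) in place of python, and assumes it is run from the repository root.

```python
import os
import subprocess
import sys
from typing import Dict, List, Optional

# Test IDs and --rep values taken from the commands listed above.
STEPS = [
    ("tests/test_data_generation.py::TestDataGeneration::test_synthetic_genotypes_generation", 1000),
    ("tests/test_data_generation.py::TestDataGeneration::test_feature_map_generation", None),
    ("tests/test_deepcombi.py::TestDeepCOMBI::test_lrp_svm", 1),
    ("tests/test_deepcombi.py::TestDeepCOMBI::test_tpr_fwer_alex", 1000),
    ("tests/test_deepcombi.py::TestDeepCOMBI::test_svm_cnn_comparison_alex", 1000),
]


def build_command(test_id: str, rep: Optional[int] = None) -> List[str]:
    """Assemble the pytest invocation for one step."""
    cmd = [sys.executable, "-m", "pytest", "-s", test_id]
    if rep is not None:
        cmd += ["--rep", str(rep)]
    return cmd


def build_env(root_dir: str, task_id: int = 1) -> Dict[str, str]:
    """Copy the current environment and set ROOT_DIR and SGE_TASK_ID."""
    env = dict(os.environ)
    env["ROOT_DIR"] = root_dir
    env["SGE_TASK_ID"] = str(task_id)
    return env


def run_step(test_id: str, rep: Optional[int] = None,
             root_dir: Optional[str] = None) -> subprocess.CompletedProcess:
    """Run one step and raise CalledProcessError if pytest fails."""
    root = root_dir or os.getcwd()
    return subprocess.run(build_command(test_id, rep),
                          env=build_env(root), cwd=root, check=True)
```

Calling run_step(test_id, rep) for each entry of STEPS in order would reproduce the sequence above. Nothing is executed on import, so individual steps can also be run selectively.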

On your own dataset or the 2007 WTCCC dataset

The data should be saved in the folder data/. Each .mat file should be a simple array of characters in which the number of rows equals the number of subjects and the number of columns equals the number of SNPs * 3 (two letters for the genotype and one space). A small excerpt with three subjects and the genotypes of four SNPs would look like this: