DeepCOMBI: Explainable artificial intelligence for the analysis and discovery in genome-wide association studies

A Python framework for the analysis of GWAS data with special focus on explainable artificial intelligence.

This repository contains an implementation of the DeepCOMBI method described here. DeepCOMBI is a neural-network-based method for identifying SNP-trait associations in GWAS datasets. It extends COMBI, an SVM-based GWAS tool, which is described here.

This software package also contains methods for generating artificial GWAS data to analyze with DeepCOMBI.

Developed by Alexandre Rozier and Bettina Mieth.

Publication

The Python framework and this website are part of a publication currently under peer review at Nucleic Acids Research. The preprint is available here. Links to the published paper will be added once available.

Abstract

Deep learning has revolutionized data science in many fields by greatly improving prediction performances in comparison to conventional approaches. Recently, explainable artificial intelligence has emerged as a novel area of research that goes beyond pure prediction improvement by extracting knowledge from deep learning methodologies through the interpretation of their results. We investigate such explanations to explore the genetic architectures of phenotypes in genome-wide association studies. Instead of testing each position in the genome individually, the novel three-step algorithm, called DeepCOMBI, first trains a neural network for the classification of subjects into their respective phenotypes. Second, it explains the classifiers’ decisions by applying layerwise relevance propagation as one example from the pool of explanation techniques. The resulting importance scores are eventually used to determine a subset of most relevant locations for multiple hypothesis testing in the third step. The performance of DeepCOMBI in terms of power and precision is investigated on generated datasets and a 2007 study. Verification of the latter is achieved by validating all findings with independent studies published up until 2020. DeepCOMBI is shown to outperform ordinary raw p-value thresholding and other baseline methods. Two novel disease associations (rs10889923 for hypertension, rs4769283 for type 1 diabetes) were identified.
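
The third step described above, restricting hypothesis testing to the locations with the highest importance scores, can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `screen_pvalues`, the parameter `k`, and the convention of setting discarded p-values to 1 are assumptions made for the example, and the relevance scores and p-values are made-up numbers.

```python
import numpy as np

def screen_pvalues(relevance, pvalues, k):
    """Keep p-values only for the k SNPs with the highest relevance.

    Hypothetical helper: all other SNPs get p = 1 (an assumed convention),
    so they can never pass a significance threshold.
    """
    relevance = np.asarray(relevance, dtype=float)
    pvalues = np.asarray(pvalues, dtype=float)
    if relevance.shape != pvalues.shape:
        raise ValueError("relevance and pvalues must have the same shape")
    # Indices of the k largest relevance scores.
    top_k = np.argsort(-relevance)[:k]
    screened = np.ones_like(pvalues)
    screened[top_k] = pvalues[top_k]
    return screened

# Made-up scores for five SNPs (illustrative values only).
relevance = [0.02, 0.90, 0.10, 0.75, 0.05]
pvalues = [1e-6, 1e-8, 0.30, 1e-4, 1e-7]
screened = screen_pvalues(relevance, pvalues, k=2)
```

In this toy input the first and last SNPs have small raw p-values but low relevance, so they are excluded from testing; only the second and fourth SNPs keep their p-values.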

How to run DeepCOMBI

Replicating experiments

In the course of our research (Mieth et al.), we compared the performance of the proposed method with the most important baseline methods, first in a simulation study on generated data and second on real data (Wellcome Trust Case Control Consortium (2007) Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145), 661–678). To fully reproduce the experiments of our study, please follow the corresponding instructions for applying DeepCOMBI to both generated and real datasets.

On generated synthetic datasets

AA AA CG GG
AT AA GG GG
TT AT CC GT

Converting your own PLINK files to this format should be straightforward.
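
A file in the layout shown above could be loaded as sketched below, reading it as one subject per row and one two-letter genotype per SNP. That reading of the layout is an assumption, the function names are hypothetical, and counting copies of a chosen allele is only one common encoding, not necessarily the one DeepCOMBI uses internally.

```python
import numpy as np

# The three example rows from above, assumed to be subjects x SNPs.
SAMPLE = """AA AA CG GG
AT AA GG GG
TT AT CC GT"""

def load_genotypes(text):
    """Parse whitespace-separated two-letter genotypes into a 2-D string array."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n_snps = len(rows[0])
    for row in rows:
        if len(row) != n_snps:
            raise ValueError("every subject needs the same number of SNPs")
        if any(len(g) != 2 for g in row):
            raise ValueError("each genotype must consist of two alleles")
    return np.array(rows)

def count_allele(genotypes, snp, allele):
    """Number of copies (0, 1 or 2) of `allele` at column `snp`, per subject."""
    return np.array([g.count(allele) for g in genotypes[:, snp]])

genotypes = load_genotypes(SAMPLE)
t_counts = count_allele(genotypes, snp=0, allele="T")
```

For the sample rows, `genotypes` has one row per subject and one column per SNP, and `t_counts` holds the number of `T` alleles each subject carries at the first SNP.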

On your own dataset or the 2007 WTCCC dataset

AA AA CG GG
AT AA GG GG
TT AT CC GT

Converting your PLINK files to this format should be straightforward.
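
For PLINK text files, the conversion mentioned above might look like the sketch below. It relies on the standard `.ped` layout: six leading columns (family ID, individual ID, paternal ID, maternal ID, sex, phenotype) followed by two allele columns per SNP. The function name and the sample line are made up for illustration, and the repository may provide its own conversion routine.

```python
def ped_line_to_genotypes(line):
    """Convert one line of a PLINK .ped file into (phenotype, genotype string).

    Hypothetical helper for illustration only.
    """
    fields = line.split()
    if len(fields) < 6:
        raise ValueError("a .ped line needs at least six leading columns")
    phenotype = fields[5]
    alleles = fields[6:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two allele columns per SNP")
    # Join consecutive allele pairs: ["A", "T", "G", "G"] -> "AT GG"
    pairs = [alleles[i] + alleles[i + 1] for i in range(0, len(alleles), 2)]
    return phenotype, " ".join(pairs)

# Made-up .ped line: FID IID PAT MAT SEX PHENOTYPE, then two alleles per SNP.
sample_line = "FAM1 IND1 0 0 1 2 A T A A G G G G"
phenotype, genotypes = ped_line_to_genotypes(sample_line)
```

Applied to every line of a `.ped` file, this yields one genotype row per subject in the layout shown above, with the phenotype column kept separately as the class label.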