This project aims to predict the side effects of drugs from their chemical and structural features, their interactions with the (metabolic) network of genes, and their (structural) similarity relationships. Drugs and genes are represented as nodes of a heterogeneous graph. Gene-gene interactions, drug-gene interactions, and drug-drug similarity relationships are represented as distinct sets of edges in the graph. The associations between drugs and side effects are predicted through a multilabel node classification task, in which each drug node is classified over the set of side effects it can be associated with. Drug features include chemical descriptors and molecular fingerprints (vectors describing the functional groups that appear in each drug). Gene features include chromosome and strand information and the molecular function ontology. Transductive learning experiments were carried out: in this framework, the side-effect labels of known drugs were used as transductive features, in order to better train the network for the task it has to solve (finding the side effects of new drugs, given the side effects of drugs observed in the past).
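To make the data layout described above concrete, here is a minimal sketch of a heterogeneous graph with three typed edge sets and multi-hot side-effect targets for the drug nodes. All node names, edge-type keys, and side-effect names are hypothetical toy values chosen for illustration; this is not the data format used by the DruGNN code.

```python
# Toy heterogeneous graph: drug and gene nodes share one index space.
# All names below are hypothetical placeholders, not real dataset entries.
drugs = ["drug_a", "drug_b"]
genes = ["gene_x", "gene_y"]
nodes = drugs + genes
index = {name: i for i, name in enumerate(nodes)}

# One edge list per relation type, kept separate as in the description.
edges = {
    "gene-gene": [("gene_x", "gene_y")],
    "drug-gene": [("drug_a", "gene_x"), ("drug_b", "gene_y")],
    "drug-drug": [("drug_a", "drug_b")],
}
edge_index = {
    kind: [(index[u], index[v]) for u, v in pairs]
    for kind, pairs in edges.items()
}

# Multilabel targets: one binary entry per side effect, only for drug nodes.
side_effects = ["headache", "nausea", "rash"]
labels = {"drug_a": {"nausea", "headache"}, "drug_b": {"rash"}}
targets = {
    drug: [1 if effect in labels[drug] else 0 for effect in side_effects]
    for drug in drugs
}

print(edge_index["drug-gene"])
print(targets)
```

Each drug gets a binary vector with one position per side effect, so a single drug can be positive for several classes at once, which is what distinguishes the multilabel setting from ordinary multiclass classification.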
In the output folder of the dataset, you can find some example files. They are named Soglia_X and correspond to outputs obtained on different versions of the dataset. Each of these versions is obtained with a different threshold on the minimum frequency of side effects. A file named Soglia_X is always obtained by considering only the side effects that are associated with at least X drugs in SIDER. For instance, Soglia_50 is obtained by considering only the side effects that are associated with at least 50 drugs in SIDER.
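The frequency threshold described above can be sketched as follows. This is a minimal illustration under assumed inputs: the function name `filter_side_effects`, the dictionary layout, and the toy drugs are hypothetical and do not come from the repository.

```python
from collections import Counter


def filter_side_effects(drug_to_effects, min_drugs):
    """Keep only side effects associated with at least `min_drugs` drugs."""
    # Count each side effect once per drug, even if listed twice.
    counts = Counter(
        effect
        for effects in drug_to_effects.values()
        for effect in set(effects)
    )
    kept = {effect for effect, n in counts.items() if n >= min_drugs}
    return {
        drug: sorted(set(effects) & kept)
        for drug, effects in drug_to_effects.items()
    }


# Hypothetical toy associations (not taken from SIDER).
toy = {
    "drug_a": ["nausea", "headache"],
    "drug_b": ["nausea", "rash"],
    "drug_c": ["nausea", "headache"],
}
filtered = filter_side_effects(toy, min_drugs=2)
print(filtered)
```

With `min_drugs=2`, "rash" is dropped because only one toy drug lists it; a Soglia_50 dataset would correspond to `min_drugs=50` applied to the full SIDER associations.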
This project was published in the IEEE/ACM Transactions on Computational Biology and Bioinformatics.
You can find the paper here: https://ieeexplore.ieee.org/abstract/document/9775571
If you use this work for any public project, please cite:
P. Bongini, F. Scarselli, M. Bianchini, G. M. Dimitri, N. Pancino and P. Lió, "Modular Multi–Source Prediction of Drug Side–Effects With DruGNN," in IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 20, no. 2, pp. 1211-1220, 1 March-April 2023, doi: 10.1109/TCBB.2022.3175362
Niccolò Pancino et al. Graph Neural Networks for the Prediction of Protein-Protein Interfaces, 28th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning.
Work in progress
Thomas Kipf, Max Welling. Semi-supervised classification with Graph Convolutional Networks, International Conference on Learning Representations, 2017.
Daniele Grattarola, Cesare Alippi. Graph Neural Networks in TensorFlow and Keras with Spektral. International Conference on Machine Learning, Graph Representation Learning workshop, 2020.
Martín Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
The graph was built from data drawn from multiple sources.
Katja Luck et al. A reference map of the human binary protein interactome. Nature 580: 402-408, 2020.
Damian Smedley et al. The BioMart community portal: an innovative alternative to large centralized data repositories. Nucleic Acids Research 43(W1):W589-W598, 2015.
Michael Kuhn et al. STITCH: interaction networks of chemicals and proteins. Nucleic Acids Research 36(Suppl_1):D684-D688, 2008.
Michael Kuhn et al. The SIDER database of drugs and side effects. Nucleic Acids Research 44(D1):D1075-D1079, 2016.
Sunghwan Kim et al. PubChem in 2021: new data content and improved web interfaces. Nucleic Acids Research 49(D1):D1388–D1395, 2021. doi:10.1093/nar/gkaa971
Drug-drug links were based on the Tanimoto similarity of the chemical fingerprints of drugs, obtained with RDKit. Fingerprints were also used as additional drug features: https://www.rdkit.org/
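As an illustration of how Tanimoto similarity can turn fingerprints into drug-drug links, here is a minimal sketch in plain Python. It represents each fingerprint as the set of its "on" bit positions instead of calling RDKit, and the cutoff `threshold` is a hypothetical parameter: the text above does not state which cutoff, if any, was used in the project.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(a & b) / union


def similarity_edges(fingerprints, threshold):
    """Link every pair of drugs whose similarity reaches `threshold` (assumed cutoff)."""
    names = sorted(fingerprints)
    edges = []
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            if tanimoto(fingerprints[u], fingerprints[v]) >= threshold:
                edges.append((u, v))
    return edges


# Hypothetical toy fingerprints (on-bit positions), not real molecules.
toy_fps = {
    "d1": {1, 2, 3, 4},
    "d2": {2, 3, 4, 5},
    "d3": {7, 8},
}
print(tanimoto(toy_fps["d1"], toy_fps["d2"]))   # 3 shared bits / 5 total = 0.6
print(similarity_edges(toy_fps, threshold=0.5))
```

The score is the number of shared on-bits divided by the number of bits set in either fingerprint, so it ranges from 0 (nothing in common) to 1 (identical fingerprints).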
Molecular function features, based on the Gene Ontology, were extracted with DAVID: http://geneontology.org/ https://david.ncifcrf.gov/tools.jsp
Michael Ashburner et al. Gene ontology: tool for the unification of biology. Nature Genetics 25(1):25-9, 2000.
Glynn Dennis Jr. et al. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology 4:R60, 2003.
As some of the data sources are too large to fit in a GitHub repository, you need to download some of them before running DruGNN.