Use drug and gene embeddings to predict drug-gene interactions, with a focus on Africa.
The knowledge graph (KG) draws mainly on PharmGKB and the Bioteque.
To capture the main pharmacogenomic relationships and concepts in our KG, we have adapted the structure of PharmGKB and the lightweight pharmacogenomics ontology (PGxO). The graph below illustrates the main concepts captured in the KG, as well as the edges that relate them. The central piece of the KG is the pharmacogenomic relationship, which ideally describes the phenotypic effect a variant has on a specific drug. These interactions range from well-proven, annotated relations (evidence levels 0 to 3) to automated literature searches. In yellow, we depict the Base tables, which build upon PharmGKB's database. In green, we depict the relationship tables we have built from the Base tables. The PGX Relationship table, depicted in orange, draws from all of them to integrate the different concepts.
graph LR
A(Variant):::YellowClass --> C(Genomic Variation):::GreenClass
B(Haplotype):::YellowClass --> C(Genomic Variation)
C --> D(Genetic Factor):::GreenClass
E(Gene):::YellowClass --> D
F(Chemicals):::YellowClass --> G(Pharmacogenomic Relationship):::OrangeClass
D --> G
H(Evidence):::YellowClass --> G
I(Biogeographical Group):::YellowClass --> G
G --> J(Phenotype):::GreenClass
J --> K(Pharmacodynamic phenotype):::YellowClass
J --> L(Pharmacokinetic phenotype):::YellowClass
J --> M(Disease):::YellowClass
classDef OrangeClass fill:#faa08c,stroke:#50285a,stroke-width:2px
classDef GreenClass fill:#bee6b4,stroke:#50285a,stroke-width:2px
classDef YellowClass fill:#fad782,stroke:#50285a,stroke-width:2px
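For readers who prefer to inspect the schema programmatically, the diagram above can be written down as a plain edge list. This is a minimal sketch, not code from this repository: the names `KG_SCHEMA_EDGES`, `build_adjacency` and `upstream_concepts` are hypothetical, and only the concept names and edge directions are taken from the diagram.

```python
from collections import defaultdict

# Directed edges copied from the diagram above (source concept -> target concept).
KG_SCHEMA_EDGES = [
    ("Variant", "Genomic Variation"),
    ("Haplotype", "Genomic Variation"),
    ("Genomic Variation", "Genetic Factor"),
    ("Gene", "Genetic Factor"),
    ("Chemicals", "Pharmacogenomic Relationship"),
    ("Genetic Factor", "Pharmacogenomic Relationship"),
    ("Evidence", "Pharmacogenomic Relationship"),
    ("Biogeographical Group", "Pharmacogenomic Relationship"),
    ("Pharmacogenomic Relationship", "Phenotype"),
    ("Phenotype", "Pharmacodynamic phenotype"),
    ("Phenotype", "Pharmacokinetic phenotype"),
    ("Phenotype", "Disease"),
]


def build_adjacency(edges):
    """Map each source concept to the list of concepts it points to."""
    adjacency = defaultdict(list)
    for source, target in edges:
        adjacency[source].append(target)
    return dict(adjacency)


def upstream_concepts(edges, concept):
    """Return every concept that can reach `concept` by following edges."""
    parents = defaultdict(set)
    for source, target in edges:
        parents[target].add(source)
    seen, stack = set(), [concept]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    # Concepts that feed, directly or indirectly, into the central relationship.
    print(sorted(upstream_concepts(KG_SCHEMA_EDGES, "Pharmacogenomic Relationship")))
```

Walking the edges backwards from the Pharmacogenomic Relationship node lists the tables it integrates, which matches the description of it as the central piece of the KG.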
Since PharmGKB is the major source of information for the KG, we detail below a few of the assumptions made during data curation and processing to build the tables found in the corresponding folder. This is only an overview of the data curation process; please refer to the code for more details.
These tables are merged into the PGX relationship table. At this moment, the PGX relationship table mainly encompasses the following information: annotation ID (unique identifier of the relationship described), genomic variation (variant or haplotype) and its ID, gene and its ID, chemical and its ID, PK and PD phenotype, evidence level, association (significance of the association; Yes: 1 / No: -1 / Inconclusive: 0) and biogeographical group.
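As an illustration of the fields listed above, one row of the PGX relationship table could be modelled as follows. This is a sketch under stated assumptions: the class name `PGxRelationship`, the field names and the helper `encode_association` are hypothetical and may not match the column names used in the repository, and the example values are placeholders rather than real PharmGKB records. Only the association encoding (Yes: 1, No: -1, Inconclusive: 0) comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Association encoding as described in the text above.
ASSOCIATION_CODES = {"yes": 1, "no": -1, "inconclusive": 0}


def encode_association(label: str) -> int:
    """Convert a textual association label into its numeric code."""
    key = label.strip().lower()
    if key not in ASSOCIATION_CODES:
        raise ValueError(f"Unknown association label: {label!r}")
    return ASSOCIATION_CODES[key]


@dataclass(frozen=True)
class PGxRelationship:
    """One row of the PGX relationship table (hypothetical field names)."""
    annotation_id: str
    genomic_variation: str           # variant or haplotype name
    genomic_variation_id: str
    gene: str
    gene_id: str
    chemical: str
    chemical_id: str
    pk_phenotype: Optional[str]      # pharmacokinetic phenotype, if annotated
    pd_phenotype: Optional[str]      # pharmacodynamic phenotype, if annotated
    evidence_level: str
    association: int                 # 1, -1 or 0
    biogeographical_group: Optional[str]


# Placeholder values for illustration only.
example_row = PGxRelationship(
    annotation_id="ANNOTATION_PLACEHOLDER",
    genomic_variation="VARIANT_PLACEHOLDER",
    genomic_variation_id="VARIANT_ID_PLACEHOLDER",
    gene="GENE_PLACEHOLDER",
    gene_id="GENE_ID_PLACEHOLDER",
    chemical="CHEMICAL_PLACEHOLDER",
    chemical_id="CHEMICAL_ID_PLACEHOLDER",
    pk_phenotype=None,
    pd_phenotype=None,
    evidence_level="EVIDENCE_PLACEHOLDER",
    association=encode_association("Yes"),
    biogeographical_group=None,
)
```

Keeping the association as a signed integer makes it easy to filter for positive, negative or inconclusive relationships when pairs are later built from the table.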
LLMs are used to re-rank the predictions and offer an explanation for them. We mainly used gpt-4-turbo.
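The re-ranking idea can be sketched independently of any particular model provider. In the sketch below, `ask_llm` is a hypothetical callable supplied by the user that sends a prompt to a model (for example gpt-4-turbo) and returns a score and an explanation; the prompt wording, the 0 to 1 score range and the names `build_prompt`, `rerank_pairs` and `RankedPair` are assumptions for illustration and are not taken from this repository.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class RankedPair(NamedTuple):
    drug: str
    gene: str
    score: float
    explanation: str


def build_prompt(drug: str, gene: str) -> str:
    """Build an illustrative prompt for one drug-gene pair."""
    return (
        "Assess how plausible a pharmacogenomic interaction is between "
        f"the drug '{drug}' and the gene '{gene}'. "
        "Give a score between 0 and 1 and a one-sentence explanation."
    )


def rerank_pairs(
    pairs: Sequence[Tuple[str, str]],
    ask_llm: Callable[[str], Tuple[float, str]],
) -> List[RankedPair]:
    """Score each (drug, gene) pair with `ask_llm` and sort by descending score."""
    ranked = []
    for drug, gene in pairs:
        score, explanation = ask_llm(build_prompt(drug, gene))
        ranked.append(RankedPair(drug, gene, float(score), explanation))
    # sorted() is stable, so pairs with equal scores keep their input order.
    return sorted(ranked, key=lambda item: item.score, reverse=True)
```

Passing the model call in as a function keeps the ranking logic testable with a stub and leaves the choice of client library and model to the caller.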
assets: graphical assets for the project
data: contains all the .csv and .tsv files, including those directly downloaded from open repositories
scripts: contains .py scripts to analyse the data
notebooks: testing notebooks and examples
src: source code for the building of the KG
Please note that several folders will be generated when you run the pipeline below, including the embeddings, models and results folders. Also, please note that the data folder will be modified.
scripts/0_process_pharmgkb:
scripts/1_preparation:
scripts/2_pairs to
scripts/3_llm_assessment to rerank drug-gene pair predictions based on large language models.
scripts/4_post_analyses to produce analyses on the results, including statistics and plots.
The code in this repository is licensed under the GNU General Public License v3.0. The data comes from public repositories and is subject to the licenses stated by the original data producers.
The Ersilia Open Source Initiative is a Non Profit Organization (1192266) whose mission is to equip labs, universities and clinics in low- and middle-income countries (LMIC) with AI/ML tools for infectious disease research.