ersilia-os / pharmacogx-embeddings

Pharmacogenomics knowledge graph embeddings and related analyses
GNU General Public License v3.0

PharmacoGx Embeddings

Use drug and gene embeddings to predict drug-gene interactions, with a focus on Africa.

Project overview

Knowledge graph

The knowledge graph draws mainly on PharmGKB and the Bioteque.

Ontology

To capture the main pharmacogenomic relationships and concepts in our knowledge graph (KG), we have adapted the structure of PharmGKB and the lightweight pharmacogenomics ontology (PGxO). The graph below illustrates the main concepts captured in the KG, as well as the edges that relate them. The central piece of the KG is the pharmacogenomic relationship, which ideally describes the phenotypic effect that a variant has on a specific drug. These relationships range from well-proven, annotated relations (evidence levels 0 to 3) to the results of automated literature searches. Base tables, which build upon PharmGKB's database, are shown in yellow. Relationship tables that we have built from the base tables are shown in green. The PGX relationship table, shown in orange, draws on all of them to integrate the different concepts.

graph LR
    A(Variant):::YellowClass --> C(Genomic Variation):::GreenClass
    B(Haplotype):::YellowClass --> C(Genomic Variation)
    C --> D(Genetic Factor):::GreenClass
    E(Gene):::YellowClass --> D
    F(Chemicals):::YellowClass --> G(Pharmacogenomic Relationship):::OrangeClass
    D --> G
    H(Evidence):::YellowClass --> G
    I(Biogeographical Group):::YellowClass --> G
    G --> J(Phenotype):::GreenClass
    J --> K(Pharmacodynamic phenotype):::YellowClass
    J --> L(Pharmacokinetic phenotype):::YellowClass
    J --> M(Disease):::YellowClass

    classDef OrangeClass fill:#faa08c,stroke:#50285a,stroke-width:2px
    classDef GreenClass fill:#bee6b4,stroke:#50285a,stroke-width:2px
    classDef YellowClass fill:#fad782,stroke:#50285a,stroke-width:2px
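
For readers who want to inspect the ontology programmatically, the diagram above can be written as a plain edge list. The following is a minimal sketch and not code from this repository: the edges are copied from the diagram, while the names `EDGES`, `build_predecessors` and `ancestors` are hypothetical helpers introduced here for illustration.

```python
from collections import defaultdict

# Directed edges copied from the ontology diagram (source -> target).
EDGES = [
    ("Variant", "Genomic Variation"),
    ("Haplotype", "Genomic Variation"),
    ("Genomic Variation", "Genetic Factor"),
    ("Gene", "Genetic Factor"),
    ("Chemicals", "Pharmacogenomic Relationship"),
    ("Genetic Factor", "Pharmacogenomic Relationship"),
    ("Evidence", "Pharmacogenomic Relationship"),
    ("Biogeographical Group", "Pharmacogenomic Relationship"),
    ("Pharmacogenomic Relationship", "Phenotype"),
    ("Phenotype", "Pharmacodynamic phenotype"),
    ("Phenotype", "Pharmacokinetic phenotype"),
    ("Phenotype", "Disease"),
]


def build_predecessors(edges):
    """Map each concept to the set of concepts that point directly at it."""
    preds = defaultdict(set)
    for source, target in edges:
        preds[target].add(source)
    return preds


def ancestors(node, preds):
    """Return every concept that has a directed path leading to `node`."""
    seen = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for parent in preds.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


if __name__ == "__main__":
    preds = build_predecessors(EDGES)
    # Concepts that feed, directly or indirectly, into the central relationship.
    for concept in sorted(ancestors("Pharmacogenomic Relationship", preds)):
        print(concept)
```

Running the script lists the concepts that feed into the pharmacogenomic relationship, which is one way to check that a table schema covers every input shown in the diagram.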

Curation of PharmGKB

Since PharmGKB is the major source of information for the KG, we detail below a few of the assumptions made during data curation and processing to build the tables, which can be found in the corresponding folder. This is only an overview of the data curation process; please refer to the code for more details.

These tables are merged into the PGX relationship table. At the moment, this table mainly contains the following information: annotation ID (unique identifier of the relationship described), genomic variation (variant or haplotype) and its ID, gene and its ID, chemical and its ID, PK and PD phenotypes, evidence level, association (significance of the association; Yes: 1, No: -1, Inconclusive: 0) and biogeographical group.
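
To make the shape of one row concrete, here is a minimal sketch of a record type for the PGX relationship table. It is an illustration under stated assumptions, not the schema used in the code: the field names, the `PGxRelationship` class and the `encode_association` helper are hypothetical, and the example values are placeholders. Only the list of fields and the Yes/No/Inconclusive encoding come from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Encoding of the association significance, as described in the text.
ASSOCIATION_CODES = {"yes": 1, "no": -1, "inconclusive": 0}


def encode_association(label: str) -> int:
    """Convert a Yes / No / Inconclusive label to 1 / -1 / 0."""
    key = label.strip().lower()
    if key not in ASSOCIATION_CODES:
        raise ValueError(f"Unknown association label: {label!r}")
    return ASSOCIATION_CODES[key]


@dataclass(frozen=True)
class PGxRelationship:
    """One row of the PGX relationship table (field names are assumptions)."""

    annotation_id: str            # unique identifier of the relationship
    genomic_variation: str        # variant or haplotype
    genomic_variation_id: str
    gene: str
    gene_id: str
    chemical: str
    chemical_id: str
    pk_phenotype: Optional[str]   # pharmacokinetic phenotype, if any
    pd_phenotype: Optional[str]   # pharmacodynamic phenotype, if any
    evidence_level: str
    association: int              # 1, -1 or 0 (see ASSOCIATION_CODES)
    biogeographical_group: Optional[str]


if __name__ == "__main__":
    # Placeholder values only; these are not real PharmGKB identifiers.
    row = PGxRelationship(
        annotation_id="ANNOTATION_1",
        genomic_variation="VARIANT_1",
        genomic_variation_id="VARIANT_ID_1",
        gene="GENE_1",
        gene_id="GENE_ID_1",
        chemical="CHEMICAL_1",
        chemical_id="CHEMICAL_ID_1",
        pk_phenotype=None,
        pd_phenotype="PD_PHENOTYPE_1",
        evidence_level="EVIDENCE_LEVEL_1",
        association=encode_association("Yes"),
        biogeographical_group=None,
    )
    print(row)
```

Keeping the association as a signed integer lets positive, negative and inconclusive findings be filtered or aggregated with simple numeric comparisons.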

Use of Large Language Models (LLMs)

LLMs are used to re-rank the predictions and to offer an explanation for them. We mainly used gpt-4-turbo.
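
The re-ranking step can be pictured as: build a prompt from a drug and its candidate genes, ask a model for an ordered list with explanations, then validate the reply. The sketch below illustrates that flow only and is not the prompt or parsing logic used in this repository. The function names, the prompt wording and the JSON reply format are all assumptions, and the model call is passed in as a plain callable (`ask_llm`) so that no particular client library is implied.

```python
import json
from typing import Callable, Dict, List


def build_rerank_prompt(drug: str, genes: List[str]) -> str:
    """Build a hypothetical prompt asking for a ranked, explained gene list."""
    gene_lines = "\n".join(f"- {gene}" for gene in genes)
    return (
        f"Drug: {drug}\n"
        f"Candidate genes:\n{gene_lines}\n"
        "Rank the genes from most to least likely to have a pharmacogenomic "
        "interaction with the drug. Reply with a JSON list of objects with "
        'the keys "gene" and "explanation".'
    )


def rerank(drug: str, genes: List[str],
           ask_llm: Callable[[str], str]) -> List[Dict[str, str]]:
    """Re-rank `genes` using the reply of `ask_llm` (any prompt -> text function)."""
    reply = ask_llm(build_rerank_prompt(drug, genes))
    items = json.loads(reply)
    if not isinstance(items, list):
        items = []

    known = set(genes)
    seen = set()
    ranked = []
    for item in items:
        if not isinstance(item, dict):
            continue
        gene = item.get("gene")
        # Ignore genes the model invented and any repeated entries.
        if gene in known and gene not in seen:
            ranked.append({"gene": gene,
                           "explanation": str(item.get("explanation", ""))})
            seen.add(gene)

    # Genes the model omitted keep their original relative order at the end.
    for gene in genes:
        if gene not in seen:
            ranked.append({"gene": gene, "explanation": ""})
    return ranked
```

Any function that takes a prompt string and returns the model's text can be supplied as `ask_llm`. Filtering the reply against the original candidate list guards against genes that the model adds or drops.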

Repository structure

Please note that several folders are generated when you run the pipeline below, including the embeddings, models and results folders. The data folder is also modified.

Experiment pipelines

  1. Run scripts in scripts/0_process_pharmgkb:
  2. Run scripts in scripts/1_preparation:
  3. Run scripts in scripts/2_pairs to
  4. Run scripts in scripts/3_llm_assessment to rerank drug-gene pair predictions based on large language models.
  5. Run scripts in scripts/4_post_analyses to produce analyses of the results, including statistics and plots.
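
The stage order above can be turned into a simple run plan. The following sketch only lists what it finds and does not execute anything. The stage folder names come from the steps above; the `plan_pipeline` helper, the assumption that each stage contains `.py` scripts, and the assumption that scripts within a stage run in filename order are all hypothetical, so check each folder before relying on that order.

```python
from pathlib import Path
from typing import Dict, List

# Stage folders in the order given in the steps above.
STAGES = [
    "0_process_pharmgkb",
    "1_preparation",
    "2_pairs",
    "3_llm_assessment",
    "4_post_analyses",
]


def plan_pipeline(scripts_dir: Path) -> Dict[str, List[Path]]:
    """List the Python scripts of each stage, sorted by filename.

    Stages whose folder is missing map to an empty list.
    """
    plan = {}
    for stage in STAGES:
        stage_dir = scripts_dir / stage
        plan[stage] = sorted(stage_dir.glob("*.py")) if stage_dir.is_dir() else []
    return plan


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for stage, scripts in plan_pipeline(Path("scripts")).items():
        print(f"# {stage}")
        for script in scripts:
            print(f"python {script}")
```

Printing the plan first makes it easy to confirm the order before running anything, which matters here because the pipeline modifies the data folder.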

License

The code in this repository is licensed under the GNU General Public License v3.0. The data comes from public repositories and is subject to the licenses stated by the original data producers.

About Us

The Ersilia Open Source Initiative is a non-profit organization (1192266) whose mission is to equip labs, universities and clinics in low- and middle-income countries (LMICs) with AI/ML tools for infectious disease research.

Help us achieve our mission or volunteer with us!