EBIvariation / opentargets-pharmgkb

Pipeline to provide evidence strings for Open Targets from PharmGKB
Apache License 2.0

opentargets-pharmgkb


Installation

The pipeline only requires Python 3.8+. Clone the repository (or download a tagged release) and run python setup.py install.

(EVA users currently have to run the deployment script manually, pending automated deployment.)

How to run

1. Set up the environment

For EVA, log on to the Codon SLURM cluster and become the EVA production user, then refer to the private repository for the values of the variables below.

# The directory where subdirectories for each batch will be created
export BATCH_ROOT_BASE=

# Code location where repository is cloned
export CODE_ROOT=

# Path to GRCh38 RefSeq FASTA file
export FASTA_PATH=

2. Download data

# Year and month for the upcoming Open Targets release.
# For example, if you're processing data for the “20.02” release, set this variable to `2020-02`.
export OT_RELEASE=YYYY-MM

# Create directory structure for holding all files for the current batch.
export BATCH_ROOT=${BATCH_ROOT_BASE}/batch-${OT_RELEASE}
export DATA_DIR=${BATCH_ROOT}/data
mkdir -p ${BATCH_ROOT} ${DATA_DIR}
cd ${BATCH_ROOT}

# Download data
wget https://api.pharmgkb.org/v1/download/file/data/clinicalAnnotations.zip
wget https://api.pharmgkb.org/v1/download/file/data/variants.zip
wget https://api.pharmgkb.org/v1/download/file/data/relationships.zip

unzip -j clinicalAnnotations.zip "*.tsv" -d $DATA_DIR
unzip -j clinicalAnnotations.zip "CREATED*.txt" -d $DATA_DIR
unzip -j variants.zip "*.tsv" -d $DATA_DIR
unzip -j relationships.zip "*.tsv" -d $DATA_DIR
rm clinicalAnnotations.zip variants.zip relationships.zip

# Set the created date
export CREATED_DATE=`ls $DATA_DIR/CREATED*.txt | sed 's/.*\([0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}\).*/\1/'`
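
The date extraction above can also be sketched in Python, which may be easier to test than the sed expression. This is only an illustration: the function name `extract_created_date` and the file name used in the usage note are hypothetical, and the sketch takes the first date-shaped substring in the file name, whereas the greedy sed pattern above takes the last one in the whole path.

```python
import re
from typing import Optional

# Matches a YYYY-MM-DD date anywhere in a string.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def extract_created_date(file_name: str) -> Optional[str]:
    """Return the first YYYY-MM-DD substring of file_name, or None if there is none."""
    match = DATE_PATTERN.search(file_name)
    return match.group(0) if match else None
```

For a file named, say, `CREATED_2023-08-05.txt` (a made-up name), this would return the string `2023-08-05`, which is the form expected by `--created-date`.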

3. Run the pipeline

generate_evidence.py --data-dir $DATA_DIR --fasta $FASTA_PATH --created-date $CREATED_DATE --output-path evidence.json

# One-liner for EVA on SLURM
sbatch -t 02:00:00 --mem=8G -J pharmgkb-evidence -o pharmgkb-evidence.out -e pharmgkb-evidence.err \
  --wrap="${CODE_ROOT}/env/bin/python ${CODE_ROOT}/bin/generate_evidence.py --data-dir $DATA_DIR --fasta $FASTA_PATH --created-date $CREATED_DATE --output-path evidence.json"

4. Manual follow-up actions

Update the metrics spreadsheet based on the output of the pipeline.

The evidence string file (evidence.json) must be uploaded to the pharmacogenomics folder in the Open Targets Google Cloud Storage, named in the format cttv012-[yyyy]-[mm]-[dd].json.gz (e.g. cttv012-2020-10-21.json.gz).
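
A minimal sketch of preparing the file for upload, assuming the `.json.gz` suffix means plain gzip compression. The helper names `upload_file_name` and `compress_for_upload` are hypothetical, and which date goes into the name is left to the caller because it is not specified here. The upload itself is not covered.

```python
import gzip
import shutil
from datetime import date
from pathlib import Path


def upload_file_name(file_date: date) -> str:
    """Build a name of the form cttv012-[yyyy]-[mm]-[dd].json.gz."""
    return f"cttv012-{file_date:%Y-%m-%d}.json.gz"


def compress_for_upload(evidence_path, file_date: date, out_dir=".") -> Path:
    """Gzip the evidence file into out_dir under the upload name and return the new path."""
    target = Path(out_dir) / upload_file_name(file_date)
    with open(evidence_path, "rb") as src, gzip.open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target
```

For example, `compress_for_upload("evidence.json", date(2020, 10, 21))` would write `cttv012-2020-10-21.json.gz` to the current directory and leave the original file untouched.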

Once the upload is complete, send an email to Open Targets (data [at] opentargets.org) containing the following information:

Schema documentation

Unless otherwise mentioned, data is taken directly from PharmGKB.

| Field | Description | Example |
| --- | --- | --- |
| datasourceId | Identifier for data source | "pharmgkb" |
| datasourceVersion | Date when data dump was generated, formatted YYYY-MM-DD | "2023-08-05" |
| datatypeId | Type of data corresponding to this evidence string (currently only clinical annotation) | "clinical_annotation" |
| studyId | Clinical Annotation ID | "1449309937" |
| evidenceLevel | Level of evidence (see here) | "1A" |
| literature | List of PMIDs associated with this clinical annotation | ["11389482", "27857962"] |
| genotypeId | VCF-style (chr_pos_ref_allele1,allele2) identifier of genotype; computed as described below | "19_38499645_GGAG_G,GGAG" |
| variantRsId | RS ID of variant | "rs121918596" |
| variantFunctionalConsequenceId | Sequence Ontology term, from VEP | "SO_0001822" |
| targetFromSourceId | Ensembl stable gene ID, from VEP (rsIDs) or PGKB mapped through BioMart (named alleles) | "ENSG00000196218" |
| genotype | Genotype or allele string | SNP "TA", indel "del/GAG", repeat "(CA)16/(CA)17", named allele "*6" |
| genotypeAnnotationText | Full annotation string for genotype or allele | "Patients with the rs121918596 del/GAG genotype may develop malignant hyperthermia when treated with volatile anesthetics [...]" |
| directionality | Allele function annotation (see Table 2 here) | "Decreased function" |
| haplotypeId | Name of haplotype; can be an allele or a genotype | "CYP2B6*6" or "GSTT1 non-null/non-null" |
| haplotypeFromSourceId | Internal PGKB identifier for the haplotype | "PA165818762" |
| drugs | List of drugs (see below) | [{"drugFromSource": "ivacaftor"}, {"drugFromSource": "lumacaftor"}] |
| pgxCategory | Pharmacogenomics phenotype category | "toxicity" |
| phenotypeText | Phenotype name | "Malignant Hyperthermia" |
| phenotypeFromSourceId | EFO ID of phenotype, mapped through ZOOMA / OXO | "Orphanet_423" |

Example

Below is an example of a complete clinical annotation evidence string:

{
  "datasourceId": "pharmgkb",
  "datasourceVersion": "2023-08-05",
  "datatypeId": "clinical_annotation",
  "studyId": "1449309937",
  "evidenceLevel": "1A",
  "literature": [
    "11389482",
    "27857962"
  ],
  "genotypeId": "19_38499645_GGAG_G,GGAG",
  "variantRsId": "rs121918596",
  "variantFunctionalConsequenceId": "SO_0001822",
  "targetFromSourceId": "ENSG00000196218",
  "genotype": "del/GAG",
  "genotypeAnnotationText": "Patients with the rs121918596 del/GAG genotype may develop malignant hyperthermia when treated with volatile anesthetics (desflurane, enflurane, halothane, isoflurane, methoxyflurane, sevoflurane) and/or succinylcholine as compared to patients with the GAG/GAG genotype. Other genetic or clinical factors may also influence the risk for malignant hyperthermia.",
  "drugs": [
    {"drugFromSource": "succinylcholine"}
  ],
  "pgxCategory": "toxicity",
  "phenotypeText": "Malignant Hyperthermia",
  "phenotypeFromSourceId": "Orphanet_423"
}

Other examples can be found in the tests, though keep in mind these may not represent real data.
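
When inspecting output by hand, it can help to check that an evidence string only uses the fields listed in the schema documentation. The sketch below does only that. It is not the pipeline's own validation, the helper name `undocumented_fields` is hypothetical, and it makes no claim about which fields are mandatory.

```python
import json
from typing import List

# Field names copied from the schema documentation table.
DOCUMENTED_FIELDS = {
    "datasourceId", "datasourceVersion", "datatypeId", "studyId",
    "evidenceLevel", "literature", "genotypeId", "variantRsId",
    "variantFunctionalConsequenceId", "targetFromSourceId", "genotype",
    "genotypeAnnotationText", "directionality", "haplotypeId",
    "haplotypeFromSourceId", "drugs", "pgxCategory", "phenotypeText",
    "phenotypeFromSourceId",
}


def undocumented_fields(evidence_json: str) -> List[str]:
    """Return the sorted keys of one evidence string that the schema table does not list."""
    evidence = json.loads(evidence_json)
    return sorted(set(evidence) - DOCUMENTED_FIELDS)
```

An empty list means every key in the evidence string is documented; any names returned are worth a closer look.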

Variant coordinate algorithm

graph TD
    J[PharmGKB]
    H[FASTA files]
    E[Clinical alleles table]
    A[Variant table]    
    D[Generate 'chr_pos_ref_allele1,allele2' identifier]
    S[NCBI Genome Assembly]
    J --> A
    J --> E
    S --> H
    A --> |locations 'Chr+Pos'| D
    H --> |Reference + context| D
    E --> |Alternate alleles| D
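
The final step of the diagram, building the 'chr_pos_ref_allele1,allele2' identifier, might look roughly like the sketch below. It rests on several assumptions that are not confirmed by this README: the position refers to the first base of the reference allele, a deleted allele is written as "del", and indels are anchored on the preceding reference base as in VCF. The function name and the in-memory `sequence` argument (standing in for a FASTA lookup) are hypothetical, and the pipeline's actual implementation may differ.

```python
from typing import Sequence


def vcf_style_genotype_id(chrom: str, pos: int, ref: str, alleles: Sequence[str],
                          sequence: str, seq_start: int = 1) -> str:
    """Build a chr_pos_ref_allele1,allele2 identifier.

    pos is the 1-based coordinate of the first base of ref (assumption).
    sequence is reference sequence covering the region; seq_start is the
    1-based coordinate of sequence[0].
    """
    offset = pos - seq_start
    if sequence[offset:offset + len(ref)] != ref:
        raise ValueError("ref allele does not match the reference sequence")

    # Treat "del" as an empty allele (assumed PharmGKB notation).
    cleaned = ["" if allele == "del" else allele for allele in alleles]

    # Any allele whose length differs from ref makes this an indel, which
    # is anchored on the base immediately before ref (VCF convention).
    if any(len(allele) != len(ref) for allele in cleaned):
        if offset < 1:
            raise ValueError("no context base available before ref")
        context_base = sequence[offset - 1]
        pos -= 1
        ref = context_base + ref
        cleaned = [context_base + allele for allele in cleaned]

    return f"{chrom}_{pos}_{ref}_{','.join(cleaned)}"
```

With a toy sequence `ACGGAGTT` starting at coordinate 100, the call `vcf_style_genotype_id("19", 103, "GAG", ["del", "GAG"], "ACGGAGTT", 100)` yields `19_102_GGAG_G,GGAG`, which has the same shape as the genotypeId in the example evidence string. The coordinates here are invented for illustration.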

Drug representation

The drugs property is a list of structs with 2 keys:

Lists of drugs are kept together (rather than exploded into separate evidence strings) when they're known to be annotated as a drug combination. Currently this is only when they're /-separated and associated with a single PGKB chemical ID, as in ivacaftor / lumacaftor.
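
The grouping rule above can be sketched as follows. This is an illustration under assumptions rather than the pipeline's code: the function name `group_drugs` and the input shape (pairs of a drug name and its PGKB chemical IDs) are hypothetical, the chemical IDs in the usage note are placeholders, and only the `drugFromSource` key shown in the example evidence string is emitted.

```python
from typing import Dict, List, Sequence, Tuple

Drug = Dict[str, str]


def group_drugs(entries: Sequence[Tuple[str, Sequence[str]]]) -> List[List[Drug]]:
    """Return one list of drug structs per evidence string.

    Each entry is (drug name as annotated, PGKB chemical IDs for it).
    A /-separated name with exactly one chemical ID is kept together as
    a combination; everything else is split into single-drug lists.
    """
    groups: List[List[Drug]] = []
    for name, chemical_ids in entries:
        parts = [part.strip() for part in name.split("/") if part.strip()]
        if len(parts) > 1 and len(chemical_ids) == 1:
            # Known drug combination: keep in a single evidence string.
            groups.append([{"drugFromSource": part} for part in parts])
        else:
            # Otherwise each drug goes into its own evidence string.
            groups.extend([{"drugFromSource": part}] for part in parts)
    return groups
```

Calling `group_drugs([("ivacaftor / lumacaftor", ["ID_A"])])` gives a single group containing both drugs, while `group_drugs([("desflurane", ["ID_B"]), ("halothane", ["ID_C"])])` gives two groups of one drug each.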