ersilia-os / pharmacogx-embeddings

Pharmacogenomics knowledge graph embeddings and related analyses
GNU General Public License v3.0

Deduplicate haplotypes #12

Closed GemmaTuron closed 1 year ago

GemmaTuron commented 1 year ago

Use the information on each haplotype collected in the separate files to deduplicate the haplotypes in the clinical annotation, clinical variant, etc. files (i.e. CYP3A4*1 corresponds to rs..., rs..., etc.)

GemmaTuron commented 1 year ago

We have collected all the haplotypes in the tables clinical_annotation, clinical_variant, var_drug_ann, var_pheno_ann and relationships, and downloaded the corresponding gene_allele_definition_tables (some were in old formats and had to be manually curated). These have been processed into one file per gene. The PharmGKB haplotype ID (hid) has been retrieved from the PharmGKB API, since many haplotypes did not appear in relationships.csv, where hids could be found. Next steps include using this updated haplotype table to parse the clinical annotation and other files where variant and haplotype have been deconvoluted.

GemmaTuron commented 1 year ago

I have identified some "orphan" variants that did not appear in the variants.csv file or in any of the haplotype files. These have been collected under orphan_vars and will need to be manually incorporated into variants.csv.
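A minimal sketch of how such orphans could be detected, assuming the variants referenced in the annotation tables, the entries of variants.csv, and the per-gene haplotype files have already been loaded into plain Python collections. The function name and the toy identifiers below are made up for illustration and are not taken from the repository:

```python
def find_orphan_variants(referenced, known_variants, haplotype_variants):
    """Return variants that are referenced in the annotation tables but
    appear neither in variants.csv nor in any haplotype definition."""
    known = set(known_variants)
    # Every variant listed under any haplotype also counts as known.
    for variants in haplotype_variants.values():
        known.update(variants)
    return sorted(set(referenced) - known)


# Hypothetical toy inputs (placeholder identifiers, not real PharmGKB data).
referenced = ["rs0000001", "rs0000002", "rs0000003"]
known_variants = ["rs0000001"]                      # stands in for variants.csv
haplotype_variants = {"GENE1*2": ["rs0000002"]}     # stands in for the haplotype files

orphans = find_orphan_variants(referenced, known_variants, haplotype_variants)
print(orphans)  # ['rs0000003']
```

Using a set difference keeps the check independent of the order in which the tables are read.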

Once this is done, we will work on adding the rsID (when available) and variant_id to each variant in each haplotype. This way, the variant info will be easily accessible via the PharmGKB API. This will be retrieved from the following search on the PharmGKB API

GemmaTuron commented 1 year ago

This commit obtains the variant ID and the gene of the orphan variants, and collates them into a file named variant_complete.csv. For some variants, there is only a variant annotation but they are not recorded as a variant in PharmGKB, so there is no variant ID associated. Whenever possible, I have retrieved the gene and gene ID for these cases, like:

CYP2A6 low activity,,CYP2A6,PA121

Also, some variants are not associated with any gene, like:

rs2110179,PA166280082
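As an illustration of how rows with missing fields could be written consistently, here is a small sketch using the standard csv module. The column names (variant, vid, gene, gid) are an assumption inferred from the example rows above, not the confirmed header of variant_complete.csv, and with a fixed four-column layout a variant without a gene gets two trailing empty fields:

```python
import csv
import io

# Assumed column names; the real header of variant_complete.csv may differ.
FIELDS = ["variant", "vid", "gene", "gid"]


def collate_rows(records):
    """Serialise records to CSV text, leaving missing fields empty."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=FIELDS, restval="", lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()


records = [
    # Annotation without a PharmGKB variant ID: vid stays empty.
    {"variant": "CYP2A6 low activity", "gene": "CYP2A6", "gid": "PA121"},
    # Variant without an associated gene: gene and gid stay empty.
    {"variant": "rs2110179", "vid": "PA166280082"},
]

csv_text = collate_rows(records)
print(csv_text)
```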

GemmaTuron commented 1 year ago

The haplotypes are deduplicated to their respective variants in the "final_tables" files. Haplotypes for which no variants are available have not been considered; for these, only the gene and its link to a drug have been kept (for example, GSTM1 and GSTT1, which have null versions of the gene).
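The expansion described above can be sketched roughly as follows. This is not the code used for the "final_tables" files; the function name, the row layout (gene, haplotype, drug) and the toy identifiers are all assumptions made for illustration:

```python
def expand_haplotypes(annotations, haplotype_to_variants):
    """Replace each haplotype by its variants, dropping duplicate rows.

    Haplotypes with no known variants are reduced to a gene-drug link,
    represented here with variant set to None.
    """
    seen = set()
    expanded = []
    for row in annotations:
        variants = haplotype_to_variants.get(row["haplotype"], [])
        # Fall back to a single gene-level row when no variants are defined.
        for variant in variants or [None]:
            key = (row["gene"], variant, row["drug"])
            if key not in seen:
                seen.add(key)
                expanded.append(
                    {"gene": row["gene"], "variant": variant, "drug": row["drug"]}
                )
    return expanded


# Hypothetical toy data (placeholder genes, variants and drug).
haplotype_to_variants = {
    "GENE1*2": ["rsA", "rsB"],
    "GENE1*3": ["rsB"],          # shares rsB with GENE1*2
}
annotations = [
    {"gene": "GENE1", "haplotype": "GENE1*2", "drug": "drugX"},
    {"gene": "GENE1", "haplotype": "GENE1*3", "drug": "drugX"},
    {"gene": "GENE2", "haplotype": "GENE2 null", "drug": "drugX"},
]

final_rows = expand_haplotypes(annotations, haplotype_to_variants)
for row in final_rows:
    print(row)
```

In this toy input, two haplotypes of the same gene share a variant, so the shared variant appears only once per gene-drug pair, and the haplotype without variants is kept as a gene-level row.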