Variant identifiers and variant pipeline data

miquelduranfrigola commented 1 year ago

Hi @GemmaTuron and @AnnaMontaner

Here are some thoughts about variant identifiers, some of which we have already discussed.

Recap

The goal is to have three inputs for the tri-modal neural network (TNN):

Compounds
Genes
Variants

These three inputs are numerical. Therefore, we need a way to encode each entity (a compound, a gene, a variant) in a numerical format. To get this numerical information, we need to collect or calculate properties of the entities. For this reason, the identifiers we use must be external to PharmGKB and as widely accepted as possible, in order to maximize our chances of finding information for each entity.

Compounds

For compounds, we are very well covered with the Chemical Checker and other chemoinformatics tools that we have at Ersilia:

Primary identifier: SMILES string
Vectors: Chemical Checker signatures, Grover embeddings, etc.

Genes

For genes, we are also very well covered thanks to the Bioteque and recent sequence embedding technologies such as ESM-1b:

Primary identifier: Gene name (or UniProt Accession Code)

Variants

For variants, we still have not decided what is a good identifier, and what is a good numerical vector to represent those variants as inputs for the model. Below I write up a few ideas. Let's use this issue thread to come up with a strategy.

Haplotypes and variants

As discussed, PharmGKB annotates both haplotypes and variants. Typically, a haplotype can be deconvoluted into multiple (n) variants. As far as I know, haplotype-variant pairs from PharmGKB cannot be directly downloaded and need to be accessed online, for example through an Allele Definition table. Therefore, given a compound-gene-haplotype association, we must expand it to n compound-gene-variant associations.
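The expansion described above could look roughly like the sketch below. It assumes the haplotype-to-variant mapping has already been collected into a dictionary; the `HAPLOTYPE_TO_VARIANTS` table, the `expand_haplotypes` function, and all identifiers in it are made-up placeholders, not real PharmGKB data.

```python
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]

# Hypothetical mapping, e.g. collected from allele definition tables.
# All names and identifiers here are illustrative placeholders.
HAPLOTYPE_TO_VARIANTS: Dict[str, List[str]] = {
    "GENE_A*2": ["rs0000001", "rs0000002"],
    "GENE_A*3": ["rs0000003"],
}


def expand_haplotypes(
    associations: List[Triplet], mapping: Dict[str, List[str]]
) -> List[Triplet]:
    """Expand compound-gene-haplotype triplets into compound-gene-variant triplets.

    Haplotypes missing from the mapping contribute no rows, and repeated
    triplets are emitted only once (first occurrence wins).
    """
    expanded: List[Triplet] = []
    seen = set()
    for compound, gene, haplotype in associations:
        for variant in mapping.get(haplotype, []):
            triplet = (compound, gene, variant)
            if triplet not in seen:
                seen.add(triplet)
                expanded.append(triplet)
    return expanded


example = expand_haplotypes(
    [("drugX", "GENE_A", "GENE_A*2"), ("drugX", "GENE_A", "GENE_A*3")],
    HAPLOTYPE_TO_VARIANTS,
)
```

In practice we would probably also want to log the haplotypes that have no mapping, so that we know how much data is lost at this step.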

Variants and alleles

One variant, for example specified with a dbSNP identifier, may have multiple alleles. In my opinion, unless we find a strong need for it, we do not want to go into the allele level of annotation. Therefore, when data comes from PharmGKB, if multiple alleles are associated with the same compound-gene-variant triplet, we deduplicate them. In case different phenotypes/outcomes are associated with different alleles, we can either:

1. Keep multiple compound-gene-variant-outcome quadruplets, or
2. Keep the compound-gene-variant-outcome quadruplet with the highest evidence, defined by (a) being significant, (b) level of evidence (1, 2, 3...), etc.

Most likely, we should go for option 1 above.
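To make the two options concrete, here is a minimal sketch of both deduplication strategies. The record fields (`allele`, `outcome`, `significant`, `evidence_level`) and the ranking rule (significant first, then lower evidence level is stronger) are assumptions for illustration; the real PharmGKB columns and evidence levels may need a different mapping.

```python
from typing import Dict, List, Sequence

TRIPLET = ("compound", "gene", "variant")
QUADRUPLET = ("compound", "gene", "variant", "outcome")


def evidence_rank(record: Dict) -> tuple:
    """Smaller is better: significant records first, then lower evidence level.

    Assumes `evidence_level` is an integer where 1 is the strongest.
    """
    return (not record["significant"], record["evidence_level"])


def deduplicate(records: List[Dict], key_fields: Sequence[str]) -> List[Dict]:
    """Keep the best-supported record per key and drop the allele column."""
    best: Dict[tuple, Dict] = {}
    for record in records:
        key = tuple(record[field] for field in key_fields)
        if key not in best or evidence_rank(record) < evidence_rank(best[key]):
            best[key] = record
    return [{k: v for k, v in r.items() if k != "allele"} for r in best.values()]


# Made-up records: three alleles of one variant, with two distinct outcomes.
base = {"compound": "drugX", "gene": "GENE_A", "variant": "rs0000001"}
records = [
    {**base, "allele": "A", "outcome": "toxicity", "significant": True, "evidence_level": 3},
    {**base, "allele": "G", "outcome": "toxicity", "significant": True, "evidence_level": 1},
    {**base, "allele": "T", "outcome": "efficacy", "significant": False, "evidence_level": 2},
]

option_1 = deduplicate(records, QUADRUPLET)  # one row per outcome
option_2 = deduplicate(records, TRIPLET)     # single best-supported row
```

With these toy records, option 1 keeps both the toxicity and the efficacy rows, while option 2 keeps only the toxicity row with evidence level 1.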

Primary variant identifiers

The choice of variant identifier will depend on:

Identifiers used in PharmGKB (@GemmaTuron)
Identifiers used in African genomics databases (@AnnaMontaner)

Ideally, we would have a common identifier, although this is not mandatory, as long as we have a way to use the identifiers as inputs for the variant vectorization/featurization pipeline (see below).

PharmGKB often uses dbSNP identifiers, but not always. We should quantify the coverage of dbSNP identifiers.

Genomic positions can also be used as an identifier.
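As a starting point for the coverage question, something like the sketch below could classify each identifier and report the fraction of each kind. The `chr<name>:<position>` format for genomic positions is an assumption made for illustration; the actual format in PharmGKB or in the African genomics databases may differ, and the example identifiers are placeholders.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# dbSNP reference SNP identifiers look like "rs" followed by digits.
RSID_PATTERN = re.compile(r"^rs\d+$")
# Assumed position format for illustration only: "chr<name>:<position>".
POSITION_PATTERN = re.compile(r"^chr(\d{1,2}|X|Y|M|MT):\d+$")


def classify_identifier(identifier: str) -> str:
    """Return 'dbsnp', 'position', or 'other' for a variant identifier."""
    identifier = identifier.strip()
    if RSID_PATTERN.match(identifier):
        return "dbsnp"
    if POSITION_PATTERN.match(identifier):
        return "position"
    return "other"


def identifier_coverage(identifiers: Iterable[str]) -> Dict[str, float]:
    """Fraction of identifiers of each kind (empty dict for empty input)."""
    counts = Counter(classify_identifier(i) for i in identifiers)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("dbsnp", "position", "other")}


coverage = identifier_coverage(
    ["rs0000001", "rs0000002", "chr1:12345", "GENE_A*2"]
)
```

Anything that falls in the `other` bucket (e.g. star alleles) would need to go through the haplotype-to-variant expansion first.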

Variant vectors/features

Given a variant identifier, we must define a pipeline to obtain numerical representations of the variant. Typically, this corresponds to calculated or annotated features for the variants. @AnnaMontaner already has a pipeline in place that produces, given a genomic position, a wide array of annotations/calculations. I would suggest that we start from here.

Current variant featurization pipeline

A key question for @AnnaMontaner would be: does your pipeline need allele-level information (probably yes), or can it deal somehow with multiple alleles at the same time, given a genomic position? This will determine which level of granularity we keep for the variants.

In any case, the table contains interesting information, such as predicted functional impact (e.g. with PolyPhen) and effect (missense, etc.). All of these, be they numerical or categorical, can eventually be used for vectorization of the variant.

How do we vectorize the current features table?

The easiest way to vectorize the current table by @AnnaMontaner, as is, would be to use an autoencoder. The idea is simple and similar to PCA: we go from n mixed-type (categorical, numerical) columns (i.e. the original table) to m numerical dimensions, where m < n, typically m = 16, 32, 64, 128, 256, or 512. Most probably, we will go for m = 16 or 32.
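Before any autoencoder can be trained, the mixed-type table has to become a purely numerical matrix. The sketch below shows only that preprocessing step (z-scoring numerical columns and one-hot encoding categorical ones), not the autoencoder itself. The column names `polyphen_score` and `effect` are hypothetical and do not necessarily match the actual pipeline output.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def encode_table(
    rows: List[Dict], numerical: Sequence[str], categorical: Sequence[str]
) -> Tuple[List[str], List[List[float]]]:
    """Turn mixed-type rows into (header, matrix) with numerical values only."""
    # Mean and standard deviation per numerical column (constant columns get
    # a standard deviation of 1.0 to avoid dividing by zero).
    stats = {}
    for col in numerical:
        values = [row[col] for row in rows]
        stats[col] = (mean(values), pstdev(values) or 1.0)

    # Sorted category levels per categorical column, for a stable column order.
    levels = {col: sorted({row[col] for row in rows}) for col in categorical}

    header = list(numerical) + [
        f"{col}={level}" for col in categorical for level in levels[col]
    ]
    matrix = []
    for row in rows:
        vector = [(row[col] - stats[col][0]) / stats[col][1] for col in numerical]
        for col in categorical:
            vector.extend(1.0 if row[col] == level else 0.0 for level in levels[col])
        matrix.append(vector)
    return header, matrix


# Made-up rows with hypothetical column names.
rows = [
    {"polyphen_score": 0.9, "effect": "missense"},
    {"polyphen_score": 0.1, "effect": "synonymous"},
    {"polyphen_score": 0.5, "effect": "missense"},
]
header, matrix = encode_table(rows, ["polyphen_score"], ["effect"])
```

Note that after one-hot encoding the matrix is wider than the original n columns; the autoencoder would take this matrix as input and compress it to the m-dimensional embedding.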

To train this autoencoder, we need a table that is as big as possible. Therefore, an immediate question for @AnnaMontaner would be: how long does it take for the pipeline to finish? How many variants can we feasibly consider? Ideally, we would have the table ready for all variants observed in all genes available in PharmGKB.

Next steps

Based on all the previous information, I would suggest the following next steps:

Get all variants from PharmGKB (@GemmaTuron)
Run the variant featurization pipeline for these variants (@AnnaMontaner)
Build an autoencoder based on the variant features table (@miquelduranfrigola)

How does this sound?

GemmaTuron commented 1 year ago

I am still working on the variant part; see #12, and also keep an eye on #16.

miquelduranfrigola commented 1 year ago

These are some links that could be interesting:

https://github.com/PharmGKB/vcf-parser
https://pharmgkb.blogspot.com/2021/09/pharmcat-version-10-released.html
https://pharmcat.org/using/Calling-HLA/
https://samtools.github.io/hts-specs/VCFv4.1.pdf

miquelduranfrigola commented 1 year ago

I think we are ready to close this issue.

Haplotypes have been mapped to variants (thanks @GemmaTuron)
Multiple encoding/embedding techniques have been developed for variants, mostly based on SnpEff (thanks @AnnaMontaner)

ersilia-os / pharmacogx-embeddings