ersilia-os / pharmacogx-embeddings

Pharmacogenomics knowledge graph embeddings and related analyses
GNU General Public License v3.0

Select a method to encode variants #17

Closed miquelduranfrigola closed 6 months ago

miquelduranfrigola commented 12 months ago

We are currently exploring two methods for encoding variants:

SnpEff & encoding techniques

This is an in-house method where SnpEff is run by @AnnaMontaner, producing a tabular output with ~150 columns, which is subsequently autoencoded into an embedding. An exploration of this methodology can be found in this notebook, and a more detailed explanation of the data can be found in this README file.

The remaining steps are:

SNP2Vec

This recently-published method aims at pre-training SNP representations. There is code available in a GitHub repository, although it doesn't seem to provide pre-trained models. Next steps are:

miquelduranfrigola commented 12 months ago

I have come up with yet another way of embedding variants, although I would need some help.

At least in our working subset of 1000 Genomes, many of the variants are missense mutations for which an HGVS p. notation is provided (e.g., p.Ile40Thr). This allows us, in principle, to obtain "residue-level" embeddings from proteins, which represent the "context" of that particular protein sequence position. I have found out that UniProt already provides pre-calculated embeddings, which simplifies the problem greatly: https://www.uniprot.org/help/downloads#embeddings
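To make the idea concrete, here is a minimal sketch of how a three-letter HGVS p. string such as p.Ile40Thr could be split into reference residue, position and alternate residue. The function name `parse_missense` and the lookup table are hypothetical, not part of SnpEff or of this repository, and the pattern only covers simple substitutions (no stop gains, frameshifts or synonymous changes).

```python
import re

# Standard three-letter to one-letter amino acid codes (20 canonical residues).
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Matches simple substitutions only, e.g. "p.Ile40Thr".
_MISSENSE_RE = re.compile(r"^p\.([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})$")


def parse_missense(hgvs_p):
    """Return (ref_aa, position, alt_aa) for a missense HGVS p. string.

    Hypothetical helper: returns None for anything that is not a simple
    substitution between two different canonical amino acids.
    """
    match = _MISSENSE_RE.match(hgvs_p.strip())
    if match is None:
        return None
    ref3, pos, alt3 = match.groups()
    if ref3 not in THREE_TO_ONE or alt3 not in THREE_TO_ONE:
        return None  # e.g. "Ter" (stop gained) is not a missense change
    if ref3 == alt3:
        return None  # same residue on both sides, not a missense change
    return THREE_TO_ONE[ref3], int(pos), THREE_TO_ONE[alt3]


print(parse_missense("p.Ile40Thr"))
```

Variants for which the helper returns None would simply be left out of the residue-level embedding step.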

What we need, though, is the UniProt AC and sequence position (in UniProt, not in Ensembl, although presumably they are the same) for each missense variant.

@AnnaMontaner, do you think this is relatively easy to obtain? On a quick search, based on the current SnpEff output, we could in principle access Ensembl and then, from there, retrieve UniProt information if available. We should be able to do it programmatically. However, SnpEff itself may already provide UniProt information, which would be safer.
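Whichever route gives us the UniProt AC and position, a cheap sanity check would be to confirm that the reference residue from the HGVS p. notation matches the UniProt sequence at that position before picking the embedding. The sketch below assumes the sequence and the per-residue embeddings have already been loaded into memory; `residue_embedding` and the toy data are hypothetical.

```python
def residue_embedding(sequence, per_residue_embeddings, ref_aa, position):
    """Return the embedding of a residue, or None if the mapping looks wrong.

    Hypothetical helper. `position` is 1-based, as in HGVS p. notation.
    `per_residue_embeddings` is assumed to hold one vector per residue,
    in the same order as `sequence`.
    """
    if len(sequence) != len(per_residue_embeddings):
        raise ValueError("sequence and embeddings differ in length")
    if not 1 <= position <= len(sequence):
        return None  # position falls outside the UniProt sequence
    if sequence[position - 1] != ref_aa:
        return None  # reference residue mismatch: Ensembl and UniProt disagree
    return per_residue_embeddings[position - 1]


# Toy data: a 10-residue sequence with 2-dimensional dummy vectors.
toy_sequence = "MKTIIALSYI"
toy_embeddings = [[float(i), i * 0.5] for i in range(len(toy_sequence))]

print(residue_embedding(toy_sequence, toy_embeddings, "I", 4))
```

Counting how many variants fail the residue check would also tell us how often Ensembl and UniProt positions actually differ.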

What are your thoughts?

miquelduranfrigola commented 6 months ago

We have found that most variants in PharmGKB are intron variants and, therefore, current embedding techniques for variants may not be necessary at this stage. I am closing this issue for now. We may want to reopen it in the future.