VariantEffect / mavedb-api

MaveDB API
GNU Affero General Public License v3.0

Handling differences between target sequences and reference sequences #79

Open afrubin opened 1 year ago

afrubin commented 1 year ago

MaveDB allows users to specify accession numbers from major genomic databases (Ensembl, RefSeq, UniProt) when depositing a target sequence. As we develop a validation framework for these accession numbers, it will be important to handle cases where a target sequence is similar but not identical to the reference sequence.

There are many cases where this is useful. For example, one of the TP53 experiments in MaveDB was performed on a non-reference allele (see: https://mavedb.org/#/experiment-sets/urn:mavedb:00000068). To address this, the target was entered as "TP53 (P72R)" (for example, in https://mavedb.org/#/score-sets/urn:mavedb:00000068-a-1).

If we wanted to associate this target with a transcript from RefSeq we could:

Of these, it seems that the last option is clearly the best one.

We should be able to do this in the API by adding associated VRS objects that describe the differences between the given reference sequence and the target sequence. From there we can build the necessary UI elements to convey this information to the user concisely.
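As a rough illustration of the idea, here is a minimal Python sketch that compares a reference sequence with a target sequence and emits one allele-like record per substitution. The records are plain dicts loosely modeled on a GA4GH VRS Allele; the field names are illustrative and are not validated against any particular VRS version. The function name `substitution_alleles`, the identifier `hypothetical:reference-1`, and the toy sequences are all assumptions made for this example, and the sketch only handles equal-length sequences.

```python
from typing import Dict, List


def substitution_alleles(reference: str, target: str, sequence_id: str) -> List[Dict]:
    """Describe each position where target differs from reference.

    Coordinates are 0-based, half-open (start inclusive, end exclusive).
    Only substitutions are handled; insertions and deletions would need
    a real alignment step first.
    """
    if len(reference) != len(target):
        raise ValueError("this sketch only handles equal-length sequences")

    alleles = []
    for i, (ref_res, tgt_res) in enumerate(zip(reference, target)):
        if ref_res != tgt_res:
            alleles.append(
                {
                    "type": "Allele",
                    "location": {
                        "type": "SequenceLocation",
                        "sequence_id": sequence_id,  # hypothetical identifier
                        "start": i,
                        "end": i + 1,
                    },
                    # The residue present in the target at this location.
                    "state": {
                        "type": "LiteralSequenceExpression",
                        "sequence": tgt_res,
                    },
                }
            )
    return alleles


# Toy sequences (not real TP53 data): one substitution at 0-based index 4.
reference = "MKTAYIAK"
target = "MKTARIAK"
alleles = substitution_alleles(reference, target, "hypothetical:reference-1")
print(alleles)
```

A target such as "TP53 (P72R)" would, under this scheme, carry a single record whose location covers residue 72 of the reference and whose state is the residue found in the target.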

bencap commented 1 month ago

Minimally:

Consider CAT VRS for this as well.

When we implement this, we will also have to review the alignment output to ensure that it separates two tasks: finding the best reference, and reporting the exact differences between the target and the reference that was found.
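The distinction between finding the best reference and reporting the exact differences can be sketched as two separate steps. The example below uses Python's standard-library `difflib.SequenceMatcher` purely as a stand-in for a real alignment tool; the candidate accessions (`hypothetical:ref-A`, `hypothetical:ref-B`), the function names, and the sequences are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def best_reference(target: str, candidates: Dict[str, str]) -> str:
    """Step 1: pick the candidate most similar to the target."""

    def score(accession: str) -> float:
        matcher = SequenceMatcher(None, candidates[accession], target, autojunk=False)
        return matcher.ratio()

    return max(candidates, key=score)


def exact_differences(reference: str, target: str) -> List[Tuple[str, int, int, str]]:
    """Step 2: list every non-matching span against the chosen reference.

    Each item is (operation, ref_start, ref_end, target_sequence), with
    0-based, half-open reference coordinates.
    """
    matcher = SequenceMatcher(None, reference, target, autojunk=False)
    return [
        (tag, i1, i2, target[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical candidates and target; none of these are real accessions.
candidates = {
    "hypothetical:ref-A": "MKTAYIAKQR",
    "hypothetical:ref-B": "MSTNPKPQRK",
}
target = "MKTARIAKQR"

chosen = best_reference(target, candidates)
differences = exact_differences(candidates[chosen], target)
print(chosen, differences)
```

Keeping the two steps apart means a high similarity score is never mistaken for identity: the chosen reference still comes with an explicit list of differences that can be stored and shown to the user.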