biocommons / hgvs

Python library to parse, format, validate, normalize, and map sequence variants. `pip install hgvs`
https://hgvs.readthedocs.io/
Apache License 2.0

CDS phase (frame offset, e.g. for ribosomal slippage) not taken into account in amino acid translation #732

Open davmlaw opened 3 months ago

davmlaw commented 3 months ago

The GFF format has a "phase" column on CDS features (values 0, 1, or 2), which alters the reading frame of exons and therefore the translation to amino acids.

The UTA/DataProvider transcript annotation format does not currently contain this information, so I believe it will need to be added, and the HGVS code then modified to take it into account when converting to p. (similar to how alignment gaps are handled between g. and c.)

Example annotation

from ref_GRCh37.p10_top_level.gff3 (phase is the "1" after the "+"):

NC_000007.13    RefSeq  CDS     94292646        94293825        .       +       1       ID=cds13063;Name=NP_001165908.1;Parent=rna16954;Note=isoform 3 is encoded by transcript variant 2;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;exception=ribosomal slippage;gbkey=CDS;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
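As a rough illustration of where the phase sits in such a record, the sketch below pulls column 8 out of a tab-separated GFF3 line. The function name `parse_cds_phase` is hypothetical and is not part of hgvs, UTA, or cdot; the sample line is the record above with the attribute column shortened, and it assumes the columns are separated by tabs as the GFF3 specification requires.

```python
from typing import Optional


def parse_cds_phase(gff3_line: str) -> Optional[int]:
    """Return the phase (0, 1 or 2) of a GFF3 CDS line, or None for other feature types.

    Hypothetical helper for illustration only; assumes tab-separated columns.
    """
    fields = gff3_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")

    feature_type = fields[2]  # column 3: feature type
    phase = fields[7]         # column 8: phase

    if feature_type != "CDS":
        return None
    if phase not in {"0", "1", "2"}:
        raise ValueError(f"CDS feature has invalid phase {phase!r}")
    return int(phase)


# The PEG10 record from above, with the attributes column shortened.
example_line = "\t".join([
    "NC_000007.13", "RefSeq", "CDS", "94292646", "94293825",
    ".", "+", "1", "ID=cds13063;Name=NP_001165908.1;Parent=rna16954",
])

peg10_phase = parse_cds_phase(example_line)
```

Returning `None` for non-CDS features keeps the helper usable when iterating over a whole file, where most lines are genes, exons, or transcripts with `.` in the phase column.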

Column 8: "phase"

For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end (where the 5' end of the CDS is relative to the strand of the CDS feature) of the current CDS feature. For clarification the 5' end for CDS features on the plus strand is the feature's start and the 5' end for CDS features on the minus strand is the feature's end. The phase is one of the integers 0, 1, or 2, indicating the number of bases forward from the start of the current CDS feature the next codon begins. A phase of "0" indicates that a codon begins on the first nucleotide of the CDS feature (i.e. 0 bases forward), a phase of "1" indicates that the codon begins at the second nucleotide of this CDS feature and a phase of "2" indicates that the codon begins at the third nucleotide of this region. Note that ‘Phase’ in the context of a GFF3 CDS feature should not be confused with the similar concept of frame that is also a common concept in bioinformatics. Frame is generally calculated as a value for a given base relative to the start of the complete open reading frame (ORF) or the codon (e.g. modulo 3) while CDS phase describes the start of the next codon relative to a given CDS feature.

The phase is REQUIRED for all CDS features.
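To make the effect of the phase concrete, here is a minimal sketch of translating a single CDS segment while honouring its phase. It is not how hgvs currently translates and does not use any hgvs API; `translate_with_phase` is a hypothetical name, the codon table is the standard genetic code, and the input is assumed to be the segment's sequence already oriented 5' to 3' on the coding strand (so a minus-strand feature would need to be reverse-complemented first).

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_with_phase(cds_seq: str, phase: int) -> str:
    """Translate one CDS segment, skipping `phase` bases before the first codon.

    Hypothetical helper for illustration; trailing bases that do not form a
    complete codon are ignored.
    """
    if phase not in (0, 1, 2):
        raise ValueError(f"phase must be 0, 1 or 2, got {phase!r}")

    seq = cds_seq.upper()[phase:]      # the next codon starts `phase` bases in
    usable = len(seq) - len(seq) % 3   # drop any incomplete trailing codon
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, usable, 3))


# The same nine bases read in each of the three phases.
segment = "ATGGCCTAA"
by_phase = {p: translate_with_phase(segment, p) for p in (0, 1, 2)}
```

Reading the same bases with a different phase gives an entirely different peptide, which is why ignoring the phase produces a wrong p. result for transcripts such as the PEG10 example.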

This was originally raised by holtgrewe on the cdot project.