The-Sequence-Ontology / Specifications

GFF and GVF specification documents

GFF3: Phase > feature length not clearly defined #2

Open satta opened 8 years ago

satta commented 8 years ago

The GFF3 specification is not really clear about how to treat CDS features of 1bp length that have a phase of 2 defined. According to

The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region.

it is apparently assumed that length >= phase, so that the number of bases indicated by the phase can be 'skipped'. However, we have encountered cases in draft genomes (cf. https://github.com/genometools/genometools/issues/793) where such short CDS features show up. Am I correct in assuming that in such cases the remaining phase shift is supposed to be 'carried forward' to the next CDS?
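To see how often this occurs in a given annotation, a scan like the following can flag the affected lines. This is only a minimal sketch, not code from genometools or any other parser: `find_short_cds` is a hypothetical name, the example records are invented, and it assumes the standard nine tab-separated GFF3 columns (type in column 3, start and end in columns 4 and 5, phase in column 8). It flags CDS lines that violate length >= phase.

```python
def find_short_cds(lines):
    """Yield (line_number, length, phase) for CDS lines where phase > length."""
    for line_number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "CDS":
            continue  # not a well-formed CDS feature line
        if fields[7] not in ("0", "1", "2"):
            continue  # a missing or malformed phase is a separate problem
        # GFF3 coordinates are 1-based and inclusive.
        length = int(fields[4]) - int(fields[3]) + 1
        phase = int(fields[7])
        if phase > length:
            yield line_number, length, phase


# Invented example records: a 1 bp CDS with phase 2, then an ordinary CDS.
example = [
    "##gff-version 3\n",
    "scaffold1\tsketch\tCDS\t100\t100\t.\t+\t2\tID=cds1;Parent=mRNA1\n",
    "scaffold1\tsketch\tCDS\t200\t260\t.\t+\t1\tID=cds1;Parent=mRNA1\n",
]
print(list(find_short_cds(example)))  # [(2, 1, 2)]
```

Only the 1 bp feature on line 2 is reported; whether such a line should be an error or merely tolerated is the open question of this thread.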

keilbeck commented 8 years ago

I am copying Barry in on this thread as he has the most experience with this kind of annotation. I am concerned about an exon of 1. Does this really happen?

lucventurini commented 8 years ago

Dear Ms. Beck, unfortunately it does happen in annotations of draft genomes. The short exon might be due to errors in the genomic assembly (I am working with species whose genomes are quite complex to assemble). I do agree that it looks odd and that it is a corner case, but it is still something that happens.

satta commented 8 years ago

I agree. IMHO this is more of a question of specification completeness (promoting development of more robust parsers by avoiding undefined behaviour) than a question of biological domain relevance.

barrymoore commented 6 years ago

Hi all, My alerts from GitHub for this project were getting triaged in my mail, so I'm slow to comment. In my opinion parsers should throw a fatal error if phase points outside of the current CDS. I agree that we need to support any length of CDS even though they may not be biologically true, but I think it's a bit much to ask both the spec and parsers to allow the phase to 'roll over' to the next exon. I'd be happy to add a sentence to the spec indicating this if there is agreement.

satta commented 6 years ago

Thanks! I agree, there needs to be some agreement one way or the other. As soon as there's an addition to the spec I'd be happy to address it in our parser. Could you ping this thread once that's done? Thanks!

barrymoore commented 6 years ago

@keilbeck are you OK with updating the spec indicating that it is invalid to have a Phase value that points outside of the overall CDS feature?

lucventurini commented 6 years ago

Dear @barrymoore, I understand your concern, but I disagree. Many genome sequences are not of reference quality but are fragmented into many scaffolds, which breaks coding sequences across multiple contigs (hence a starting or ending 1bp exon). In other cases, the complexity of the region results in an indel, leading to an internal 1bp exon which is "real": the error is not in the annotation but in the underlying genomic assembly.

In this kind of situation, it is important for the GFF to convey this kind of information, both for completeness, and for diagnostics purposes. I would argue therefore for the inclusion of such cases in the specifications.

barrymoore commented 6 years ago

Hi all, I agree with @lucventurini that this can happen frequently in draft genomes (and even occasionally in reference genomes). I'm not suggesting that the spec prohibit 1-2 bp exons or CDSs, but rather thinking about how parsers should behave when the Phase points to a position beyond the boundary of the exon. Actually, I think this is easier to handle than I first suggested above. Phase tells us how many nts to skip to reach the next codon. The 'skipped' nts are concatenated to a previous codon (if we're building a mature transcript sequence), so the rule for the parser should simply be that the Phase gives you the max (rather than exact) number of nts to concatenate to the previous codon before you start a new codon or move to the next exon. There is no need for parsers to 'carry forward' the fact that phase 2 points beyond the end of the single-nt exon. All the parser needs to do is concatenate as many nts as the phase describes (up to the length of the exon) to the current codon and move on. Parsers already carry incomplete codons forward from one exon to the next, and the next exon takes care of defining its own phase (as is already the case) to complete any partial codons.
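The rule fits in a few lines. The following is a minimal sketch, assuming the bases of the CDS segment are already available as a string in transcript orientation; `split_by_phase` is a hypothetical helper name, not something taken from the spec or from an existing parser.

```python
def split_by_phase(cds_seq, phase):
    """Split a CDS segment into (bases for the carried codon, remaining bases)."""
    if phase not in (0, 1, 2):
        raise ValueError(f"phase must be 0, 1 or 2, got {phase!r}")
    # Phase is treated as a maximum: never take more bases than the segment has.
    n_carry = min(phase, len(cds_seq))
    return cds_seq[:n_carry], cds_seq[n_carry:]


print(split_by_phase("A", 2))     # ('A', '')
print(split_by_phase("ATGC", 1))  # ('A', 'TGC')
```

For the 1 nt segment with phase 2, the single base goes to the carried codon and nothing is left over, so no new codon starts in that segment.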

In the figure below we have three transcripts, each with 3 exons where the 2nd exon is 1 nt long. The angled hash marks indicate one codon (codonA) and the horizontal hash marks indicate a second codon (codonB). Transcript A is easy: exon 1 completes codonA, exon 2 has Phase=0 and so the parser starts a new codon and adds the first (and only) nt to codonB; exon 3 has Phase=2 to complete codonB. In transcript B, exon 2 has Phase=1, so the parser concatenates <= 1 nt to codonA, and exon 3 with Phase=0 will start a new codon on its first nt. Finally, in transcript C the parser has 1 nt of codonA carried forward from exon 1; exon 2 has Phase=2 so the parser concatenates <= 2 nts to the existing codonA (since exon 2 only has 1 nt, the parser concatenates that 1 nt and moves on); exon 3 has Phase=1 so the parser completes the 3-exon codon. Having a codon spread across 3 exons, as is the case in transcript C, is probably biologically invalid (but hey it's biology, so who knows!), but there is no reason for the spec to prohibit it as far as I can tell, and the existing Phase tag can support it.

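[Figure: transcripts A, B and C, each with three exons, where exon 2 is 1 nt long; angled hash marks show codonA, horizontal hash marks show codonB]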
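The three cases can also be traced in code. This is a sketch under stated assumptions: each transcript is given as a list of (sequence, phase) pairs in transcript order, the sequences are invented placeholders (the single base of exon 2 is lowercase so it is easy to follow), and `assemble_codons` is a hypothetical name. It does not validate that the phases are consistent with each other.

```python
def assemble_codons(segments):
    """Group the bases of (sequence, phase) CDS segments into codons."""
    codons = []
    pending = ""  # partial codon carried forward from earlier segments
    for seq, phase in segments:
        # Phase is a maximum: take at most len(seq) bases for the carried codon.
        n_carry = min(phase, len(seq))
        pending += seq[:n_carry]
        rest = seq[n_carry:]
        if phase < len(seq):
            # A new codon starts inside this segment, so the carried one is closed.
            if pending:
                codons.append(pending)
            pending = ""
        for i in range(0, len(rest), 3):
            chunk = rest[i:i + 3]
            if len(chunk) == 3:
                codons.append(chunk)
            else:
                pending = chunk  # incomplete codon, carried to the next segment
    if pending:
        codons.append(pending)  # trailing (possibly partial) codon
    return codons


transcript_a = [("ATGGCC", 0), ("t", 0), ("CCAAA", 2)]
transcript_b = [("ATGGC", 0), ("t", 1), ("AAACCC", 0)]
transcript_c = [("ATGG", 0), ("t", 2), ("CAAA", 1)]

print(assemble_codons(transcript_a))  # ['ATG', 'GCC', 'tCC', 'AAA']
print(assemble_codons(transcript_b))  # ['ATG', 'GCt', 'AAA', 'CCC']
print(assemble_codons(transcript_c))  # ['ATG', 'GtC', 'AAA']
```

In the transcript C case the codon `GtC` draws one base from each of the three exons, and the parser never needs to remember that Phase=2 pointed past the end of exon 2.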

If exon 2 is 1 nt long because it's truncated by an incomplete assembly then it still doesn't matter that Phase=2, the parser would still concatenate <=2 nts to the existing codon that was carried forward from exon 1. It is up to the annotation tools/curators to correctly specify the Phase of exon 3 and whether exon 2 is a fragment (http://tinyurl.com/ybk4z7wb), so I think there is no need for the parser or spec to do anything special except maybe to clarify this a bit. I'll suggest the following clarification (in emphasis).

Current spec:

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3.

Updated spec:

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. In the special case of a short CDS feature with a length < 3, the Phase may indicate a position beyond the length of the current CDS feature. In that case all nucleotides in the current exon should be added to the codon carried forward from a previous exon, and the Phase of the following exon (if one exists) will allow for completion of the current codon. This means that it is possible for GFF3 to describe a single codon split across 3 different CDS features. While this may not be biologically relevant, this type of annotation has been observed, and parsers can tolerate it by not trying to start a new codon if (phase + 1 > length(exon)). Phase should NOT be confused with the reading frame.
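The condition in the proposed wording can be written as a small predicate. This is a sketch, assuming `phase` and `length` are integers taken from a single CDS line; `starts_new_codon` is a hypothetical name, not part of the spec.

```python
def starts_new_codon(phase, length):
    """Return True if a new codon begins inside a CDS feature of this length."""
    # The proposed text skips starting a codon when phase + 1 > length,
    # which for integers is the same test as phase >= length.
    return not (phase + 1 > length)


for length in (1, 2, 3):
    print(length, [starts_new_codon(phase, length) for phase in (0, 1, 2)])
# 1 [True, False, False]
# 2 [True, True, False]
# 3 [True, True, True]
```

From length 3 upward every valid phase leaves at least one base for a new codon, which is why the special case only concerns features shorter than 3 nt.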