The-Sequence-Ontology / Specifications

GFF and GVF specification documents

Difference between phase and frame is unclear in the GFF3 spec #20

Closed jbethune closed 3 years ago

jbethune commented 5 years ago

https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md#readme

In section Column 8: "phase" it says:

> This is NOT to be confused with the frame, which is simply start modulo 3.

What does "start" refer to? The start column of the GFF3 file? In that case it would be the start position of the entire chromosome? Or does it refer to the start position of the start-codon?

An example that shows the differences between phase and frame would be appreciated.

keilbeck commented 5 years ago

Hi, thank you for bringing up this ambiguity. The start in this situation is the start of the codon. @srynobio @barrymoore Do either of you have a good turn of phrase to clean up the last sentence of the first phase paragraph? Thanks

Column 8: "phase"

For features of type "CDS", the phase indicates where the feature begins with reference to the reading frame. The phase is one of the integers 0, 1, or 2, indicating the number of bases that should be removed from the beginning of this feature to reach the first base of the next codon. In other words, a phase of "0" indicates that the next codon begins at the first base of the region described by the current line, a phase of "1" indicates that the next codon begins at the second base of this region, and a phase of "2" indicates that the codon begins at the third base of this region. This is NOT to be confused with the frame, which is simply start modulo 3.

For forward strand features, phase is counted from the start field. For reverse strand features, phase is counted from the end field.

The phase is REQUIRED for all CDS features.
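
The two strand rules quoted above can be sketched in a few lines of Python. This is only an illustration, not part of the spec: the function name `first_codon_base` is hypothetical, and coordinates are assumed to be 1-based and inclusive, as in the GFF3 start and end columns.

```python
def first_codon_base(start: int, end: int, strand: str, phase: int) -> int:
    """Genomic coordinate of the first base of the next complete codon.

    Hypothetical helper; start/end are the 1-based inclusive GFF3 columns.
    """
    if phase not in (0, 1, 2):
        raise ValueError("phase must be 0, 1, or 2")
    if strand == "+":
        # Forward strand: phase is counted from the start field.
        return start + phase
    if strand == "-":
        # Reverse strand: phase is counted from the end field.
        return end - phase
    raise ValueError("strand must be '+' or '-'")


print(first_codon_base(1201, 1500, "+", 1))  # 1202
print(first_codon_base(1201, 1500, "-", 2))  # 1498
```

The coordinates 1201 and 1500 are made up for the example.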

barrymoore commented 5 years ago

Good point - the current wording is rather convoluted. Here’s my attempt to clarify this a bit:

For features of type "CDS", the phase indicates where the next codon begins relative to the start of the current CDS feature. The phase is one of the integers 0, 1, or 2, indicating the number of bases forward from the start of the current CDS feature the next codon begins. A phase of "0" indicates that a codon begins on the first base of the CDS feature (i.e. 0 bases forward), a phase of "1" indicates that the next codon begins at the second base of this region and a phase of "2" indicates that the codon begins at the third base of this region. Note that ‘Phase’ in the context of a GFF3 CDS feature should not be confused with the similar concept of frame that is also a common concept in bioinformatics. Frame is generally calculated as a value for a given base relative to the start of a codon (e.g. modulo 3) while CDS phase describes the start of the next codon relative to a given CDS feature.

Hmm, did I make that clearer or just obfuscate it with different words :)

Barry


jbethune commented 5 years ago

This seems to be really difficult to put into words. Let me try a formulation of my own based on my current understanding:

For features of type "CDS", the phase tells us if the CDS begins with a complete codon (phase=0) or with an incomplete codon (phase=1 or phase=2). Incomplete codons are split across two exons and become complete after splicing. The phase tells us how many nucleotides we have to move towards a larger or smaller genomic position to get to the first complete codon in the genomic sequence of this CDS. On the plus strand we need to move from the start position to a larger genomic position and on the minus strand we need to move from the end position to a smaller genomic position.

The following table shows the different situations:

| phase | strand | meaning of start/end column | move towards complete codon |
|---|---|---|---|
| 0 | + | start = first base of complete codon | no move needed |
| 1 | + | start = third base of incomplete codon | move up from start position by 1 nucleotide |
| 2 | + | start = second base of incomplete codon | move up from start position by 2 nucleotides |
| 0 | - | end = third base of complete codon | no move needed |
| 1 | - | end = first base of incomplete codon | move down from end position by 1 nucleotide |
| 2 | - | end = second base of incomplete codon | move down from end position by 2 nucleotides |

Positions are always inclusive. The phase is REQUIRED for all CDS features.
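
To make the "split across two exons" point concrete, here is a rough Python sketch that derives the phase of each CDS segment of one transcript from the segment lengths. It is an illustration under stated assumptions, not the spec's wording: the name `cds_phases` is hypothetical, and the sketch assumes the first segment in transcript order begins with a complete codon (phase 0), which does not hold for partial annotations.

```python
def cds_phases(segments, strand):
    """Map each (start, end) CDS segment to its phase.

    segments: 1-based inclusive (start, end) tuples of one transcript.
    Assumes the first segment in transcript order has phase 0.
    """
    # Transcript order is ascending on '+', descending on '-'.
    ordered = sorted(segments, reverse=(strand == "-"))
    phases = {}
    phase = 0
    for start, end in ordered:
        phases[(start, end)] = phase
        length = end - start + 1
        # Bases left at the end of this segment that start an incomplete codon.
        leftover = (length - phase) % 3
        # The next segment must supply the rest of that codon first.
        phase = (3 - leftover) % 3
    return phases


segments = [(101, 200), (301, 400), (501, 600)]  # made-up coordinates
print(cds_phases(segments, "+"))
# {(101, 200): 0, (301, 400): 2, (501, 600): 1}
print(cds_phases(segments, "-"))
# {(501, 600): 0, (301, 400): 2, (101, 200): 1}
```

Note that on the minus strand the segment with the largest coordinates comes first in the transcript, so it is the one that gets phase 0.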

The phase should not be confused with the reading frame. The reading frame refers to the genomic distance to the start codon of this gene modulo 3 regardless of where introns are located.
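
Taking the frame definition in the previous paragraph at face value, a small Python sketch shows how the two numbers can differ for the same CDS segment. Both function names are hypothetical, the gene is made up, and only the plus strand is handled; the point is that the frame value depends on intron length while the phase depends only on the coding bases spliced before the segment.

```python
def genomic_frame(position, start_codon_pos):
    """Genomic distance to the start codon modulo 3, introns included (plus strand)."""
    return (position - start_codon_pos) % 3


def phase_from_upstream_cds(upstream_cds_length):
    """Phase of a CDS segment, given the total CDS length spliced before it."""
    return (3 - upstream_cds_length % 3) % 3


# Hypothetical plus-strand gene: CDS segments 1..10 and 22..30, intron 11..21.
first_cds = (1, 10)
second_cds = (22, 30)
upstream = first_cds[1] - first_cds[0] + 1  # 10 coding bases before segment 2

print(genomic_frame(second_cds[0], first_cds[0]))  # 0
print(phase_from_upstream_cds(upstream))           # 2
```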

Is my understanding correct?

edit: Added table with the 6 different cases.

barrymoore commented 5 years ago

Your description is correct, except that I think it’s worth clarifying your use of 5’ to 3’, and this also addresses your question about the minus strand.

The CDS Phase is always relative to the strand on which the containing transcript lies. So your statement about 5’ to 3’ is only accurate if you’re referring to the mRNA. For minus strand transcripts the genomic direction would be 3’ to 5’.

Thanks for the discussion. I’ve added some additional language to help clarify the minus/plus strand issues.

Spec is updated to 1.25. Feedback welcomed.

Regards

Barry

On Sep 23, 2019, at 2:42 AM, jbethune wrote:

This seems to be really difficult to put into words. Let me try a formulation of my own based on my current understanding:

For features of type "CDS", the phase tells us if the CDS starts with a complete codon (phase=0) or with an incomplete codon (phase=1 or phase=2). Incomplete codons are split across two exons and become complete after splicing. The phase tells you how many nucleotides you have to move in the 5' to 3' direction to get to the first complete codon in the genomic sequence of this CDS. For example, phase=1 means that you have to move 1 nucleotide and that you are in the 3rd base of an incomplete codon. Phase=2 means that you have to move 2 nucleotides and that you are in the 2nd base of an incomplete codon. Phase 0 means that you are already on the first base of a complete codon.

The phase should not be confused with the reading frame. The reading frame refers to the genomic distance to the start codon of this gene modulo 3 regardless of where introns are located.

Is my understanding correct? I am also unsure about the minus strand. It would be really good if the specification also explains what the phase means for the 3 cases on the minus strand.
