Closed maol-corteva closed 10 months ago
Hi,
the start_codon and stop_codon features should be included for all transcripts in the braker.gtf and braker.gff3 files.
Which versions of BRAKER and TSEBRA are you using?
Best, Lars
Hi Lars, thanks for responding.
Our container image was built on July 10, 2023 from the master git branch at this site (https://github.com/Gaius-Augustus/BRAKER), so it is probably somewhere between the 3.0.2 and 3.0.3 releases, and it contains everything (TSEBRA, etc.). It is heavily based on the Dockerfile example in the same distribution (the major difference is that we replaced gmes with the latest commercially licensed version and added the license key provided to us by your commercial team). We get the same missing stop codon issue in "braker.gff3" whether we start the pipeline with only RNA-seq or with proteins.
Here is an excerpt from braker.gff3 (the top gene in the file). Notice that the stop_codon is listed on the last line of the block, but its coordinates are not included in the CDS or the exon features (just as you would expect in a GTF file, but this is supposed to be GFF3). This happens throughout the entire file:
## wrongly formatted GFF3 follows:
C0 AUGUSTUS gene 14289 17027 . + . ID=g1;
C0 AUGUSTUS mRNA 14289 17027 1 + . ID=g1.t1;Parent=g1;
C0 AUGUSTUS start_codon 14289 14291 . + 0 ID=g1.t1.start1;Parent=g1.t1;
C0 AUGUSTUS CDS 14289 15903 1 + 0 ID=g1.t1.CDS1;Parent=g1.t1;
C0 AUGUSTUS exon 14289 15903 . + . ID=g1.t1.exon1;Parent=g1.t1;
C0 AUGUSTUS intron 15904 15960 1 + . ID=g1.t1.intron1;Parent=g1.t1;
C0 AUGUSTUS CDS 15961 16317 1 + 2 ID=g1.t1.CDS2;Parent=g1.t1;
C0 AUGUSTUS exon 15961 16317 . + . ID=g1.t1.exon2;Parent=g1.t1;
C0 AUGUSTUS intron 16318 16374 1 + . ID=g1.t1.intron2;Parent=g1.t1;
C0 AUGUSTUS CDS 16375 16749 1 + 2 ID=g1.t1.CDS3;Parent=g1.t1;
C0 AUGUSTUS exon 16375 16749 . + . ID=g1.t1.exon3;Parent=g1.t1;
C0 AUGUSTUS intron 16750 16806 1 + . ID=g1.t1.intron3;Parent=g1.t1;
C0 AUGUSTUS CDS 16807 17024 1 + 2 ID=g1.t1.CDS4;Parent=g1.t1;
C0 AUGUSTUS exon 16807 17024 . + . ID=g1.t1.exon4;Parent=g1.t1;
C0 AUGUSTUS stop_codon 17025 17027 . + 0 ID=g1.t1.stop1;Parent=g1.t1;
The block above and the corresponding one in braker.gtf are identical coordinate-wise, which leads me to believe that the GFF3 coordinates are a simple copy of the GTF version. A corrected version would extend the end of the last CDS from 17024 to the end of the stop_codon at 17027, so that the CDS features cover the whole range [14289..17027]. Ditto for the last exon in this case.
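The correction described above can be sketched as a small post-processing step. This is only an illustration, not part of BRAKER or TSEBRA: the function names `parent_of` and `extend_over_stop` are made up here, the input is assumed to be tab-separated GFF3 lines with a `Parent=` attribute on each sub-feature, and stop codons split across an intron are not handled.

```python
def parent_of(attributes):
    """Return the Parent value from a GFF3 attribute column, or None."""
    for field in attributes.split(";"):
        key, _, value = field.strip().partition("=")
        if key == "Parent":
            return value
    return None


def extend_over_stop(lines):
    """Extend the terminal CDS/exon of each transcript over an adjacent stop_codon.

    Hypothetical helper: assumes 9-column, tab-separated GFF3 feature lines.
    Comment lines and anything without 9 columns pass through unchanged.
    """
    rows = [line.rstrip("\n").split("\t") for line in lines]

    # First pass: remember (start, end, strand) of each transcript's stop codon.
    stops = {}
    for row in rows:
        if len(row) == 9 and row[2] == "stop_codon":
            parent = parent_of(row[8])
            if parent is not None:
                stops[parent] = (int(row[3]), int(row[4]), row[6])

    # Second pass: stretch a CDS or exon only if it ends directly next to the stop.
    for row in rows:
        if len(row) != 9 or row[2] not in ("CDS", "exon"):
            continue
        stop = stops.get(parent_of(row[8]))
        if stop is None:
            continue
        stop_start, stop_end, strand = stop
        if strand == "+" and int(row[4]) == stop_start - 1:
            row[4] = str(stop_end)      # e.g. 17024 becomes 17027
        elif strand == "-" and int(row[3]) == stop_end + 1:
            row[3] = str(stop_start)    # on the minus strand the stop lies upstream in coordinates

    return ["\t".join(row) for row in rows]
```

Applied to the excerpt above, only g1.t1.CDS4 and g1.t1.exon4 would change (end 17024 to 17027). An exon that already extends past the stop codon, for example because it contains UTR, is left alone since it is not directly adjacent to the stop.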
Hi,
in BRAKER 3.0.3, we fixed some issues with the conversion from GTF to GFF3 format. In my test examples, I haven't encountered this issue, and both the start and stop codons are present in the first/last CDS. The GTF includes these codons in the CDS features as well, and this should have been the case even before 3.0.3. Updating your BRAKER version might resolve the issue.
Best, Lars
I assume that this issue was resolved. If not, please re-open.
Dear devs. The GFF3 produced by the pipeline does not include stop codons in the CDS coordinates, so some downstream tools and genome browsers fail to parse it or display it incorrectly. I am guessing that the GFF3 is being built from an output file in GTF format. Unlike GTF, GFF3 requires these coordinates to be included as part of the CDS:
From: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md " NOTE 5 The START and STOP codons are included in the CDS. That is, if the locations of the start and stop codons are known, the first three base pairs of the CDS should correspond to the start codon and the last three correspond the stop codon. "
PS: I have edited this comment to remove "START" from the first sentence, as my earlier statement was wrong. I meant to say that the produced GFF3 files are missing the stop codon coordinates in the CDS and exon features (the start codon is included).
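For anyone who wants to check their own output against the NOTE 5 rule quoted above, here is a minimal sketch. It is an assumption-laden illustration rather than an official validator: the function name `stops_outside_cds` is made up, and it expects tab-separated GFF3 lines in which stop_codon and CDS features share a `Parent=` transcript ID.

```python
from collections import defaultdict


def stops_outside_cds(lines):
    """Return sorted transcript IDs whose stop_codon is not inside any CDS.

    Hypothetical helper: assumes 9-column, tab-separated GFF3 feature lines.
    """
    cds = defaultdict(list)   # transcript ID -> list of (start, end) CDS intervals
    stops = {}                # transcript ID -> (start, end) of the stop codon

    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # skip comments, directives, and malformed lines
        parent = None
        for field in cols[8].split(";"):
            key, _, value = field.strip().partition("=")
            if key == "Parent":
                parent = value
        if parent is None:
            continue
        interval = (int(cols[3]), int(cols[4]))
        if cols[2] == "CDS":
            cds[parent].append(interval)
        elif cols[2] == "stop_codon":
            stops[parent] = interval

    # A stop codon passes if some CDS interval fully contains it.
    bad = []
    for transcript, (s_start, s_end) in stops.items():
        covered = any(c_start <= s_start and s_end <= c_end
                      for c_start, c_end in cds[transcript])
        if not covered:
            bad.append(transcript)
    return sorted(bad)
```

An empty result means every annotated stop codon falls within a CDS of its transcript; for a file like the one in this report, the list would contain the affected transcript IDs such as g1.t1.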