Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Braker produces non-conforming GFFv3 #683

Closed maol-corteva closed 10 months ago

maol-corteva commented 11 months ago

Dear devs, the GFF3 produced by the pipeline does not include stop codons in the CDS coordinates, so some downstream tools and genome browsers fail to parse it or display it incorrectly. I am guessing that the GFF3 is built from an output file in GTF format. Unlike GTF, GFF3 requires these coordinates to be included as part of the CDS:

From: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md " NOTE 5 The START and STOP codons are included in the CDS. That is, if the locations of the start and stop codons are known, the first three base pairs of the CDS should correspond to the start codon and the last three correspond the stop codon. "

PS: I have edited this comment to remove "START" from the sentence at the top, as my earlier statement was wrong. I meant to say that the produced GFF3 file is missing the stop codon coordinates in the CDS and exon features (the start codon is included).
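A quick way to test a file against NOTE 5 is to check, for each transcript, whether its stop_codon feature lies inside one of its CDS features. The following is a minimal sketch, not part of BRAKER: the function names are made up for illustration, it assumes tab-separated GFF3 with a single `Parent` per feature, and it would also flag a stop codon that is split across two CDS segments.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y;' into {'ID': 'x', 'Parent': 'y'}."""
    pairs = (item.split("=", 1) for item in column9.strip().split(";") if "=" in item)
    return {key.strip(): value.strip() for key, value in pairs}


def stop_codons_outside_cds(gff3_lines):
    """Return transcript IDs whose stop_codon is not contained in any CDS feature."""
    cds_spans = defaultdict(list)  # transcript ID -> [(start, end), ...]
    stop_spans = {}                # transcript ID -> (start, end)
    for line in gff3_lines:
        cols = line.rstrip("\n").split("\t")
        if line.startswith("#") or len(cols) != 9:
            continue  # skip comments, directives and malformed lines
        parent = parse_attributes(cols[8]).get("Parent")
        span = (int(cols[3]), int(cols[4]))
        if cols[2] == "CDS":
            cds_spans[parent].append(span)
        elif cols[2] == "stop_codon":
            stop_spans[parent] = span

    offenders = []
    for transcript, (stop_start, stop_end) in stop_spans.items():
        covered = any(
            start <= stop_start and stop_end <= end
            for start, end in cds_spans.get(transcript, [])
        )
        if not covered:
            offenders.append(transcript)
    return offenders


# Hypothetical two-line example: a CDS that ends right before its stop codon.
EXAMPLE = [
    "\t".join(["C0", "AUGUSTUS", "CDS", "16807", "17024", "1", "+", "2",
               "ID=g1.t1.CDS4;Parent=g1.t1;"]),
    "\t".join(["C0", "AUGUSTUS", "stop_codon", "17025", "17027", ".", "+", "0",
               "ID=g1.t1.stop1;Parent=g1.t1;"]),
]
print(stop_codons_outside_cds(EXAMPLE))
```

To scan a whole file, the same function can be called on an open file handle, e.g. `stop_codons_outside_cds(open("braker.gff3"))`.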

LarsGab commented 11 months ago

Hi,

the start_codon and stop_codon features should be included for all transcripts in the braker.gtf and braker.gff3 files. Which versions of BRAKER and TSEBRA are you using?

Best, Lars

maol-corteva commented 11 months ago

Hi Lars, thanks for responding.

Our container image was built on July 10, 2023 from the master branch of this repository (https://github.com/Gaius-Augustus/BRAKER), so it probably falls between the 3.0.2 and 3.0.3 releases, and it contains everything (TSEBRA, etc.). It is heavily based on the example Dockerfile in the same distribution (the major differences are the replacement of gmes with the latest commercially licensed version and the addition of the license key provided to us by your commercial team). We get the same missing stop codons in "braker.gff3" whether we start the pipeline with only RNA-seq or with proteins.

Here is an excerpt from braker.gff3 (the top gene in the file). Notice that the stop_codon is listed on the last line of the block, but its coordinates are included in neither the CDS nor the exon features (just as you would expect in a GTF file, but this is supposed to be GFF3). This happens throughout the entire file:

## wrongly formatted GFF3 follows:
C0      AUGUSTUS        gene          14289   17027   .       +       .       ID=g1;
C0      AUGUSTUS        mRNA          14289   17027   1       +       .       ID=g1.t1;Parent=g1;
C0      AUGUSTUS        start_codon   14289   14291   .       +       0       ID=g1.t1.start1;Parent=g1.t1;
C0      AUGUSTUS        CDS           14289   15903   1       +       0       ID=g1.t1.CDS1;Parent=g1.t1;
C0      AUGUSTUS        exon          14289   15903   .       +       .       ID=g1.t1.exon1;Parent=g1.t1;
C0      AUGUSTUS        intron        15904   15960   1       +       .       ID=g1.t1.intron1;Parent=g1.t1;
C0      AUGUSTUS        CDS           15961   16317   1       +       2       ID=g1.t1.CDS2;Parent=g1.t1;
C0      AUGUSTUS        exon          15961   16317   .       +       .       ID=g1.t1.exon2;Parent=g1.t1;
C0      AUGUSTUS        intron        16318   16374   1       +       .       ID=g1.t1.intron2;Parent=g1.t1;
C0      AUGUSTUS        CDS           16375   16749   1       +       2       ID=g1.t1.CDS3;Parent=g1.t1;
C0      AUGUSTUS        exon          16375   16749   .       +       .       ID=g1.t1.exon3;Parent=g1.t1;
C0      AUGUSTUS        intron        16750   16806   1       +       .       ID=g1.t1.intron3;Parent=g1.t1;
C0      AUGUSTUS        CDS           16807   17024   1       +       2       ID=g1.t1.CDS4;Parent=g1.t1;
C0      AUGUSTUS        exon          16807   17024   .       +       .       ID=g1.t1.exon4;Parent=g1.t1;
C0      AUGUSTUS        stop_codon    17025   17027   .       +       0       ID=g1.t1.stop1;Parent=g1.t1;

The block above and the corresponding one in braker.gtf are identical coordinate-wise, which leads me to believe that the GFF3 coordinates are simply copied from the GTF version. A corrected version would extend the last CDS, which currently ends at 17024, to the end of the stop_codon at 17027, so that the CDS features together span the whole range [14289..17027]. The same applies to the last exon in this case.
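The correction described above can be sketched as a small post-processing step. This is only an illustration under stated assumptions, not BRAKER's or TSEBRA's converter: the function name is hypothetical, the input is assumed to be tab-separated GFF3 with a single `Parent` per feature, and a stop codon split across an intron is left untouched. A CDS or exon is extended only when it ends exactly one base before the stop codon (or, on the minus strand, starts one base after it).

```python
def extend_terminal_cds(gff3_lines):
    """Return GFF3 lines in which a CDS or exon that directly abuts its
    transcript's stop_codon is stretched to include that stop codon."""
    rows = [line.rstrip("\n").split("\t") for line in gff3_lines]

    def parent(row):
        # Extract the Parent attribute from column 9, if present.
        for item in row[8].split(";"):
            if item.startswith("Parent="):
                return item[len("Parent="):]
        return None

    # Map transcript ID -> (start, end, strand) of its stop_codon feature.
    stops = {}
    for row in rows:
        if len(row) == 9 and row[2] == "stop_codon":
            stops[parent(row)] = (int(row[3]), int(row[4]), row[6])

    for row in rows:
        if len(row) != 9 or row[2] not in ("CDS", "exon"):
            continue  # comments and other feature types pass through unchanged
        stop = stops.get(parent(row))
        if stop is None:
            continue
        stop_start, stop_end, strand = stop
        if strand == "+" and int(row[4]) == stop_start - 1:
            row[4] = str(stop_end)    # move the feature end to the stop codon end
        elif strand == "-" and int(row[3]) == stop_end + 1:
            row[3] = str(stop_start)  # move the feature start to the stop codon start
    return ["\t".join(row) for row in rows]


# The last three features of the excerpt above, rebuilt with tab separators.
TAIL = [
    "\t".join(["C0", "AUGUSTUS", "CDS", "16807", "17024", "1", "+", "2",
               "ID=g1.t1.CDS4;Parent=g1.t1;"]),
    "\t".join(["C0", "AUGUSTUS", "exon", "16807", "17024", ".", "+", ".",
               "ID=g1.t1.exon4;Parent=g1.t1;"]),
    "\t".join(["C0", "AUGUSTUS", "stop_codon", "17025", "17027", ".", "+", "0",
               "ID=g1.t1.stop1;Parent=g1.t1;"]),
]
for fixed_line in extend_terminal_cds(TAIL):
    print(fixed_line)
```

Exons that already extend past the stop codon (for example, when a 3' UTR is annotated) do not abut it and are therefore left as they are.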

LarsGab commented 11 months ago

Hi,

in BRAKER 3.0.3, we fixed some issues with the conversion from GTF to GFF3 format. In my test examples, I haven't encountered this issue, and both the start and stop codons are present in the first/last CDS. The GTF includes these codons in the CDS features as well, and this should have been the case even before 3.0.3. Updating your BRAKER version might resolve the issue.

Best, Lars

KatharinaHoff commented 10 months ago

I assume that this issue was resolved. If not, please re-open.