Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

gmst same coords for begin and end for several start or stop codons #751

Open jbh-cas opened 5 months ago

jbh-cas commented 5 months ago

After running braker.pl version 3.0.3 in ETP mode on a genome, I occasionally see gmst exon, CDS and start_codon features whose begin and end coordinates are identical (field 4 equals field 5), or exon, CDS and stop_codon features with the same field 4 and field 5 values. Is there a newer version that fixes this? Thanks for any insights.

--Jim Henderson

Here is a list of the affected lines in this GFF, found simply by looking for lines where field 4 equals field 5:

$ awk '$4==$5' braker.gff3
Chr1_Sni    gmst    CDS 240778583   240778583   63.101682   +   0   ID=g3826.t1.CDS1;Parent=g3826.t1;
Chr1_Sni    gmst    exon    240778583   240778583   63.101682   +   0   ID=g3826.t1.exon1;Parent=g3826.t1;
Chr1_Sni    gmst    start_codon 240778583   240778583   63.101682   +   0   ID=g3826.t1.start1;Parent=g3826.t1;
Chr2_Sni    gmst    CDS 1108821 1108821 106.596713  -   1   ID=g4231.t2.CDS1;Parent=g4231.t2;
Chr2_Sni    gmst    exon    1108821 1108821 106.596713  -   1   ID=g4231.t2.exon1;Parent=g4231.t2;
Chr2_Sni    gmst    stop_codon  1108821 1108821 106.596713  -   0   ID=g4231.t2.stop1;Parent=g4231.t2;
Chr2_Sni    gmst    CDS 115730838   115730838   41.396604   -   0   ID=g5023.t1.CDS4;Parent=g5023.t1;
Chr2_Sni    gmst    exon    115730838   115730838   41.396604   -   0   ID=g5023.t1.exon4;Parent=g5023.t1;
Chr2_Sni    gmst    start_codon 115730838   115730838   41.396604   -   0   ID=g5023.t1.start1;Parent=g5023.t1;
Chr3_Sni    gmst    CDS 172710270   172710270   51.222308   -   1   ID=g7266.t1.CDS1;Parent=g7266.t1;
Chr3_Sni    gmst    exon    172710270   172710270   51.222308   -   1   ID=g7266.t1.exon1;Parent=g7266.t1;
Chr3_Sni    gmst    stop_codon  172710270   172710270   51.222308   -   0   ID=g7266.t1.stop1;Parent=g7266.t1;
Chr5_Sni    gmst    CDS 75854187    75854187    263.010526  +   1   ID=g9616.t1.CDS22;Parent=g9616.t1;
Chr5_Sni    gmst    exon    75854187    75854187    263.010526  +   1   ID=g9616.t1.exon22;Parent=g9616.t1;
Chr5_Sni    gmst    stop_codon  75854187    75854187    263.010526  +   0   ID=g9616.t1.stop1;Parent=g9616.t1;
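
The same check can be grouped by transcript, which makes it easier to see that each hit is a CDS/exon/codon triple on one transcript. The following is a minimal Python sketch, not part of BRAKER. The function name `single_base_features` is hypothetical, and it assumes whitespace-separated columns and attribute values without spaces, as in the lines above.

```python
from collections import defaultdict


def single_base_features(lines):
    """Group features whose start equals end (GFF fields 4 and 5) by Parent."""
    hits = defaultdict(list)
    for line in lines:
        # Skip blank lines and GFF comment/pragma lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < 9:
            continue
        ftype, start, end = fields[2], int(fields[3]), int(fields[4])
        if start == end:
            # Column 9 holds key=value pairs separated by ';'.
            attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
            hits[attrs.get("Parent", "?")].append((ftype, start))
    return dict(hits)


# Sample lines copied from the listings in this issue.
sample = [
    "Chr1_Sni gmst CDS 240778583 240778583 63.101682 + 0 ID=g3826.t1.CDS1;Parent=g3826.t1;",
    "Chr1_Sni gmst exon 240778583 240778583 63.101682 + 0 ID=g3826.t1.exon1;Parent=g3826.t1;",
    "Chr1_Sni gmst start_codon 240778583 240778583 63.101682 + 0 ID=g3826.t1.start1;Parent=g3826.t1;",
    "Chr3_Sni gmst CDS 172711738 172711925 51.222308 - 0 ID=g7266.t1.CDS2;Parent=g7266.t1;",
]

for parent, features in single_base_features(sample).items():
    print(parent, features)
```

On a real file, the call would be `single_base_features(open("braker.gff3"))`.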

Here are the data for one of these examples, g7266.t1, where 172710270 appears as both begin and end in the CDS, exon and stop_codon lines. All 5 gmst problems are similar: fields 4 and 5 have the same value at the start or the stop.

braker.gff3
Chr3_Sni        gmst    gene    172710270       172722165       .       -       .       ID=g7266;
Chr3_Sni        gmst    mRNA    172710270       172722165       .       -       .       ID=g7266.t1;Parent=g7266;
Chr3_Sni        gmst    CDS     172710270       172710270       51.222308       -       1       ID=g7266.t1.CDS1;Parent=g7266.t1;
Chr3_Sni        gmst    exon    172710270       172710270       51.222308       -       1       ID=g7266.t1.exon1;Parent=g7266.t1;
Chr3_Sni        gmst    stop_codon      172710270       172710270       51.222308       -       0       ID=g7266.t1.stop1;Parent=g7266.t1;
Chr3_Sni        gmst    intron  172710271       172711737       51.222308       -       0       ID=g7266.t1.intron1;Parent=g7266.t1;
Chr3_Sni        gmst    CDS     172711738       172711925       51.222308       -       0       ID=g7266.t1.CDS2;Parent=g7266.t1;
Chr3_Sni        gmst    exon    172711738       172711925       51.222308       -       0       ID=g7266.t1.exon2;Parent=g7266.t1;
Chr3_Sni        gmst    intron  172711926       172712738       51.222308       -       0       ID=g7266.t1.intron2;Parent=g7266.t1;
Chr3_Sni        gmst    CDS     172712739       172712858       51.222308       -       0       ID=g7266.t1.CDS3;Parent=g7266.t1;
Chr3_Sni        gmst    exon    172712739       172712858       51.222308       -       0       ID=g7266.t1.exon3;Parent=g7266.t1;
Chr3_Sni        gmst    intron  172712859       172721901       51.222308       -       0       ID=g7266.t1.intron3;Parent=g7266.t1;
Chr3_Sni        gmst    CDS     172721902       172722165       51.222308       -       0       ID=g7266.t1.CDS4;Parent=g7266.t1;
Chr3_Sni        gmst    exon    172721902       172722165       51.222308       -       0       ID=g7266.t1.exon4;Parent=g7266.t1;
Chr3_Sni        gmst    start_codon     172722163       172722165       51.222308       -       0       ID=g7266.t1.start1;Parent=g7266.t1;

braker.codingseq
>g7266.t1
ATGTGGGTACTGGTGGCTTTGCTGGCGGCGGCTGCGGGGGCTCTGGGGATCCCTCCGCAC
GAGGACGCGGCTCGGGTCGCGCGCTTCGTGGTGCACTCGTGCAACTGGGGGGCGCTGGCA
ACGCTCTCATCGCAGGATCCCCCGATGCGGGGCCAGCCCTTCTCCAACGTCTTCTCCGTC
AGCGACGGCCCAGCGACGACCTCAGGCACGGGGGTGCCCTACATGTACCTGACCGGCCTG
GATGTCTCCGTACACGACCTGCAGGTGAATGCAAATGCCTCCCTAACAATGTCCTTGGCA
CAGACTTCTTACTGCAAGAGCAAAGGTTATGATCCCCAGAGTCCTCTATGTGCCCATGTG
ATCTTCTCAGGGGTAGTTGAGAAGGTCCCAAATGGCACAGAAACAGACTTTGCCAAAATA
GCACTGTTCAGCAGACATCCTGAAATGGCTTCATGGCCACCAGACCATAATTGGTACTTT
GCCAAACTCAACATCACTAATGTCTGGGTCCTGGACTACTTTGGTGGAATCAAAACTGTG
ACACCAGAAGACTATTTTAATGCTACACCCTAG

braker.aa
>g7266.t1
MWVLVALLAAAAGALGIPPHEDAARVARFVVHSCNWGALATLSSQDPPMRGQPFSNVFSV
SDGPATTSGTGVPYMYLTGLDVSVHDLQVNANASLTMSLAQTSYCKSKGYDPQSPLCAHV
IFSGVVEKVPNGTETDFAKIALFSRHPEMASWPPDHNWYFAKLNITNVWVLDYFGGIKTV
TPEDYFNATP*
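
As a quick consistency check on g7266.t1, the CDS segment lengths can be summed and compared with the lengths of the braker.codingseq and braker.aa records above. The sketch below is Python. The coordinates and phases are copied from the GFF excerpt, and the sequence lengths are counted from the FASTA records, which wrap at 60 characters per line. The helper names are hypothetical.

```python
# CDS segments of g7266.t1 (minus strand), copied from the braker.gff3 excerpt.
# Listed in transcription order: from the start codon towards the stop codon.
cds_segments = [
    ("CDS4", 172721902, 172722165, 0),
    ("CDS3", 172712739, 172712858, 0),
    ("CDS2", 172711738, 172711925, 0),
    ("CDS1", 172710270, 172710270, 1),
]

# Lengths counted from the FASTA records in this issue.
CODINGSEQ_LEN = 9 * 60 + 33   # nine full lines plus a 33-base last line
PROTEIN_LEN = 3 * 60 + 10     # residues, excluding the trailing '*'


def segment_length(start, end):
    """GFF coordinates are 1-based and inclusive."""
    return end - start + 1


def expected_phases(segments):
    """Phase = bases to skip at the segment start to reach the next codon."""
    phases = []
    consumed = 0
    for _name, start, end, _phase in segments:
        phases.append((3 - consumed % 3) % 3)
        consumed += segment_length(start, end)
    return phases


lengths = [segment_length(s, e) for _n, s, e, _p in cds_segments]
total = sum(lengths)

print("segment lengths:", lengths)
print("total CDS length:", total)
print("matches codingseq length:", total == CODINGSEQ_LEN)
print("matches protein + stop:", total == (PROTEIN_LEN + 1) * 3)
print("phases agree with GFF:", expected_phases(cds_segments) == [p for *_x, p in cds_segments])
```

The totals agree, so the one-base CDS1 is needed to complete the final codon. This is consistent with the stop codon being split across intron1, with only its last base reported in the stop_codon line. That reading is an interpretation of the numbers, not confirmed behaviour of gmst or BRAKER.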