Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

stringtie2utr generates multiple five_prime_UTRs and three_prime_UTRs within a gene #723

Closed spoonbender76 closed 9 months ago

spoonbender76 commented 9 months ago

Hi,

I tried stringtie2utr.py (https://github.com/Gaius-Augustus/BRAKER/blob/utr_from_stringtie/scripts/stringtie2utr.py) with the GeneMark-ETP/rnaseq/stringtie/transcripts_merged.gff file to add UTRs to braker.gtf.

However, multiple five_prime_UTR and three_prime_UTR features are generated within a single gene, the same issue as in https://github.com/Gaius-Augustus/BRAKER/issues/716#issuecomment-1853508675.

Here are some examples.

chr01   AUGUSTUS    gene    1265570 1337711 .   +   .   g25
chr01   AUGUSTUS    transcript  1265570 1337711 1   +   .   g25.t1
chr01   stringtie2utr   five_prime_UTR  1265570 1265641 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1267245 1267346 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1268300 1268427 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1268857 1269048 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1270085 1270362 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1271057 1271273 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1273003 1273117 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1274180 1274306 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1275368 1275508 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1276316 1276514 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1277498 1277613 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1279421 1279738 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1281067 1281465 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1283176 1283443 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1284457 1284568 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1287752 1287821 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1288516 1288661 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1289245 1289401 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1289880 1290078 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1290804 1291036 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   five_prime_UTR  1291414 1292379 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    start_codon 1292380 1292382 .   +   0   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    CDS 1292380 1292518 1   +   0   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    exon    1292380 1292518 .   +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    intron  1292519 1293897 1   +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    CDS 1293898 1294256 1   +   2   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    exon    1293898 1294256 .   +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   AUGUSTUS    stop_codon  1294254 1294256 .   +   0   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   three_prime_UTR 1294257 1294738 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   stringtie2utr   three_prime_UTR 1337331 1337711 1000    +   .   transcript_id "g25.t1"; gene_id "g25";
chr01   gmst    gene    1600956 1659382 .   -   .   g39
chr01   gmst    transcript  1600956 1659382 .   -   .   g39.t1
chr01   stringtie2utr   three_prime_UTR 1600956 1601209 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1601856 1601983 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1602513 1602581 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1603205 1603301 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1612960 1613142 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1613778 1613862 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1630424 1630588 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   three_prime_UTR 1641347 1641473 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    stop_codon  1641474 1641476 24.335131   -   0   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    CDS 1641474 1641483 24.335131   -   1   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    exon    1641474 1641483 24.335131   -   1   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    intron  1641484 1643629 24.335131   -   0   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    CDS 1643630 1643765 24.335131   -   2   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    exon    1643630 1643765 24.335131   -   2   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    intron  1643766 1646726 24.335131   -   0   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    CDS 1646727 1646898 24.335131   -   0   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    exon    1646727 1646898 24.335131   -   0   transcript_id "g39.t1"; gene_id "g39";
chr01   gmst    start_codon 1646896 1646898 24.335131   -   0   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   five_prime_UTR  1646899 1646901 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   five_prime_UTR  1656810 1656979 1000    -   .   transcript_id "g39.t1"; gene_id "g39";
chr01   stringtie2utr   five_prime_UTR  1659327 1659382 1000    -   .   transcript_id "g39.t1"; gene_id "g39"

KatharinaHoff commented 9 months ago

These are two UTRs per transcript, and each UTR is spliced. This is not an error: it results from the StringTie assembly and from the location of the protein-coding gene within that assembled transcript. Or are there overlapping coordinates that I overlooked?
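
One way to answer the overlap question is to group the UTR features by transcript and compare neighbouring segments. The following is a minimal sketch, not part of stringtie2utr.py: the function names `parse_utr_segments` and `overlapping_pairs` are hypothetical, and the parser assumes whitespace- or tab-separated GTF lines with a `transcript_id "..."` attribute, as in the records posted above.

```python
import re
from collections import defaultdict

UTR_TYPES = ("five_prime_UTR", "three_prime_UTR")

def parse_utr_segments(gtf_lines):
    """Collect (start, end) UTR segments per (transcript_id, feature type)."""
    segments = defaultdict(list)
    for line in gtf_lines:
        # Split on any whitespace; maxsplit=8 keeps the attribute column in one piece.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] not in UTR_TYPES:
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            segments[(match.group(1), fields[2])].append((int(fields[3]), int(fields[4])))
    return segments

def overlapping_pairs(intervals):
    """Return adjacent pairs of start-sorted intervals that share at least one base."""
    ordered = sorted(intervals)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b[0] <= a[1]]

# Three 5' UTR records of g39.t1, copied from the example in this issue.
example = [
    'chr01\tstringtie2utr\tfive_prime_UTR\t1646899\t1646901\t1000\t-\t.\ttranscript_id "g39.t1"; gene_id "g39";',
    'chr01\tstringtie2utr\tfive_prime_UTR\t1656810\t1656979\t1000\t-\t.\ttranscript_id "g39.t1"; gene_id "g39";',
    'chr01\tstringtie2utr\tfive_prime_UTR\t1659327\t1659382\t1000\t-\t.\ttranscript_id "g39.t1"; gene_id "g39"',
]

for key, intervals in parse_utr_segments(example).items():
    print(key, len(intervals), "segments; overlapping pairs:", overlapping_pairs(intervals))
```

Checking only adjacent pairs after sorting is enough to detect whether any overlap exists, although it does not list every overlapping combination. In the records quoted in this issue, each segment starts after the previous one ends, so the check finds no overlaps.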

spoonbender76 commented 9 months ago

Thank you for your response. I'm still a bit puzzled and would appreciate further clarification. As I understand it (and I may be wrong), a transcript should have a single continuous 5' UTR, starting at the beginning of the transcript and ending just before the start codon, and similarly a single continuous 3' UTR, beginning right after the stop codon and extending to the end of the transcript. Does this output mean that transcript variants have different UTRs (I'm not sure whether they really exist or are an assembly artifact) and that all of these UTRs are added to the annotation? Or are these multiple 5' UTR features just parts of one large 5' UTR? Should I keep only one 5' UTR and one 3' UTR, or is it okay to leave the annotation as it is?

KatharinaHoff commented 9 months ago

In eukaryotes, UTRs can be spliced. This happens less frequently in the 3' UTR, but it does occur there as well.

This is not to say that all the stringtie assemblies and all the genes are correct. Everything in structural genome annotation may contain errors.
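
To make the "spliced UTR" reading concrete, the segments of one UTR type in one transcript can be treated as the exonic parts of a single UTR, with the gaps between them as introns. This is a minimal sketch under that assumption; `summarize_spliced_utr` is a hypothetical helper, not a BRAKER function, and the coordinates are the g39.t1 five_prime_UTR records from the example above.

```python
def summarize_spliced_utr(segments):
    """Treat the segments as exonic parts of one UTR; report count, length, and gaps."""
    ordered = sorted(segments)
    # GTF coordinates are 1-based and inclusive, hence the +1.
    exonic_length = sum(end - start + 1 for start, end in ordered)
    # The gap between two consecutive segments is the intron that splits the UTR.
    introns = [(left[1] + 1, right[0] - 1) for left, right in zip(ordered, ordered[1:])]
    return {"segments": len(ordered), "exonic_length": exonic_length, "introns": introns}

# 5' UTR segments of g39.t1 as listed in the issue.
g39_five_prime = [(1646899, 1646901), (1656810, 1656979), (1659327, 1659382)]
print(summarize_spliced_utr(g39_five_prime))
```

Read this way, the three five_prime_UTR lines of g39.t1 describe one 5' UTR of 229 bases interrupted by two introns, not three competing UTRs. Whether those introns are biologically real still depends on the quality of the StringTie assembly.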

ChuanzhengWei commented 9 months ago

I guess the issue arose because I used transcriptome data from different varieties of the same species (I didn't perform transcriptome sequencing on the material I sequenced). After read mapping, the transcript boundaries of the same gene may have appeared different. Of course, this is just speculation, and I haven't checked it in IGV.

KatharinaHoff commented 9 months ago

UTRs inferred from evidence often look different from the UTRs in a reference annotation and from evidence obtained in an independent experiment.

KatharinaHoff commented 9 months ago

I will close this issue because I believe there is nothing wrong with the software.