Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Internal UTR features were found after decorating UTR by stringtie2utr.py #867

Open xo2003 opened 1 month ago

xo2003 commented 1 month ago

Hi, since a UTR is part of an exon feature, I tried to extend the start coordinate of the first exon (for forward-strand transcripts, 5' -> 3') or the end coordinate of the first exon (for reverse-strand transcripts, 3' -> 5') of each transcript that has a UTR feature. After converting the GTF file to GFF3 and running the gff_QC tool from GFF3toolkit, I discovered that stringtie2utr.py had created internal UTR features, which caused the START position to be greater than the END position once the exon coordinates were adjusted. The image below, cropped from the GTF produced after adding UTRs, shows internal 5' UTR features. Internal 3' UTR features were also found. Approximately 100 transcripts are affected by this issue.

[images: GTF excerpts showing internal UTR features]
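For anyone who wants to check their own output, here is a minimal sketch of how such internal UTRs could be listed directly from the GTF. It is not part of BRAKER or GFF3toolkit. It assumes that UTR feature names contain "utr" (for example five_prime_UTR), that every CDS and UTR line carries a transcript_id attribute, and that "internal" means a UTR overlapping the span between the first and last CDS coordinate of its transcript. The function name and the toy lines are hypothetical.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def find_internal_utrs(gtf_lines):
    """Return {transcript_id: [(feature, start, end), ...]} for UTRs
    that overlap the CDS span of their own transcript."""
    cds = defaultdict(list)
    utrs = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9:
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match is None:
            continue
        tid = match.group(1)
        feature, start, end = cols[2], int(cols[3]), int(cols[4])
        if feature == "CDS":
            cds[tid].append((start, end))
        elif "utr" in feature.lower():  # assumption: UTR feature naming
            utrs[tid].append((feature, start, end))

    internal = {}
    for tid, utr_list in utrs.items():
        if tid not in cds:
            continue
        cds_min = min(s for s, _ in cds[tid])
        cds_max = max(e for _, e in cds[tid])
        # A terminal UTR lies entirely outside [cds_min, cds_max].
        hits = [u for u in utr_list if u[1] <= cds_max and u[2] >= cds_min]
        if hits:
            internal[tid] = hits
    return internal


def gtf_line(feature, start, end, tid):
    """Build a toy GTF line (hypothetical data, forward strand)."""
    attrs = f'transcript_id "{tid}"; gene_id "g1";'
    return "\t".join(["chr1", "toy", feature, str(start), str(end),
                      ".", "+", ".", attrs])


toy_gtf = [
    gtf_line("five_prime_UTR", 50, 99, "t1"),    # terminal, fine
    gtf_line("CDS", 100, 200, "t1"),
    gtf_line("five_prime_UTR", 250, 280, "t1"),  # sits between two CDS
    gtf_line("CDS", 300, 400, "t1"),
    gtf_line("CDS", 100, 200, "t2"),
    gtf_line("three_prime_UTR", 201, 300, "t2"),  # terminal, fine
]
print(find_internal_utrs(toy_gtf))
```

On a real file, something like `find_internal_utrs(open("with_utrs.gtf"))` (file name hypothetical) would give the list of affected transcripts to inspect before any exon coordinates are adjusted.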

Additionally, some exon features from StringTie remain in the GTF. [images: GTF excerpts showing the remaining StringTie exon features]

I applied stringtie2utr.py to two of our genomes, and both ended up with internal UTRs. The StringTie GFF used to decorate UTRs is ${BRAKER3_OUT}/GeneMark-ETP/rnaseq/stringtie/transcripts_merged.gff.

Here is the RNA-seq library information for the two genomes:
[Genome1] 7 RNA-seq libraries: 3 unstranded and 4 reverse-stranded.
[Genome2] 49 RNA-seq libraries: all reverse-stranded.

I would appreciate any suggestions to resolve these internal UTR issues. Thank you!

Victaphanta commented 3 weeks ago

Same here.

xo2003 commented 2 days ago

Decorating UTRs via --addUTR=on might be a solution for this issue.

However, even though I successfully ran test7.sh (#831) with the toy data, I still encountered an error with real data. Since setting --addUTR=on is equivalent to running GUSHR directly (#506), I added UTRs using GUSHR with Java 8.
Note that gushr.py needs to be fixed according to https://github.com/Gaius-Augustus/GUSHR/issues/5.

The exon lines of gushr.gtf can be restored with rename_gtf.py from TSEBRA (exon coordinates equal to the CDS) or with gtf2gff.pl from AUGUSTUS (exon coordinates including the UTR). Obtained this way, the UTR features seem more reliable, but they still need to be examined carefully.
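As a cross-check on restored exon lines, the idea of "exon coordinates including the UTR" can be sketched as merging the CDS and UTR intervals of one transcript wherever they overlap or abut. This is only an illustration of the idea under that assumption, not a description of what gtf2gff.pl or rename_gtf.py do internally. The function name and coordinates are hypothetical.

```python
def merge_into_exons(segments):
    """Merge 1-based, inclusive (start, end) intervals that overlap or
    directly abut into exon intervals, sorted by start coordinate."""
    exons = []
    for start, end in sorted(segments):
        if exons and start <= exons[-1][1] + 1:
            # Overlapping or adjacent: extend the current exon.
            exons[-1][1] = max(exons[-1][1], end)
        else:
            exons.append([start, end])
    return [tuple(exon) for exon in exons]


# Toy transcript: 5' UTR 50-99, CDS 100-200 and 300-400, 3' UTR 401-450.
segments = [(100, 200), (50, 99), (300, 400), (401, 450)]
print(merge_into_exons(segments))
```

If the GTF stores the stop codon as a separate feature outside the CDS, its interval would need to be included in `segments` as well, otherwise the last CDS and the 3' UTR do not abut. A transcript whose merged exon count differs from its number of CDS segments plus any terminal UTR-only exons is worth a manual look, since an internal UTR would show up as an extra exon.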