Closed nhartwic closed 1 year ago
Hi,
TSEBRA considers all transcripts to be in one gene if they have overlapping coding regions in the same frame. I think this might be the problem here, as these transcripts are in different frames. I added a new option to TSEBRA (--ignore_tx_phase) to address this. With this option, TSEBRA ignores the frame of transcripts, and in your case it should place all transcript isoforms in one gene model.
Best, Lars
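To make the grouping rule described above concrete, here is a minimal Python sketch. It is not TSEBRA's actual code: it assumes that "same frame" is decided by comparing the phase column of overlapping CDS records, and the names `cds_overlap`, `cluster`, and the `ignore_phase` parameter are hypothetical. The coordinates and phases are copied from the GTF excerpt in this thread.

```python
from itertools import combinations

# (start, end, phase) of each CDS record, copied from the GTF excerpt in this thread.
TRANSCRIPTS = {
    "g33709.t6": [(64993239, 64994892, 1), (64995241, 64995328, 2), (64996559, 64996898, 0)],
    "g33709.t4": [(64993239, 64994333, 0)],
    "g33709.t7": [(64993239, 64994933, 0)],
}

def cds_overlap(a, b, ignore_phase=False):
    """True if any CDS of a overlaps any CDS of b (and, by default, has the same phase).

    Assumption: the frame test is a plain comparison of the GTF phase column.
    """
    for s1, e1, p1 in a:
        for s2, e2, p2 in b:
            if s1 <= e2 and s2 <= e1 and (ignore_phase or p1 == p2):
                return True
    return False

def cluster(transcripts, ignore_phase=False):
    """Group transcripts into genes by linking every overlapping pair (union-find)."""
    parent = {t: t for t in transcripts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(transcripts, 2):
        if cds_overlap(transcripts[a], transcripts[b], ignore_phase):
            parent[find(a)] = find(b)

    groups = {}
    for t in transcripts:
        groups.setdefault(find(t), []).append(t)
    return sorted(sorted(g) for g in groups.values())

print(cluster(TRANSCRIPTS))                     # phase-aware grouping
print(cluster(TRANSCRIPTS, ignore_phase=True))  # phase ignored
```

Under these assumptions, t6 ends up in its own gene in the phase-aware run because its only CDS that overlaps the others carries phase 1 while theirs carry phase 0. Ignoring the phase puts all three transcripts into a single group.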
I'm currently experimenting with tsebra and have noticed a strange output. Basically, I ran braker with a protein database and braker with rnaseq and then ran tsebra. Output mostly looks good. There are a lot of TE-related genes (or at least genes/CDS with significant overlap with my softmask). But there also seems to be a bug.
Basically, I have two gene models in the same strand with the coordinates essentially contained within the other. Images below...
- top track is my tsebra output after converting to gff3
- second track is the raw output from tsebra in native gtf format
- third track is braker with rnaseq
- last track is braker with proteins
The thin strands for the gff3 represent "gene" features. For gtf, I need to hover to see gene ids but the tsebra gtf and gff3 are consistent. Relevant portions of gtfs below...
# tsebra
chr_8h AUGUSTUS gene 64993239 64996898 . - . g_9280
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS CDS 64993239 64994892 0.77 - 1 transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS exon 64993239 64994892 . - . transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS intron 64994893 64995240 0.77 - . transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS CDS 64995241 64995328 0.95 - 2 transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS exon 64995241 64995328 . - . transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS intron 64995329 64996558 0.94 - . transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS CDS 64996559 64996898 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS exon 64996559 64996898 . - . transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS start_codon 64996896 64996898 . - 0 transcript_id "anno1.file_1_file_1_g33709.t6"; gene_id "g_9280";
chr_8h AUGUSTUS gene 64993239 64994933 . - . g_2084
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t4"; gene_id "g_2084";
chr_8h AUGUSTUS CDS 64993239 64994333 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t4"; gene_id "g_2084";
chr_8h AUGUSTUS exon 64993239 64994333 . - . transcript_id "anno1.file_1_file_1_g33709.t4"; gene_id "g_2084";
chr_8h AUGUSTUS start_codon 64994331 64994333 . - 0 transcript_id "anno1.file_1_file_1_g33709.t4"; gene_id "g_2084";
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t3"; gene_id "g_2084";
chr_8h AUGUSTUS CDS 64993239 64993535 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t3"; gene_id "g_2084";
chr_8h AUGUSTUS exon 64993239 64993535 . - . transcript_id "anno1.file_1_file_1_g33709.t3"; gene_id "g_2084";
chr_8h AUGUSTUS start_codon 64993533 64993535 . - 0 transcript_id "anno1.file_1_file_1_g33709.t3"; gene_id "g_2084";
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t2"; gene_id "g_2084";
chr_8h AUGUSTUS CDS 64993239 64993952 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t2"; gene_id "g_2084";
chr_8h AUGUSTUS exon 64993239 64993952 . - . transcript_id "anno1.file_1_file_1_g33709.t2"; gene_id "g_2084";
chr_8h AUGUSTUS start_codon 64993950 64993952 . - 0 transcript_id "anno1.file_1_file_1_g33709.t2"; gene_id "g_2084";
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t1"; gene_id "g_2084";
chr_8h AUGUSTUS CDS 64993239 64994132 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t1"; gene_id "g_2084";
chr_8h AUGUSTUS exon 64993239 64994132 . - . transcript_id "anno1.file_1_file_1_g33709.t1"; gene_id "g_2084";
chr_8h AUGUSTUS start_codon 64994130 64994132 . - 0 transcript_id "anno1.file_1_file_1_g33709.t1"; gene_id "g_2084";
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t5"; gene_id "g_2084";
chr_8h AUGUSTUS CDS 64993239 64994615 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t5"; gene_id "g_2084";
chr_8h AUGUSTUS exon 64993239 64994615 . - . transcript_id "anno1.file_1_file_1_g33709.t5"; gene_id "g_2084";
chr_8h AUGUSTUS start_codon 64994613 64994615 . - 0 transcript_id "anno1.file_1_file_1_g33709.t5"; gene_id "g_2084";
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "anno1.file_1_file_1_g33709.t7"; gene_id "g_2084";
chr_8h AUGUSTUS CDS 64993239 64994933 0.84 - 0 transcript_id "anno1.file_1_file_1_g33709.t7"; gene_id "g_2084";
chr_8h AUGUSTUS exon 64993239 64994933 . - . transcript_id "anno1.file_1_file_1_g33709.t7"; gene_id "g_2084";
chr_8h AUGUSTUS start_codon 64994931 64994933 . - 0 transcript_id "anno1.file_1_file_1_g33709.t7"; gene_id "g_2084";

# braker with proteins
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t4"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64994333 1 - 0 transcript_id "file_1_file_1_g33709.t4"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64994333 . - . transcript_id "file_1_file_1_g33709.t4"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64994331 64994333 . - 0 transcript_id "file_1_file_1_g33709.t4"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS transcript 64993239 64994333 1 - . g33709.t4
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t3"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64993535 1 - 0 transcript_id "file_1_file_1_g33709.t3"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64993535 . - . transcript_id "file_1_file_1_g33709.t3"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64993533 64993535 . - 0 transcript_id "file_1_file_1_g33709.t3"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS transcript 64993239 64993535 1 - . g33709.t3
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64994892 0.77 - 1 transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64994892 . - . transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS intron 64994893 64995240 0.77 - . transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64995241 64995328 0.95 - 2 transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64995241 64995328 . - . transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS intron 64995329 64996558 0.94 - . transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64996559 64996898 1 - 0 transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS gene 64993239 64996898 6.6 - . g33709
chr_8h AUGUSTUS transcript 64993239 64996898 0.76 - . g33709.t6
chr_8h AUGUSTUS exon 64996559 64996898 . - . transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64996896 64996898 . - 0 transcript_id "file_1_file_1_g33709.t6"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t1"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64994132 1 - 0 transcript_id "file_1_file_1_g33709.t1"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64994132 . - . transcript_id "file_1_file_1_g33709.t1"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64994130 64994132 . - 0 transcript_id "file_1_file_1_g33709.t1"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS transcript 64993239 64994132 1 - . g33709.t1
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t5"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64994615 1 - 0 transcript_id "file_1_file_1_g33709.t5"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64994615 . - . transcript_id "file_1_file_1_g33709.t5"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64994613 64994615 . - 0 transcript_id "file_1_file_1_g33709.t5"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS transcript 64993239 64994615 1 - . g33709.t5
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t7"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64994933 0.84 - 0 transcript_id "file_1_file_1_g33709.t7"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64994933 . - . transcript_id "file_1_file_1_g33709.t7"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64994931 64994933 . - 0 transcript_id "file_1_file_1_g33709.t7"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS transcript 64993239 64994933 0.84 - . g33709.t7
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g33709.t2"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS CDS 64993239 64993952 1 - 0 transcript_id "file_1_file_1_g33709.t2"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS exon 64993239 64993952 . - . transcript_id "file_1_file_1_g33709.t2"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS start_codon 64993950 64993952 . - 0 transcript_id "file_1_file_1_g33709.t2"; gene_id "file_1_file_1_g33709";
chr_8h AUGUSTUS transcript 64993239 64993952 1 - . g33709.t2

# braker with rnaseq
chr_8h AUGUSTUS stop_codon 64993239 64993241 . - 0 transcript_id "file_1_file_1_g34062.t1"; gene_id "file_1_file_1_g34062";
chr_8h AUGUSTUS CDS 64993239 64994933 0.65 - 0 transcript_id "file_1_file_1_g34062.t1"; gene_id "file_1_file_1_g34062";
chr_8h AUGUSTUS exon 64993239 64994933 . - . transcript_id "file_1_file_1_g34062.t1"; gene_id "file_1_file_1_g34062";
chr_8h AUGUSTUS start_codon 64994931 64994933 . - 0 transcript_id "file_1_file_1_g34062.t1"; gene_id "file_1_file_1_g34062";
chr_8h AUGUSTUS gene 64993239 64994933 0.65 - . g34062
chr_8h AUGUSTUS transcript 64993239 64994933 0.65 - . g34062.t1
In terms of software, I used the current braker package from conda after manually installing genemark. And I'm using the latest (as of a couple days ago anyway) version of tsebra-main.
I'm writing a script to fix this in the pipeline I'm writing, but I figured it was worth reporting the bug here too.
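For anyone who needs a similar workaround, a post-processing step of this kind could look roughly like the sketch below. It is not the script mentioned above, just one possible approach: it assumes that any two genes on the same sequence and strand with overlapping spans should share a gene ID, and the function name `merge_overlapping_genes` is hypothetical. The two gene spans are taken from the tsebra output in this thread.

```python
def merge_overlapping_genes(genes):
    """Map each gene ID to the ID of the overlapping same-strand gene that sorts first.

    genes: iterable of (gene_id, seqname, strand, start, end) tuples.
    Returns a dict {old_gene_id: kept_gene_id}.
    """
    mapping = {}
    current = None  # (kept_id, seqname, strand, furthest_end) of the open cluster
    # Sort by sequence, strand, then coordinates so overlapping genes are adjacent.
    for gene_id, seq, strand, start, end in sorted(genes, key=lambda g: (g[1], g[2], g[3], g[4])):
        if current and current[1] == seq and current[2] == strand and start <= current[3]:
            # Overlaps the open cluster: reuse its ID and extend its end if needed.
            mapping[gene_id] = current[0]
            current = (current[0], seq, strand, max(current[3], end))
        else:
            # No overlap: this gene starts a new cluster.
            mapping[gene_id] = gene_id
            current = (gene_id, seq, strand, end)
    return mapping

# Gene spans from the tsebra GTF excerpt in this thread.
GENES = [
    ("g_9280", "chr_8h", "-", 64993239, 64996898),
    ("g_2084", "chr_8h", "-", 64993239, 64994933),
]
print(merge_overlapping_genes(GENES))
```

The resulting mapping can then be used to rewrite the gene_id attribute of each record. Note that merging on span overlap alone would also fuse genuinely nested genes, so it is more aggressive than a CDS-based rule.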
Curious: why are the coordinates of the CDS and exon features the same?
I don't believe tsebra predicts UTR (or at least it doesn't do it by default) so exons and CDS should have the same coordinates. What were you expecting?
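If you want to confirm this for your own output, a quick check along these lines lists the transcripts whose exon and CDS coordinates differ. It is only a sketch: the function name `exon_cds_mismatches` is hypothetical, the first two example records come from the excerpt in this thread, and the `tx_utr` records are made up to show what a transcript with a UTR would look like.

```python
import re
from collections import defaultdict

def exon_cds_mismatches(gtf_lines):
    """Return sorted transcript IDs whose exon intervals differ from their CDS intervals."""
    coords = defaultdict(lambda: {"exon": set(), "CDS": set()})
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        # Split the 8 fixed columns on any whitespace; the attribute column stays whole.
        fields = line.split(None, 8)
        if len(fields) < 9 or fields[2] not in ("exon", "CDS"):
            continue
        match = re.search(r'transcript_id "([^"]+)"', fields[8])
        if match:
            coords[match.group(1)][fields[2]].add((int(fields[3]), int(fields[4])))
    return sorted(tx for tx, c in coords.items() if c["exon"] != c["CDS"])

EXAMPLE = [
    # Records from the tsebra excerpt in this thread (exon == CDS).
    'chr_8h AUGUSTUS CDS 64993239 64994333 1 - 0 transcript_id "anno1.file_1_file_1_g33709.t4"; gene_id "g_2084";',
    'chr_8h AUGUSTUS exon 64993239 64994333 . - . transcript_id "anno1.file_1_file_1_g33709.t4"; gene_id "g_2084";',
    # Hypothetical records: the exon extends past the CDS, as it would with a UTR.
    'chr_x EXAMPLE CDS 150 300 . + 0 transcript_id "tx_utr"; gene_id "g_example";',
    'chr_x EXAMPLE exon 100 300 . + . transcript_id "tx_utr"; gene_id "g_example";',
]
print(exon_cds_mismatches(EXAMPLE))
```

An empty result on a real file would mean every transcript has identical exon and CDS intervals, which is what you would expect when no UTRs are predicted.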