Closed adamfreedman closed 2 years ago
Hi, sorry for the late reply. This error message usually indicates that a single transcript line with a unique transcript ID is included in your input files. For example in your output for "g7750.t1":
DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
These lines belong to a transcript with ID "file_1_file_1_g7750.t1", this transcript is used by TSEBRA.
DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1
This transcript line has the ID "g7750.t1" which is different from the other lines. TSEBRA doesn't find any CDS for this ID in the input file and prints a message that this line isn't used.
The fact that your TSEBRA output includes only includes 5615 transcripts seems a bit strange. The default configuration might be too strict for your case. I added a second configuration file config/pref_braker1.cfg
which you might want to try.
Best, Lars
Hi Lars,
So ... I am a little bit confused here. In the above referenced gtf, produced by a single run of braker, a) transcript file_1_file_1_g7750.t1 has no "transcript" level feature, b) the coordinates for transcript g7750.t1 --which you are saying has no CDS or Exon features--looks very much like the transcript entry for transcript file_1_file_1_g7750.t1 , as the genomic coordinates match exactly, and c) there is a transcript g7750.t1 with no other associated CDS or Exon features.
From the outside, it looks more like g7750.t1 and file_1_file_1_g7750.t1 are all part of the same transcript model output by Braker but that Braker is doing something weird with naming conventions that doesn't match the gtf standard. And, as a result, Tsebra flags g7750.t1 as having no features. Btw shouldn't Tsebra also skip file_1_file_1_g7750.t1 because it has no Transcript features?
If this isn't a Braker naming convention problem, then the implication is that Braker can produce two different transcripts for the exact same genomic interval, one with no CDS or Exon features, and the other with no Transcript feature. Besides being weird and downright silly, that creates all sorts of headaches for downstream parsing and analysis.
You are correct in that it looks like that the transcript line from g7750.t1 and file_1_file_1_g7750.t1 belong to the same transcript model in Braker. This ID convention problem was made by Braker and I have also encountered it before.
Tsebra uses the transcript IDs in the last column to collect all lines associated with a transcript. During the algorithm, two different transcripts (one for g7750.t1 and one for file_1_file_1_g7750.t1) are created in this example. The program notices that there are no CDS or exon features with transcript ID g7750.t1. Therefore, it deletes this 'empty' transcript and prints the error message. Tsebra doesn't require a transcript feature line to construct a transcript correctly and therefore it doesn't complain about transcript file_1_file_1_g7750.t1 and uses it correctly (it shouldn't be skipped).
Regarding the issue with the Braker IDs, I've previously written a small script to fix the ID errors. You can find it at bin/fix_gtf_ids.py
and in the README.md is an example of how to use it. Applying bin/fix_gtf_ids.py
before using Tsebra should let you get rid of the error messages. However, the redundant transcript lines don't affect the result of Tsebra and the result should be the same.
supporting my view that there is naming convention problem when the braker.gtf gets written, I looked at the gtf files produce in the braker output file, the augustus.hints.gtf and the final braker.gtf
grep g7750.t1 augustus.hints.gtf DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1 DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "g7750.t1"; gene_id "g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "g7750.t1"; gene_id "g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "g7750.t1"; gene_id "g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "g7750.t1"; gene_id "g7750";
AND
[afreedman@bioinf01 braker]$ grep g7750.t1 braker.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1
NOTE that the transcript model entries are IDENTICAL, except that "file_1_file1" has been added to the transcript model in the final braker.gtf output
I just remembered that I saw a Braker issue about this topic and looked it up: https://github.com/Gaius-Augustus/BRAKER/issues/354 In the newest version of Braker the issue might be fixed, but I haven't tested it myself.
For Tsebra, you can either ignore the transcript ID and gene ID issue or use bin/fix_gtf_ids.py
as mentioned above.
Thanks for pointing out the braker post, and reposting the issue here on the braker page. I think it is also worth pointing out that the comments posted by Dr. Hoff wrt the issue about transcripts with stop codons but no CDS aren't directly relevant here (although they possibly contribute to the problem?) because my example is a case where there are start and stop codon, exon AND CDS features. I have a strong suspicion that we are working with the most current version of braker , as we have been actively working on getting it to work on our cluster, but I will also look at the py script and hope that solves the problem
yeah, it looks like one of our pull requests was accepted a little over a week ago, and previous commits predate when we would have pulled. so the issue remains it seems ... at least before one tries to run the "fix it" python script
btw ... absent a current fix from within braker, perhaps bin/fix_gtf_ids.py should get added to the braker distribution rather than have people need tsebra installed to run it
one more thing. I just ran fix_gtf_ids.py. Besides updating the exon, CDS, start_codon and stop_codon features, it discards the transcript feature.
dple_braker_proteinonly.gtf [afreedman@bioinf01 dple]$ grep g7750.t1 idsfixed_dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; [afreedman@bioinf01 dple]$ grep g7750.t1 dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1
In fact, the py script discards all transcript level features. I think this is problematic and needs to be addressed.
Thank you for your comments on this topic. The script fix_gtf_ids.py is very basic and I wrote it specifically for Tsebra. I think a script included in the Braker repository should address all format and ID issues.
one more thing. I just ran fix_gtf_ids.py. Besides updating the exon, CDS, start_codon and stop_codon features, it discards the transcript feature.
dple_braker_proteinonly.gtf [afreedman@bioinf01 dple]$ grep g7750.t1 idsfixed_dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; [afreedman@bioinf01 dple]$ grep g7750.t1 dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1
In fact, the py script discards all transcript level features. I think this is problematic and needs to be addressed.
Hi , how did you deal with the issue of incompatible ids? I met this problem when using the output gtf files of braker which could not correctly transformed to gff3 as the input of maker. They lost gene and mRNA features. So I simply used $sed 's/file_1_file1//g' -i braker*.gtf and manually ran the gtf2gff.pl scripts in augustus to change format. The output gff3 files seem to work without annoying errors but I can not determine whether it is really correct or not.
Hi, we added the script bin/rename_gtf.py
to the TSEBRA repository to rename the IDs so that they match. (See TSEBRA documentation on how to use it).
Best,
Lars
Hi, we added the script
bin/rename_gtf.py
to the TSEBRA repository to rename the IDs so that they match. (See TSEBRA documentation on how to use it). Best, Lars
Thanks for your reply and I would like to tell that I downloaded the tsebra.tar.gz from the repository https://github.com/Gaius-Augustus/TSEBRA/archive/refs/tags/v1.0.2.tar.gz which do not contain the new python script, and then I obtained it by git clone. Maybe you can spend some time to do the favor updating the package so that it will not confuse subsequent users~ Thank you very much for your software!
Thanks for reminding me, I published a new version. Best, Lars
By the way , I have another question , how can I add braker-generated UTR features into TSEBRA out ?
See issue #10 for this question.
I close this issue since all questions have been answered.
I ran two separate braker jobs: 1 where only protein evidence was provided, the other with rna-seq (supplying an rna-seq bam file as well as intron hints, and also running with --UTR on). I was surprised at the relatively small number of transcripts in the tsebara output relative to the two braker runs 28513 for braker with protein, 5615 for tsebra.
I looked at the stderr log file and there were 25443 transcripts that were skipped. e.g. "Skipping transcript g7750.t1, no CDS nor exons in g7750.t1" . Ignoring the issue of not knowing where the name comes from--because the naming of the two separate braker runs is independent and therefore not mutually exclusive--it is clear that, in the original braker runs, a transcript with this label has both exons and CDS. E.g.
There are far more lines (122) in the "with rna-seq" braker run that correspond to this transcript id, with exactly 40 CDS intervals. This extreme "filtering" does not seem to have anything to do with the removal of transcripts based upon the scoring scheme described in the biorXiv paper, because it takes place before the extrinsic evidence is read in. The first few lines in the stderr log are:
READING GENE PREDICTION: [dple_braker_rnaseq.gtf]
READING GENE PREDICTION: [dple_braker_proteinonly.gtf]
Skipping transcript g24461.t1, no CDS nor exons in g24461.t1 Skipping transcript g14264.t1, no CDS nor exons in g14264.t1
this actually seems to suggest that the skipping is for transcripts in the second of the two gtf files listed?
the last few lines are:
Skipping transcript g13846.t1, no CDS nor exons in g13846.t1 Skipping transcript g4156.t1, no CDS nor exons in g4156.t1 Skipping transcript g3211.t1, no CDS nor exons in g3211.t1 Skipping transcript g21356.t1, no CDS nor exons in g21356.t1
READING EXTRINSIC EVIDENCE: [filtered_namesorted_dple_hisat-star_merged.hints.gff]
BUILD OVERLAP GRAPH
ADD FEATURES TO TRANSCRIPTS
SELECT TRANSCRIPTS
WRITE COMBINED GENE PREDICTION
FINISHED
...which I why I conclude that transcript skipping has nothing to do with the scores-based removal
Can you explain this behavior, because it certainly seems like a bug, or at bare minimum sub-optimal behavior.