Gaius-Augustus / TSEBRA

TSEBRA: Transcript Selector for BRAKER
47 stars 5 forks source link

tsebra says skipping transcripts with missing CDS and exons, but which have both in the input gtf files #2

Closed adamfreedman closed 2 years ago

adamfreedman commented 3 years ago

I ran two separate braker jobs: 1 where only protein evidence was provided, the other with rna-seq (supplying an rna-seq bam file as well as intron hints, and also running with --UTR on). I was surprised at the relatively small number of transcripts in the tsebara output relative to the two braker runs 28513 for braker with protein, 5615 for tsebra.

I looked at the stderr log file and there were 25443 transcripts that were skipped. e.g. "Skipping transcript g7750.t1, no CDS nor exons in g7750.t1" . Ignoring the issue of not knowing where the name comes from--because the naming of the two separate braker runs is independent and therefore not mutually exclusive--it is clear that, in the original braker runs, a transcript with this label has both exons and CDS. E.g.

grep g7750.t1 dple_braker_proteinonly.gtf 
DPSCF300050 AUGUSTUS    stop_codon  44694   44696   .   -   0   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    CDS 44694   44897   1   -   0   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    exon    44694   44897   .   -   .   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    start_codon 44895   44897   .   -   0   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    transcript  44694   44897   1   -   .   g7750.t1

There are far more lines (122) in the "with rna-seq" braker run that correspond to this transcript id, with exactly 40 CDS intervals. This extreme "filtering" does not seem to have anything to do with the removal of transcripts based upon the scoring scheme described in the biorXiv paper, because it takes place before the extrinsic evidence is read in. The first few lines in the stderr log are:

READING GENE PREDICTION: [dple_braker_rnaseq.gtf]

READING GENE PREDICTION: [dple_braker_proteinonly.gtf]

Skipping transcript g24461.t1, no CDS nor exons in g24461.t1 Skipping transcript g14264.t1, no CDS nor exons in g14264.t1

this actually seems to suggest that the skipping is for transcripts in the second of the two gtf files listed?

the last few lines are:

Skipping transcript g13846.t1, no CDS nor exons in g13846.t1 Skipping transcript g4156.t1, no CDS nor exons in g4156.t1 Skipping transcript g3211.t1, no CDS nor exons in g3211.t1 Skipping transcript g21356.t1, no CDS nor exons in g21356.t1

READING EXTRINSIC EVIDENCE: [filtered_namesorted_dple_hisat-star_merged.hints.gff]

BUILD OVERLAP GRAPH

ADD FEATURES TO TRANSCRIPTS

SELECT TRANSCRIPTS

WRITE COMBINED GENE PREDICTION

FINISHED

...which I why I conclude that transcript skipping has nothing to do with the scores-based removal

Can you explain this behavior, because it certainly seems like a bug, or at bare minimum sub-optimal behavior.

LarsGab commented 3 years ago

Hi, sorry for the late reply. This error message usually indicates that a single transcript line with a unique transcript ID is included in your input files. For example in your output for "g7750.t1":

DPSCF300050 AUGUSTUS    stop_codon  44694   44696   .   -   0   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    CDS 44694   44897   1   -   0   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    exon    44694   44897   .   -   .   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";
DPSCF300050 AUGUSTUS    start_codon 44895   44897   .   -   0   transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750";

These lines belong to a transcript with ID "file_1_file_1_g7750.t1", this transcript is used by TSEBRA.

DPSCF300050 AUGUSTUS    transcript  44694   44897   1   -   .   g7750.t1

This transcript line has the ID "g7750.t1" which is different from the other lines. TSEBRA doesn't find any CDS for this ID in the input file and prints a message that this line isn't used.

The fact that your TSEBRA output includes only includes 5615 transcripts seems a bit strange. The default configuration might be too strict for your case. I added a second configuration file config/pref_braker1.cfg which you might want to try.

Best, Lars

adamfreedman commented 3 years ago

Hi Lars,

So ... I am a little bit confused here. In the above referenced gtf, produced by a single run of braker, a) transcript file_1_file_1_g7750.t1 has no "transcript" level feature, b) the coordinates for transcript g7750.t1 --which you are saying has no CDS or Exon features--looks very much like the transcript entry for transcript file_1_file_1_g7750.t1 , as the genomic coordinates match exactly, and c) there is a transcript g7750.t1 with no other associated CDS or Exon features.

From the outside, it looks more like g7750.t1 and file_1_file_1_g7750.t1 are all part of the same transcript model output by Braker but that Braker is doing something weird with naming conventions that doesn't match the gtf standard. And, as a result, Tsebra flags g7750.t1 as having no features. Btw shouldn't Tsebra also skip file_1_file_1_g7750.t1 because it has no Transcript features?

If this isn't a Braker naming convention problem, then the implication is that Braker can produce two different transcripts for the exact same genomic interval, one with no CDS or Exon features, and the other with no Transcript feature. Besides being weird and downright silly, that creates all sorts of headaches for downstream parsing and analysis.

LarsGab commented 3 years ago

You are correct in that it looks like that the transcript line from g7750.t1 and file_1_file_1_g7750.t1 belong to the same transcript model in Braker. This ID convention problem was made by Braker and I have also encountered it before.

Tsebra uses the transcript IDs in the last column to collect all lines associated with a transcript. During the algorithm, two different transcripts (one for g7750.t1 and one for file_1_file_1_g7750.t1) are created in this example. The program notices that there are no CDS or exon features with transcript ID g7750.t1. Therefore, it deletes this 'empty' transcript and prints the error message. Tsebra doesn't require a transcript feature line to construct a transcript correctly and therefore it doesn't complain about transcript file_1_file_1_g7750.t1 and uses it correctly (it shouldn't be skipped).

Regarding the issue with the Braker IDs, I've previously written a small script to fix the ID errors. You can find it at bin/fix_gtf_ids.py and in the README.md is an example of how to use it. Applying bin/fix_gtf_ids.py before using Tsebra should let you get rid of the error messages. However, the redundant transcript lines don't affect the result of Tsebra and the result should be the same.

adamfreedman commented 3 years ago

supporting my view that there is naming convention problem when the braker.gtf gets written, I looked at the gtf files produce in the braker output file, the augustus.hints.gtf and the final braker.gtf

grep g7750.t1 augustus.hints.gtf DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1 DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "g7750.t1"; gene_id "g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "g7750.t1"; gene_id "g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "g7750.t1"; gene_id "g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "g7750.t1"; gene_id "g7750";

AND

[afreedman@bioinf01 braker]$ grep g7750.t1 braker.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1

NOTE that the transcript model entries are IDENTICAL, except that "file_1_file1" has been added to the transcript model in the final braker.gtf output

LarsGab commented 3 years ago

I just remembered that I saw a Braker issue about this topic and looked it up: https://github.com/Gaius-Augustus/BRAKER/issues/354 In the newest version of Braker the issue might be fixed, but I haven't tested it myself.

For Tsebra, you can either ignore the transcript ID and gene ID issue or use bin/fix_gtf_ids.py as mentioned above.

adamfreedman commented 3 years ago

Thanks for pointing out the braker post, and reposting the issue here on the braker page. I think it is also worth pointing out that the comments posted by Dr. Hoff wrt the issue about transcripts with stop codons but no CDS aren't directly relevant here (although they possibly contribute to the problem?) because my example is a case where there are start and stop codon, exon AND CDS features. I have a strong suspicion that we are working with the most current version of braker , as we have been actively working on getting it to work on our cluster, but I will also look at the py script and hope that solves the problem

adamfreedman commented 3 years ago

yeah, it looks like one of our pull requests was accepted a little over a week ago, and previous commits predate when we would have pulled. so the issue remains it seems ... at least before one tries to run the "fix it" python script

adamfreedman commented 3 years ago

btw ... absent a current fix from within braker, perhaps bin/fix_gtf_ids.py should get added to the braker distribution rather than have people need tsebra installed to run it

adamfreedman commented 3 years ago

one more thing. I just ran fix_gtf_ids.py. Besides updating the exon, CDS, start_codon and stop_codon features, it discards the transcript feature.

dple_braker_proteinonly.gtf [afreedman@bioinf01 dple]$ grep g7750.t1 idsfixed_dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; [afreedman@bioinf01 dple]$ grep g7750.t1 dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1

In fact, the py script discards all transcript level features. I think this is problematic and needs to be addressed.

LarsGab commented 3 years ago

Thank you for your comments on this topic. The script fix_gtf_ids.py is very basic and I wrote it specifically for Tsebra. I think a script included in the Braker repository should address all format and ID issues.

tinyfallen commented 3 years ago

one more thing. I just ran fix_gtf_ids.py. Besides updating the exon, CDS, start_codon and stop_codon features, it discards the transcript feature.

dple_braker_proteinonly.gtf [afreedman@bioinf01 dple]$ grep g7750.t1 idsfixed_dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "DPSCF300050-_file_1_file_1_g7750.t1"; gene_id "DPSCF300050-_file_1_file_1_g7750"; [afreedman@bioinf01 dple]$ grep g7750.t1 dple_braker_proteinonly.gtf DPSCF300050 AUGUSTUS stop_codon 44694 44696 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS CDS 44694 44897 1 - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS exon 44694 44897 . - . transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS start_codon 44895 44897 . - 0 transcript_id "file_1_file_1_g7750.t1"; gene_id "file_1_file_1_g7750"; DPSCF300050 AUGUSTUS transcript 44694 44897 1 - . g7750.t1

In fact, the py script discards all transcript level features. I think this is problematic and needs to be addressed.

Hi , how did you deal with the issue of incompatible ids? I met this problem when using the output gtf files of braker which could not correctly transformed to gff3 as the input of maker. They lost gene and mRNA features. So I simply used $sed 's/file_1_file1//g' -i braker*.gtf and manually ran the gtf2gff.pl scripts in augustus to change format. The output gff3 files seem to work without annoying errors but I can not determine whether it is really correct or not.

LarsGab commented 3 years ago

Hi, we added the script bin/rename_gtf.py to the TSEBRA repository to rename the IDs so that they match. (See TSEBRA documentation on how to use it). Best, Lars

tinyfallen commented 3 years ago

Hi, we added the script bin/rename_gtf.py to the TSEBRA repository to rename the IDs so that they match. (See TSEBRA documentation on how to use it). Best, Lars

Thanks for your reply and I would like to tell that I downloaded the tsebra.tar.gz from the repository https://github.com/Gaius-Augustus/TSEBRA/archive/refs/tags/v1.0.2.tar.gz which do not contain the new python script, and then I obtained it by git clone. Maybe you can spend some time to do the favor updating the package so that it will not confuse subsequent users~ Thank you very much for your software!

LarsGab commented 3 years ago

Thanks for reminding me, I published a new version. Best, Lars

tinyfallen commented 3 years ago

By the way , I have another question , how can I add braker-generated UTR features into TSEBRA out ?

LarsGab commented 3 years ago

See issue #10 for this question.

LarsGab commented 2 years ago

I close this issue since all questions have been answered.