Closed SchwarzEM closed 1 year ago
Hi,
You can do this with the --keep_gtf option of TSEBRA. Here you can specify gene sets in the same way as with the --gtf option, with the difference that all transcripts of these gene sets are kept in the result. If you want the union of two gene sets, specify them with the --keep_gtf option instead of the --gtf option.
Best, Lars
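For readers who want to see what such a call might look like, here is a minimal Python sketch that only assembles the command line and prints it; it does not run TSEBRA. The -g and -k flags are the ones discussed in this thread. The -e (hint files), -c (config) and -o (output) flags are written from memory of the TSEBRA README, so treat them as assumptions and check them against tsebra.py --help. The helper name build_tsebra_command and all file names are hypothetical.

```python
import shlex


def build_tsebra_command(keep_gtfs, hintfiles, out, gtfs=(), cfg=None):
    """Assemble an argv list for tsebra.py without executing anything.

    keep_gtfs: gene sets whose transcripts should all be kept (-k).
    gtfs:      gene sets that TSEBRA is allowed to filter (-g).
    The -e, -c and -o flags are assumptions; verify with tsebra.py --help.
    """
    if not keep_gtfs and not gtfs:
        raise ValueError("at least one gene set is required")
    cmd = ["tsebra.py"]
    if gtfs:
        # Filtered gene sets, given as one comma-separated list
        cmd += ["-g", ",".join(gtfs)]
    if keep_gtfs:
        # Gene sets kept in full, given the same way
        cmd += ["-k", ",".join(keep_gtfs)]
    cmd += ["-e", ",".join(hintfiles)]
    if cfg is not None:
        cmd += ["-c", cfg]
    cmd += ["-o", out]
    return cmd


# Union of two BRAKER gene sets: pass both through -k instead of -g.
# All file names below are placeholders.
union_cmd = build_tsebra_command(
    keep_gtfs=["braker_rnaseq.gtf", "braker_protein.gtf"],
    hintfiles=["hints_rnaseq.gff", "hints_protein.gff"],
    out="union.gtf",
)
print(shlex.join(union_cmd))
```

The printed line can be pasted into a shell once the flags have been confirmed for the installed TSEBRA version. Moving a file from keep_gtfs to gtfs switches that gene set back to the normal filtered behaviour.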
Hi Lars,
Your advice worked, and I now have a proteome with optimal BUSCO completeness. Thank you very much!
Since this was not obvious from the TSEBRA documentation or from the built-in --help message, would it be possible to add an explanation of the different results from the -g versus -k arguments to TSEBRA's documentation?
Thanks again,
--Erich
I have run BRAKER2 with both protein-based and RNA-seq-based evidence for a parasitic nematode genome and then merged the predictions using TSEBRA. I have then tested the completeness of gene predictions for the BRAKER2 runs using 3,131 nematode-specific index genes in BUSCO (version 5). What I find frustrating is that I can see that TSEBRA is very likely to be discarding some real nematode genes. Here are the BUSCO completeness scores that I see for various BRAKER2 gene sets for my new genome:
The good news is that TSEBRA does bump up the observed completeness a bit from what I would get simply by running BRAKER2 with --etpmode (from 86.0% to 86.4%). The bad news is that TSEBRA is clearly leaving roughly 1.8% of real genes on the table (as seen by comparing 86.4% to 88.2% BUSCO scores), presumably because TSEBRA tries to filter out weaker predictions in favor of stronger predictions. Option [4] lets me see that there exist roughly 1.8% real genes that are being missed by TSEBRA, but pooling two proteomes is a clumsy hack and does not give me a coherent gene set which I can proceed with.
It would be very useful if there were some way to merge all of the gene predictions from both [2] and [3] without discarding any of them, with overlaps being treated as isoforms of a single gene, and with the final product having consistent gene/transcript names. That would perhaps let me achieve 88.2% BUSCO completeness, which TSEBRA is not currently allowing.
Is there some way to adjust TSEBRA parameters so that such a merge is possible? Barring that, is there some other software program which would make such merges easy to do?