How to get a full merge of protein-based and RNA-seq-based predictions without rejecting any?

SchwarzEM commented 2 years ago

I have run BRAKER2 with both protein-based and RNA-seq-based evidence for a parasitic nematode genome and then merged the predictions using TSEBRA. I have then tested the completeness of gene predictions for the BRAKER2 runs using 3,131 nematode-specific index genes in BUSCO (version 5). What I find frustrating is that I can see that TSEBRA is very likely to be discarding some real nematode genes. Here are the BUSCO completeness scores that I see for various BRAKER2 gene sets for my new genome:

[1] Naive protein-and-RNA-seq evidence: 86.0%
[2] Protein-only evidence mode: 85.1%
[3] RNA-seq-only evidence: 85.6%
[4] Proteins predicted by [2] and [3], pooled naively: 88.2%
[5] Proteins predicted by the TSEBRA merge of [2] and [3]: 86.4%

The good news is that TSEBRA does bump up the observed completeness a bit from what I would get simply by running BRAKER2 with --etpmode (from 86.0% to 86.4%). The bad news is that TSEBRA is clearly leaving roughly 1.8% of real genes on the table (as seen by comparing the 86.4% and 88.2% BUSCO scores), presumably because TSEBRA filters out weaker predictions in favor of stronger ones. Option [4] lets me see that roughly 1.8% of real genes are being missed by TSEBRA, but pooling two proteomes is a clumsy hack and does not give me a coherent gene set that I can proceed with.

It would be very useful if there were some way to merge all of the gene predictions from both [2] and [3] without discarding any of them, with overlaps being treated as isoforms of a single gene, and with the final product having consistent gene/transcript names. That would perhaps let me achieve 88.2% BUSCO completeness, which TSEBRA is not currently allowing.

Is there some way to adjust TSEBRA parameters so that such a merge is possible? Barring that, is there some other software program which would make such merges easy to do?

LarsGab commented 2 years ago

Hi, you can do this with the --keep_gtf option of TSEBRA. You specify gene sets with it in the same way as with the --gtf option, the difference being that all transcripts of these gene sets are kept in the result. If you want the union of two gene sets, specify both of them with the --keep_gtf option instead of the --gtf option.
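For illustration, here is a minimal Python sketch that assembles such a command line. Only --keep_gtf and --gtf (short forms -k and -g) are named in this thread. The -e (hint files), -c (configuration file) and -o (output) flags are assumed from the TSEBRA README and should be checked against tsebra.py --help. All file names and the helper function build_tsebra_union_command are hypothetical placeholders.

```python
import shlex


def build_tsebra_union_command(keep_gtfs, hint_files, cfg, out_gtf, tsebra="tsebra.py"):
    """Build an argument list for a TSEBRA run that keeps every transcript.

    All gene sets are passed via --keep_gtf (and none via --gtf), so the
    result should be the union of the input predictions.
    The -e, -c and -o flags are assumptions taken from the TSEBRA README.
    """
    if not keep_gtfs:
        raise ValueError("at least one gene set is required")
    return [
        tsebra,
        "--keep_gtf", ",".join(keep_gtfs),  # comma-separated, same syntax as --gtf
        "-e", ",".join(hint_files),         # assumed: extrinsic hint files
        "-c", cfg,                          # assumed: TSEBRA configuration file
        "-o", out_gtf,                      # assumed: merged output GTF
    ]


# Hypothetical file names for the protein-based and RNA-seq-based BRAKER runs.
cmd = build_tsebra_union_command(
    keep_gtfs=["braker_proteins.gtf", "braker_rnaseq.gtf"],
    hint_files=["hints_proteins.gff", "hints_rnaseq.gff"],
    cfg="default.cfg",
    out_gtf="tsebra_union.gtf",
)

# Print the command for inspection; to actually run it, one could use
# subprocess.run(cmd, check=True) once the paths point at real files.
print(shlex.join(cmd))
```

Passing the same files with --gtf instead of --keep_gtf would give the default filtered merge, which corresponds to result [5] in the question.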

Best, Lars

SchwarzEM commented 2 years ago

Hi Lars,

Your advice worked, and I now have a proteome with optimal BUSCO completeness. Thank you very much!

Since this was not obvious from the TSEBRA documentation or from the built-in --help message, would it be possible to add an explanation of the different results from the -g versus -k arguments to TSEBRA's documentation?

Thanks again,

--Erich

Gaius-Augustus / TSEBRA

How to get a full merge of protein-based and RNA-seq-based predictions without rejecting any? #20