BUSCO completeness decrease after tsebra

yimingweng commented 1 year ago

Hi, thank you for making this useful tool for handling the separate output files from BRAKER. I have a genome assembly that I am trying to annotate with both a protein database from OrthoDB and RNA-seq reads from a closely related species. Here are the BUSCO results from each approach and from TSEBRA:

protein database: C:95.4%[S:86.3%,D:9.1%],F:2.6%,M:2.0%
RNAseq read: C:92.2%[S:86.1%,D:6.1%],F:4.2%,M:3.6%
Tsebra of two models combined: C:81.0%[S:75.1%,D:5.9%],F:4.5%,M:14.5%
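For comparing runs like these side by side, a minimal Python sketch that parses the one-line BUSCO summaries quoted above may help. The function name `parse_busco` and the run labels are made up for illustration; only the three summary strings come from this thread.

```python
import re

# Pattern for a one-line BUSCO summary such as
# "C:95.4%[S:86.3%,D:9.1%],F:2.6%,M:2.0%"
BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%"
)


def parse_busco(summary):
    """Return the BUSCO percentages in a summary string as a dict of floats."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary line: {summary!r}")
    return {key: float(value) for key, value in match.groupdict().items()}


# Scores quoted in this thread (labels are hypothetical).
runs = {
    "protein": "C:95.4%[S:86.3%,D:9.1%],F:2.6%,M:2.0%",
    "rnaseq": "C:92.2%[S:86.1%,D:6.1%],F:4.2%,M:3.6%",
    "tsebra": "C:81.0%[S:75.1%,D:5.9%],F:4.5%,M:14.5%",
}
scores = {name: parse_busco(line) for name, line in runs.items()}

for name, s in scores.items():
    print(f"{name:8s} complete={s['C']:5.1f} duplicated={s['D']:4.1f} missing={s['M']:5.1f}")

# Completeness lost by the combined set relative to the better of the two inputs.
best_input = max(scores["protein"]["C"], scores["rnaseq"]["C"])
drop = best_input - scores["tsebra"]["C"]
print(f"drop after TSEBRA: {drop:.1f} percentage points")
```

Printing the missing fraction next to completeness makes it easier to see that most of the drop shows up as missing BUSCOs rather than fragmented ones.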

It seems that TSEBRA did not increase the BUSCO completeness. I wonder whether that means the two models have a lot of conflicts. Are the 81% complete genes from TSEBRA more reliable because they are a consensus of the two models?

I also used --keep_gtf to keep all the genes, but ended up with a high duplication rate (~16%). Would this duplication eventually cause issues for functional annotation tools like DIAMOND or InterProScan? Thank you.

YiMing

LarsGab commented 1 year ago

Hi YiMing,

thanks for using TSEBRA. It is difficult to say what causes the low BUSCO score in your case without looking at the GTF files and seeing them visualized in a genome browser. I suspect that a lot of transcripts aren't supported by many hints from the extrinsic evidence, which is why TSEBRA filters them out. In your case, I would recommend trying two things:

1. Run TSEBRA again and use the gene set predicted with the protein data as input to --keep_gtf and the RNA-Seq one as input to --gtf. This will filter out at least some genes.

2. Run the normal TSEBRA (without --keep_gtf) with a custom configuration file in which you lower the value for intron_support (e.g. to 0.5), so that fewer transcripts with 'low' evidence support are filtered out.

Then I would recommend visualizing and analyzing all gene sets to see what makes the most sense for your data.

Best, Lars
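The two suggestions above could be prepared with a small Python sketch like the following. It rests on assumptions: the configuration file is taken to be plain text with one whitespace-separated `key value` pair per line and `#` comments, the helper `set_cfg_value` and the file names `protein.gtf` and `rnaseq.gtf` are hypothetical, and the starting value in the example text is a placeholder rather than a documented default.

```python
def set_cfg_value(cfg_text, key, value):
    """Return cfg_text with the line for `key` set to `value` (appended if absent)."""
    out, found = [], False
    for line in cfg_text.splitlines():
        fields = line.split()
        if fields and fields[0] == key:
            out.append(f"{key} {value}")
            found = True
        else:
            # Comments, blank lines and other keys pass through unchanged.
            out.append(line)
    if not found:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# Placeholder config text; 0.75 is only illustrative, not a documented default.
example_cfg = "# required fraction of supported introns\nintron_support 0.75\n"

# Suggestion 2: lower intron_support so fewer weakly supported transcripts are dropped.
custom_cfg = set_cfg_value(example_cfg, "intron_support", 0.5)
print(custom_cfg)

# Suggestion 1: the protein-based set goes to --keep_gtf, the RNA-Seq set to --gtf.
# Only these two options are named in the thread; hint files and the output path
# still have to be supplied and are deliberately left out here.
suggestion_1_args = ["--keep_gtf", "protein.gtf", "--gtf", "rnaseq.gtf"]
print(" ".join(suggestion_1_args))
```

In practice the text passed to `set_cfg_value` would be read from a copy of the configuration file shipped with TSEBRA, so that every other setting stays as distributed.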

yimingweng commented 1 year ago

Hi, thank you for the response. I tried what you suggested, but the BUSCO duplication rate remained high. I guess the closely related species is not "close" enough, so the unsupported transcripts were dropped. Luckily, it seems that the protein database works pretty well at any evolutionary distance (C:92.9%[S:91.5%,D:1.4%] when considering only the longest isoform). So I might just use the protein-trained model for the downstream analyses. Again, thank you very much; I really appreciate your help and your work designing this tool.

Yi-Ming
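The thread does not say how the longest isoform per gene was selected. One possible approach, sketched here in Python, assumes a GTF whose CDS lines carry `gene_id` and `transcript_id` attributes and takes "longest" to mean the largest summed CDS length. The function name `longest_isoforms` and the toy records are made up for illustration.

```python
import re
from collections import defaultdict

# Picks gene_id and transcript_id out of the GTF attribute column, in any order.
ATTR_RE = re.compile(r'(gene_id|transcript_id) "([^"]+)"')


def longest_isoforms(gtf_lines):
    """Map each gene_id to the transcript_id with the largest total CDS length."""
    cds_len = defaultdict(int)  # (gene_id, transcript_id) -> summed CDS bases
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        if "gene_id" not in attrs or "transcript_id" not in attrs:
            continue
        # GTF coordinates are 1-based and inclusive.
        length = int(fields[4]) - int(fields[3]) + 1
        cds_len[(attrs["gene_id"], attrs["transcript_id"])] += length

    best = {}  # gene_id -> (transcript_id, total CDS length)
    for (gene, transcript), length in cds_len.items():
        if gene not in best or length > best[gene][1]:
            best[gene] = (transcript, length)
    return {gene: transcript for gene, (transcript, _) in best.items()}


# Toy records: gene g1 has two isoforms, gene g2 has one.
toy = [
    'chr1\tsrc\tCDS\t1\t90\t.\t+\t0\tgene_id "g1"; transcript_id "g1.t1";',
    'chr1\tsrc\tCDS\t1\t150\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";',
    'chr1\tsrc\tCDS\t300\t359\t.\t-\t0\tgene_id "g2"; transcript_id "g2.t1";',
]
print(longest_isoforms(toy))
```

Keeping one transcript per gene before running BUSCO means alternative isoforms of the same gene are not counted as duplicated hits, which fits the much lower D value reported for the longest-isoform set.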

Gaius-Augustus / TSEBRA

BUSCO completeness decrease after tsebra #23