Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Using "long-reads" resulted in a lower number of annotated genes #461

Closed yulong1227 closed 8 months ago

yulong1227 commented 2 years ago

Hello! I used long-read RNA-Seq data with the new "long-reads" branch to annotate a mammalian genome, following this protocol: https://github.com/Gaius-Augustus/BRAKER/blob/master/docs/long_reads/long_read_protocol.md

When I used only short-read RNA-Seq data and proteins, combined by TSEBRA, annotating the result against Swiss-Prot gave an appropriate number of genes and encoded proteins.

But when I combined short-read RNA-Seq with long-read RNA-Seq using the same pipeline, the number of annotated genes dropped by a fifth.

Could this be caused by the parameter settings, or by something else?

Looking forward to your reply!

LarsGab commented 2 years ago

Hi,

thanks for trying out our new long-read protocol. The decrease in the number of transcripts is probably caused by the long-read configuration for TSEBRA. If you want to adjust the long-read [configuration](https://github.com/Gaius-Augustus/TSEBRA/blob/long_reads/config/long_reads_filtered.cfg) of TSEBRA, I recommend that you decrease the 'intron_support' parameter (e.g. to 0.8). This might increase the number of genes in your result. If you want to keep more transcript isoforms per gene, I recommend that you increase the parameters 'e_4', 'e_5', 'e_6' (e.g. increase by 300). For more information about the parameters, see our [TSEBRA paper](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-021-04482-0) or take a look at [this](https://github.com/Gaius-Augustus/TSEBRA/issues/13) issue, where I gave an informal description of these parameters. I hope this helps you. Best, Lars
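As a rough illustration of the adjustment suggested above, the following Python sketch edits a TSEBRA-style configuration. It assumes the file consists of whitespace-separated `name value` lines. The helper `adjust_cfg` and the starting values in `example` are hypothetical and are not the contents of `long_reads_filtered.cfg`.

```python
def adjust_cfg(text, set_values=None, add_values=None):
    """Return cfg text with some parameters overwritten or incremented.

    Assumes each parameter line has the form "name value".
    Comments, blank lines, and anything else are copied unchanged.
    """
    set_values = set_values or {}
    add_values = add_values or {}
    out = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2 or line.lstrip().startswith("#"):
            out.append(line)
            continue
        name, value = parts
        if name in set_values:
            # Overwrite with the requested value.
            value = str(set_values[name])
        elif name in add_values:
            # Increase the existing numeric value by the given amount.
            value = format(float(value) + add_values[name], "g")
        out.append(f"{name} {value}")
    return "\n".join(out) + "\n"


# Hypothetical starting values, for illustration only.
example = """# hypothetical values
intron_support 1.0
e_4 100
e_5 200
e_6 50
"""

new_cfg = adjust_cfg(
    example,
    set_values={"intron_support": 0.8},
    add_values={"e_4": 300, "e_5": 300, "e_6": 300},
)
print(new_cfg)
```

Writing the result to a new file and keeping the original configuration untouched makes it easier to compare TSEBRA runs with different settings.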

yulong1227 commented 2 years ago

Thank you for your useful advice! The number of annotated genes improved greatly after adjusting "long_reads_filtered.cfg". However, the results obtained with different parameter settings also differ from each other, in some cases by several hundred genes, and results that are too high or too low are unreliable. How can I find the annotation that is closest to the true gene set among the different parameter settings? Or are there other evaluation metrics for finding the most convincing result? Looking forward to your reply!

LarsGab commented 2 years ago

Hi, for different parameters, it is unfortunately not clear which TSEBRA output is more accurate than the others. However, you can try to evaluate the gene sets with BUSCO to get a metric for the sensitivity of your results. Also, you can compare the gene sets to a reference annotation to evaluate them, if you find one for your species. Best, Lars
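Before running BUSCO, it can help to tabulate how many genes and transcripts each parameter setting produces. The sketch below assumes the gene sets are GTF files whose feature lines carry `gene_id "..."` and `transcript_id "..."` attributes in the ninth column; the function name and the toy records are made up for illustration.

```python
import re

GENE_RE = re.compile(r'gene_id "([^"]+)"')
TX_RE = re.compile(r'transcript_id "([^"]+)"')


def count_genes_and_transcripts(gtf_lines):
    """Count distinct gene and transcript IDs in an iterable of GTF lines."""
    genes, transcripts = set(), set()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attributes = fields[8]
        gene = GENE_RE.search(attributes)
        transcript = TX_RE.search(attributes)
        if gene:
            genes.add(gene.group(1))
        if transcript:
            transcripts.add(transcript.group(1))
    return len(genes), len(transcripts)


# Toy records (hypothetical), two genes with three transcripts in total.
toy = [
    'chr1\tAUGUSTUS\tCDS\t100\t200\t.\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";\n',
    'chr1\tAUGUSTUS\tCDS\t100\t260\t.\t+\t0\ttranscript_id "g1.t2"; gene_id "g1";\n',
    'chr1\tGeneMark.hmm\tCDS\t900\t990\t.\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";\n',
]
n_genes, n_transcripts = count_genes_and_transcripts(toy)
print(n_genes, n_transcripts)
```

For a real file, pass an open file handle instead of the list, for example `count_genes_and_transcripts(open(path))`. Counts alone do not say which gene set is more accurate, so they complement BUSCO or a reference comparison rather than replace them.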

amvarani commented 2 years ago

Hi there! Similar issue here! Benchmarking the annotation with BUSCO gave me the following results:

TSEBRA (no long reads protocol): C:98.7%[S:92.1%,D:6.6%],F:0.7%,M:0.6%,n:1614 (8 missing BUSCOs)

TSEBRA (long reads protocol): C:97.1%[S:85.6%,D:11.5%],F:0.9%,M:2.0%,n:1614 (32 missing BUSCOs)
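To compare such runs side by side, the one-line BUSCO summaries can be parsed and subtracted. This is a minimal sketch that assumes the summary string has exactly the layout quoted above; `parse_busco` is a hypothetical helper, not part of BUSCO.

```python
import re

BUSCO_RE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(summary):
    """Parse a BUSCO one-line summary into percentages plus the total n."""
    match = BUSCO_RE.search(summary)
    if match is None:
        raise ValueError(f"not a BUSCO summary: {summary!r}")
    scores = {key: float(match.group(key)) for key in "CSDFM"}
    scores["n"] = int(match.group("n"))
    return scores


# Summary strings copied from the two runs reported above.
no_long = parse_busco("C:98.7%[S:92.1%,D:6.6%],F:0.7%,M:0.6%,n:1614")
with_long = parse_busco("C:97.1%[S:85.6%,D:11.5%],F:0.9%,M:2.0%,n:1614")

# Change in percentage points when switching to the long-read protocol.
delta = {key: round(with_long[key] - no_long[key], 1) for key in "CSDFM"}
print(delta)
```

For these two runs, complete BUSCOs drop by 1.6 percentage points while duplicated BUSCOs rise by 4.9, and the missing count goes from 8 to 32 of 1614.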