Too many genes predicted

Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Too many genes predicted #65

Closed SvitlanaLukicheva closed 4 years ago

SvitlanaLukicheva commented 4 years ago

Hello,

I performed an annotation of my genome (N50 = 326 kb) with BRAKER with RNA-Seq. Repeats were softmasked with RepeatModeler + RepeatMasker and the RNA-Seq data was aligned to the softmasked genome with hisat2.

The run succeeded, but the number of predicted genes is too high compared to what was expected. We expect roughly 20k genes, but BRAKER predicted 54k.

This is the first time I have worked on a genome annotation, so it is not clear to me whether the problem comes from BRAKER or from the parameters I used, or whether this is common behavior and there is a way to filter out some genes, perhaps according to their score in the GTF file?

KatharinaHoff commented 4 years ago

I recommend that you visualize your annotation in a genome browser together with the RNA-Seq data that supports it. GBrowse2, JBrowse and the UCSC Genome Browser are example tools for this. Try MakeHub (https://github.com/Gaius-Augustus/MakeHub) if you want to use the UCSC Genome Browser. Possible reasons for a huge number of genes include:

(a) low assembly quality (e.g. wrong stop codons that interrupt longer genes and lead to a gene split, or short contigs with partial genes),
(b) insufficient repeat masking, which will e.g. lead to the prediction of many copies of transposons (you can check this by running BLAST or DIAMOND with the predicted proteins against each other and counting the number of high-quality hits),
(c) an actual AUGUSTUS problem with this species that might lead to split genes where there should be no split.
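The check in (b) could be sketched roughly as follows. This assumes the all-vs-all search was written in the standard 12-column BLAST tabular format (outfmt 6, also produced by DIAMOND), where column 3 is percent identity and column 11 is the e-value. The function name and the thresholds for a "high-quality" hit are illustrative assumptions, not values from BRAKER or from this thread.

```python
from collections import defaultdict

def count_hits_per_query(rows, max_evalue=1e-5, min_identity=0.0):
    """Count distinct non-self subjects per query in BLAST tabular (outfmt 6) rows.

    max_evalue and min_identity are illustrative cutoffs (assumptions).
    """
    partners = defaultdict(set)
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.split("\t")
        query, subject = fields[0], fields[1]
        identity = float(fields[2])   # column 3: percent identity
        evalue = float(fields[10])    # column 11: e-value
        if query == subject:
            continue  # every sequence trivially matches itself
        if evalue > max_evalue or identity < min_identity:
            continue  # not a high-quality hit under the assumed cutoffs
        partners[query].add(subject)
    return {q: len(s) for q, s in partners.items()}

# Tiny made-up example; the sequence IDs are hypothetical.
example = [
    "g1.t1\tg1.t1\t100.0\t300\t0\t0\t1\t300\t1\t300\t0.0\t600",
    "g1.t1\tg2.t1\t85.0\t290\t40\t1\t1\t290\t5\t294\t1e-120\t410",
    "g1.t1\tg3.t1\t30.0\t50\t35\t0\t1\t50\t1\t50\t0.5\t25",
    "g2.t1\tg1.t1\t85.0\t290\t40\t1\t5\t294\t1\t290\t1e-120\t410",
]
print(count_hits_per_query(example))
```

Sequences with no surviving hit do not appear in the result, so they count as zero. If several transcripts per gene were searched, isoforms of the same gene will match each other; collapsing IDs to the gene level before counting may be worth considering.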

Katharina

SvitlanaLukicheva commented 4 years ago

Hello KatharinaHoff,

Thank you very much for your help!

I visualized the annotation with IGV and the predictions seem consistent with the RNA-Seq data.

Following your suggestion, I BLASTed the predicted genes against themselves and obtained 712k matches, meaning that each gene matches on average about 13 other genes. Does this mean that I have a problem with my repeat masking? To be sure that softmasked parts of the genome weren't annotated, I also tried running BRAKER on a hardmasked genome, but the result is similar.

KatharinaHoff commented 4 years ago

If you have an average of 13 matches for each predicted gene, you will likely find one group of genes with very many matches, and "normal" genes that have none, or perhaps one or two paralogues. You want to identify these groups. Your main interest is probably in the genes that have few paralogues, because the ones with a high number of paralogues will be something like transposable elements or similar. You probably don't have to rerun anything. If you say the gene models are in general well supported, you might just want to separate those two groups of genes.
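As a minimal sketch of that separation, assume you already have a per-gene count of matches to other genes. The function name, the gene IDs and the cutoff of two matches are hypothetical; the cutoff is only loosely taken from the "one or two paralogues" remark, and a histogram of the counts is probably a better guide for a real genome.

```python
def split_by_paralogue_count(all_genes, hit_counts, max_few=2):
    """Split genes into (few, many) by their number of matches to other genes.

    Genes missing from hit_counts are treated as having zero matches.
    max_few is an illustrative cutoff (assumption), not a recommended value.
    """
    few, many = [], []
    for gene in all_genes:
        n = hit_counts.get(gene, 0)
        (few if n <= max_few else many).append(gene)
    return few, many

# Made-up counts: two ordinary genes and two members of a large repeat family.
genes = ["g1", "g2", "g3", "g4"]
counts = {"g2": 1, "g3": 40, "g4": 38}

few, many = split_by_paralogue_count(genes, counts)
mean_hits = sum(counts.values()) / len(genes)
print(few, many, mean_hits)
```

In this toy example the mean is high even though half of the genes have at most one match, which is the kind of skew a mean of about 13 matches per gene could be hiding.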

SvitlanaLukicheva commented 4 years ago

Thank you again for your help! The first investigations of the predicted genes confirm your suggestion. :)