TobyBaril / EarlGrey

Earl Grey: A fully automated TE curation and annotation pipeline

Soft-masking results from Earl Grey for Braker2 annotation #155

Open lyy005 opened 1 week ago

lyy005 commented 1 week ago

Dear Toby, Thank you for making this wonderful tool!

I have been using Earl Grey and Tantan to soft-mask genomes of aphid species, then using the masked genomes as input to Braker2 for gene annotation. I know Tantan masks only simple repeats. Tantan + Braker2 gave me ~22,000 genes, which is a fairly normal number of protein-coding genes for aphids. However, with the soft-masked genome from Earl Grey, Braker2 gave me only ~15,000 genes.

Would you recommend using Earl Grey to soft-mask genomes before gene annotation, so that fewer transposable elements end up annotated as genes? Or do you think soft-masked genomes from Earl Grey might cause Braker2 to miss real protein-coding genes?

Thank you for any suggestions!

YY

TobyBaril commented 1 day ago

Hi,

In this case there isn't necessarily a single correct answer. If you mask only simple repeats, there is a pretty high chance that at least some of the annotated genes will actually be transposable elements, because TEs contain coding domains. On the other hand, using raw RepeatModeler (or Earl Grey) output will prevent many TEs from being annotated as host genes. Where you will need to take some care is with multi-copy genes: because they are detected multiple times by RepeatModeler in the de novo TE detection step, they can be erroneously annotated as TEs. Keep this in mind if you expect to find multi-copy genes in your genomes.

The best approach here would be to use RNAseq-informed gene annotation if possible, along with some refinement of the Earl Grey annotations to remove potential host gene sequences. You could do this by BLASTing the consensus library against CDD (Conserved Domain Database) or the NCBI NR database and removing anything that has very good matches to non-TE proteins covering a reasonable proportion of the total consensus length, and that does not have good matches to RepBase protein domains (i.e. TE-derived domains).
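The filtering step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not part of Earl Grey: it assumes two BLAST tabular result files were produced beforehand (one against a non-TE protein database, one against a TE protein database) with `-outfmt "6 qseqid sseqid evalue bitscore qstart qend qlen"`, and the function names and the thresholds (`max_evalue`, `min_cov`) are hypothetical placeholders to be tuned, since no specific cutoffs are given here.

```python
from collections import defaultdict

# Assumed column order of the BLAST tabular rows (see the lead-in above).
COLUMNS = ["qseqid", "sseqid", "evalue", "bitscore", "qstart", "qend", "qlen"]


def query_coverage(rows, max_evalue):
    """Return {query_id: fraction of the query covered by hits with evalue <= max_evalue}."""
    intervals = defaultdict(list)
    qlens = {}
    for row in rows:
        rec = dict(zip(COLUMNS, row))
        if float(rec["evalue"]) > max_evalue:
            continue
        # Minus-strand blastx hits report qstart > qend, so normalise the order.
        start, end = sorted((int(rec["qstart"]), int(rec["qend"])))
        intervals[rec["qseqid"]].append((start, end))
        qlens[rec["qseqid"]] = int(rec["qlen"])

    coverage = {}
    for qid, spans in intervals.items():
        covered = 0
        cur_start = cur_end = None
        # Merge overlapping hit intervals so shared positions are counted once.
        for start, end in sorted(spans):
            if cur_end is None or start > cur_end:
                if cur_end is not None:
                    covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
            else:
                cur_end = max(cur_end, end)
        covered += cur_end - cur_start + 1
        coverage[qid] = covered / qlens[qid]
    return coverage


def likely_host_genes(non_te_rows, te_rows, max_evalue=1e-10, min_cov=0.5):
    """Consensus IDs with strong non-TE protein matches and no passing TE-domain match."""
    non_te = query_coverage(non_te_rows, max_evalue)
    te = query_coverage(te_rows, max_evalue)
    return {qid for qid, cov in non_te.items() if cov >= min_cov and qid not in te}


def filter_fasta(lines, drop_ids):
    """Yield FASTA lines, skipping records whose ID (first header token) is in drop_ids."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            keep = line[1:].split()[0] not in drop_ids
        if keep:
            yield line
```

To use it, each results file could be read with `csv.reader(handle, delimiter="\t")` and passed in as `non_te_rows` and `te_rows`; the returned IDs are then given to `filter_fasta` together with the lines of the consensus library FASTA. Any sequence flagged this way is worth checking by eye before removal, since the cutoffs are arbitrary.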

I hope this helps!

lyy005 commented 1 day ago

Hi Toby, Thank you so much for the helpful suggestions!

I have three follow-up questions about your suggestion: "You could do this by BLASTing the consensus library against CDD (Conserved Domain Database) or the NCBI NR database and removing anything that has very good matches to non-TE proteins covering a reasonable proportion of the total consensus length, and that does not have good matches to RepBase protein domains (i.e. TE-derived domains)."

  1. Is this file the consensus library: [speciesName]_summaryFiles/[speciesName]_combined_library.fasta?

  2. My goal is to annotate my genomes while minimizing the number of TEs in my gene annotations. After I remove the non-TE sequences from the consensus library, what would be the best way to re-mask my genome with the updated consensus library? Would that be the "-l == Starting consensus library for an inital mask (in fasta format)" option?

  3. Alternatively, could I just use Tantan to mask simple repeats, run Braker for annotation, and then BLAST the annotated protein-coding genes against RepBase to remove potential TE genes?

Thank you again for your help!

YY