Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Braker with RNA-seq: soft- vs hardmasking #188

Closed FabianDK closed 2 years ago

FabianDK commented 4 years ago

Dear authors,

I am trying to run Braker with RNA-seq data, and I have a question about your recommendation to use a softmasked genome.

In your tutorial for Augustus (https://github.com/Gaius-Augustus/Augustus/blob/master/docs/tutorial2018/index.html) you mapped RNA-seq reads with STAR against the hardmasked genome, but then used the softmasked reference version for Augustus and Braker.

Should the same be done when using Braker2? If so, could you please explain the reason for doing it this way, and its advantage over using only the softmasked reference throughout (i.e. for both the STAR mapping and BRAKER)?

I am using RepeatModeler2 to identify repeats, RepeatMasker for masking, and STAR to align paired-end RNA-seq reads.

Many thanks, Daniel

KatharinaHoff commented 2 years ago

It remains a valid question.

In practice, we often map RNA-Seq data against the softmasked genome, using aligners that ignore soft masking (that is, they treat lowercase bases like uppercase ones). This generates evidence for introns in softmasked regions. Generally, this is not a problem. For example, AUGUSTUS will usually not initiate a gene structure in a fully repeat-masked region. It may initiate in a neighboring unmasked region and extend into the masked region using the evidence, and that's OK.
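To make the distinction concrete, here is a minimal Python sketch of what the two masking styles look like at the sequence level. It is only an illustration: the function names `hardmask` and `softmasked_fraction` and the toy sequence are made up for this example and are not part of BRAKER, AUGUSTUS, or STAR. It assumes the usual convention that softmasked bases are lowercase and hardmasked bases are replaced by `N`.

```python
import re


def hardmask(seq: str) -> str:
    """Turn a softmasked sequence into a hardmasked one (lowercase -> N)."""
    return re.sub(r"[a-z]", "N", seq)


def softmasked_fraction(seq: str) -> float:
    """Return the fraction of bases that are softmasked (lowercase)."""
    if not seq:
        return 0.0
    return sum(1 for base in seq if base.islower()) / len(seq)


# Toy sequence (hypothetical): 4 unmasked, 8 softmasked, 4 unmasked bases.
soft = "ACGTacgtacgtGGCC"

# An aligner that ignores soft masking effectively sees the uppercase sequence,
# so reads can still align inside the repeat.
seen_by_case_insensitive_aligner = soft.upper()

# After hardmasking, the repeat is gone and nothing can align there.
hard = hardmask(soft)

print(seen_by_case_insensitive_aligner)  # ACGTACGTACGTGGCC
print(hard)                              # ACGTNNNNNNNNGGCC
print(softmasked_fraction(soft))         # 0.5
```

The point of the sketch is that softmasking keeps the sequence information, so intron evidence can still be produced in repeat regions, whereas hardmasking discards it before the aligner ever sees it.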

This answer explicitly does not apply to the integration of long-read data or assembled short-read data with TSEBRA.