Masked vs. Unmasked Genome Question

MattHuff commented 1 year ago

I am working on a project that uses BRAKER to annotate recently assembled genomes of wild plant species. Along the way, we became concerned that stringent repeat masking might cause key genes, particularly resistance genes, to be lost. To test this, I ran our BRAKER pipeline twice: once on the masked reference genome and once on the unmasked reference genome. The pipeline follows the protocol in which BRAKER is run twice, once with RNA-Seq evidence and once with protein evidence, followed by TSEBRA to merge the two BRAKER outputs.

Our expectation was that, because no regions are masked, BRAKER would call more genes in the unmasked genome than in the masked genome. The individual BRAKER results match this expectation, as shown below. The question arises after merging the individual BRAKER results with TSEBRA: the unmasked genome ends up with fewer genes than the masked genome. Is this expected behavior from TSEBRA, given that there are more overlapping gene predictions to merge? I am running BRAKER version 2.1.6 and the most recent version of TSEBRA.

BRAKER gene counts - Masked genome

RNASeq Evidence - 27,855 Genes
Protein Evidence - 30,775 Genes
Combined - 27,608 Genes

BRAKER gene counts - Unmasked genome

RNASeq Evidence - 28,806 Genes
Protein Evidence - 53,832 Genes
Combined - 22,835 Genes

Output of TSEBRA when running on the unmasked genome (seems to be running without issue):

### READING GENE PREDICTION: [Primary-RNA.braker.gtf]
### READING GENE PREDICTION: [Primary-prot.braker.gtf]
### READING EXTRINSIC EVIDENCE: [../Primary_RNA/braker/hintsfile.gff]
### READING EXTRINSIC EVIDENCE: [../Primary_protein/braker/hintsfile.gff]
### BUILD OVERLAP GRAPH
### ADD FEATURES TO TRANSCRIPTS
### SELECT TRANSCRIPTS
### WRITE COMBINED GENE PREDICTION
### FINISHED

### The combined gene prediciton is located at unmasked_primary_braker_combined.gtf.

smallfishcui commented 1 year ago

I found the same issue. Even masking only some highly repetitive regions, where RNA-Seq coverage is extremely high, would help improve the annotation. Hopefully someone can answer this question.

SchwarzEM commented 1 year ago

One suggestion: try running TSEBRA with --keep-gtf (i.e., -k) arguments rather than --gtf (-g) arguments. This should retain more genes in the merging process (though at the cost of more predicted isoforms). Conceivably, running TSEBRA this way would reverse the paradoxical (and, it sounds like, unbiological) results that you are seeing.
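A minimal sketch of the two invocations being contrasted, written as a small Python helper that only assembles the argument list. The input file names are taken from the TSEBRA log in the original post; the script name `tsebra.py`, the comma-separated list format, the `-e`, `-c`, and `-o` flags, and the second output file name are assumptions for illustration and should be checked against `tsebra.py --help` for the installed version.

```python
import shlex


def build_tsebra_command(gtfs, hints, out, enforce_all=False, cfg=None,
                         script="tsebra.py"):
    """Assemble a TSEBRA argument list (nothing is executed here).

    enforce_all=False passes the gene sets with -g, so transcripts may be
    filtered out during merging.
    enforce_all=True passes them with -k, asking TSEBRA to keep every
    transcript from those gene sets.
    """
    gene_flag = "-k" if enforce_all else "-g"
    cmd = [script, gene_flag, ",".join(gtfs), "-e", ",".join(hints)]
    if cfg is not None:
        # Assumption: a config file is optional and passed with -c.
        cmd += ["-c", cfg]
    cmd += ["-o", out]
    return cmd


# File names as they appear in the log of the original post.
GTFS = ["Primary-RNA.braker.gtf", "Primary-prot.braker.gtf"]
HINTS = ["../Primary_RNA/braker/hintsfile.gff",
         "../Primary_protein/braker/hintsfile.gff"]

filtered = build_tsebra_command(GTFS, HINTS,
                                "unmasked_primary_braker_combined.gtf")
# Hypothetical output name for the keep-everything run.
keep_all = build_tsebra_command(GTFS, HINTS,
                                "unmasked_primary_braker_keep.gtf",
                                enforce_all=True)

print(shlex.join(filtered))
print(shlex.join(keep_all))
```

Printing the commands rather than running them makes it easy to review both variants before launching either; the list can be handed to `subprocess.run` once the flags are confirmed.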

KatharinaHoff commented 1 year ago

When you run BRAKER on an unmasked genome, you will get a lot of wrong predictions. That's why softmasking is highly recommended.
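Before comparing runs, it can help to confirm how a given assembly is actually masked. The sketch below assumes the usual convention that softmasked bases are lowercase and hardmasked bases are `N`; note that `N` also marks assembly gaps, so that category is only an upper bound on hardmasking. The function name and the toy sequence are hypothetical.

```python
from collections import Counter


def masking_summary(fasta_lines):
    """Return the fraction of bases that are unmasked (uppercase),
    softmasked (lowercase), or hardmasked/gap (N or n)."""
    counts = Counter()
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        for base in line:
            if base in "Nn":
                counts["hardmasked_or_gap"] += 1
            elif base.islower():
                counts["softmasked"] += 1
            else:
                counts["unmasked"] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    keys = ("unmasked", "softmasked", "hardmasked_or_gap")
    return {key: counts[key] / total for key in keys}


# Toy input; for a real genome, pass an open file handle instead.
example = [">chr_demo", "ACGTacgtNNNN", "ACGTACGTacgt"]
summary = masking_summary(example)
print(summary)
```

A genome that reports almost no lowercase bases but a large `N` fraction has probably been hardmasked rather than softmasked, which is a different situation from the one recommended here.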

When running TSEBRA as you did, only the predictions that have evidence are retained in the final gene set.

The proportion of predicted genes with evidence is larger on the softmasked genome than on the unmasked genome, which explains the numbers you see.
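The arithmetic behind this can be sketched from the gene counts reported in the original post. The ratio of combined genes to genes in each individual run is only a rough proxy, because TSEBRA also merges overlapping predictions from the two runs, so it is not a direct measure of evidence support. The dictionary layout and function name are illustrative assumptions.

```python
# Gene counts as reported in the original post.
COUNTS = {
    "masked": {"rnaseq": 27855, "protein": 30775, "combined": 27608},
    "unmasked": {"rnaseq": 28806, "protein": 53832, "combined": 22835},
}


def retained_ratios(counts):
    """Ratio of combined genes to the genes of each individual BRAKER run."""
    return {
        genome: {
            "combined_vs_rnaseq": c["combined"] / c["rnaseq"],
            "combined_vs_protein": c["combined"] / c["protein"],
        }
        for genome, c in counts.items()
    }


ratios = retained_ratios(COUNTS)
# Extra genes predicted by the protein-evidence run when masking is off.
extra_protein_genes = (COUNTS["unmasked"]["protein"]
                       - COUNTS["masked"]["protein"])

for genome, r in ratios.items():
    print(genome, {name: round(value, 2) for name, value in r.items()})
print("extra protein-run genes without masking:", extra_protein_genes)
```

On these numbers, the combined set is about 0.90 of the protein-run count on the masked genome but only about 0.42 on the unmasked genome, and the unmasked protein run contributes 23,057 additional predictions. That is consistent with many of the extra unmasked predictions being dropped for lack of evidence.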


Gaius-Augustus / BRAKER

Masked vs. Unmasked Genome Question #556
