WARNING: Number of good genes is low (30).

Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

WARNING: Number of good genes is low (30). #8

Closed wingwingWY closed 5 years ago

wingwingWY commented 5 years ago

I ran BRAKER2 with RNA-Seq data on a green alga genome with low GC content. The run produced final output, but with this warning: WARNING: Number of good genes is low (30). Recommended are at least 600 genes

I then ran BUSCO with the eukaryota database on the augustus.hints.aa file generated by BRAKER, and 87.5% of the BUSCO groups are missing.

How should I proceed with this genome?

KatharinaHoff commented 5 years ago

How did you run BRAKER?

The "number of good genes" is the number of genes that is "good" or suitable for training AUGUSTUS.

You ran BRAKER with some input data that was not sufficient for generating a large number of training genes. If you ran with RNA-Seq, check whether more RNA-Seq data is available. If you ran with proteins, check whether RNA-Seq data is available. In any case: you need more extrinsic data to generate better and more training genes.

Best,

Katharina


wingwingWY commented 5 years ago

I mapped 6 Gb of RNA-Seq data to my genome with HISAT2 and ran BRAKER as follows: perl braker.pl --species=Test --genome=genome.fasta --bam=sample.bam --cores 14

I will try running BRAKER2 with proteins from more distantly related species. Can you tell me how to run the GeneMark-EP-specific protein mapping pipeline to generate a hints file?

KatharinaHoff commented 5 years ago

Is your target species one that is expected to have genes with many introns? BRAKER is designed for species with genes that have introns.

Are you using a full genome? Using only a small proportion of the genome could also explain a very low number of training genes.

A protein mapping pipeline (beta stage, they are still testing it) is available at http://exon.gatech.edu/GeneMark/Braker/protein_mapping_pipeline.tar.gz .

wingwingWY commented 5 years ago

Thank you very much! I used the full genome sequence to run BRAKER. Based on the HISAT2 + StringTie result, the average exon number is 4, so most genes have multiple exons.
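For readers who want to check the exon count on their own data, here is a minimal sketch of how a mean exon number per transcript could be computed from a StringTie-style GTF file. It only assumes the standard tab-separated GTF layout with a `transcript_id` attribute on `exon` lines. The function name and the tiny example records are made up for illustration and are not taken from this thread.

```python
from collections import Counter


def mean_exons_per_transcript(gtf_lines):
    """Return (mean exons per transcript, fraction of multi-exon transcripts)."""
    exon_counts = Counter()
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features are counted
        # GTF attributes look like: gene_id "G1"; transcript_id "G1.1";
        attrs = {}
        for item in fields[8].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attrs[key] = value.strip('"')
        transcript_id = attrs.get("transcript_id")
        if transcript_id is not None:
            exon_counts[transcript_id] += 1
    if not exon_counts:
        return 0.0, 0.0
    n = len(exon_counts)
    mean = sum(exon_counts.values()) / n
    multi_fraction = sum(1 for c in exon_counts.values() if c > 1) / n
    return mean, multi_fraction


# Made-up example: transcript t1 has three exons, transcript t2 has one.
example_gtf = [
    'chr1\tStringTie\ttranscript\t1\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t1\t100\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t300\t400\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t800\t900\t.\t+\t.\tgene_id "g1"; transcript_id "t1";',
    'chr1\tStringTie\texon\t2000\t2500\t.\t-\t.\tgene_id "g2"; transcript_id "t2";',
]
mean, multi_fraction = mean_exons_per_transcript(example_gtf)
print(mean, multi_fraction)
```

On a real file, the lines could be passed in with `open("stringtie.gtf")` (a hypothetical file name). The second return value is arguably the more useful one here, since a mean of 4 can still hide many single-exon transcripts.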

Maybe the problem is the genetic code. I ran TransDecoder to predict ORFs for the transcripts generated by StringTie and found that the genetic code table should be set to 6.
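To illustrate why the genetic code could matter for the number of training genes, here is a small sketch that translates the same coding sequence under the standard code (NCBI table 1) and under NCBI table 6, in which TAA and TAG encode glutamine and only TGA is a stop codon. The example sequence is invented, and the sketch assumes an in-frame sequence with unambiguous A/C/G/T bases. It says nothing about how BRAKER, GeneMark or AUGUSTUS handle this internally.

```python
BASES = "TCAG"  # NCBI codon ordering: TTT, TTC, TTA, TTG, TCT, ...

# Amino acids per codon in NCBI order; '*' marks a stop codon.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE6_AAS = "FFLLSSSSYYQQCC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_table(amino_acids):
    """Map each of the 64 codons to its amino acid letter."""
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))


def translate(seq, table):
    """Translate in frame 0 and stop at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        amino_acid = table[seq[i:i + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


standard = build_table(STANDARD_AAS)
table6 = build_table(TABLE6_AAS)

# Invented CDS: ATG GCT TAA GGT TAG CTG TGA
cds = "ATGGCTTAAGGTTAGCTGTGA"
print(translate(cds, standard))  # truncated at the in-frame TAA
print(translate(cds, table6))    # reads through TAA/TAG, stops at TGA
```

Under the standard code the reading frame ends at the first TAA, so a gene finder that assumes table 1 would see many genuine table 6 genes as short or interrupted ORFs.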

How can I change the genetic code for BRAKER?

KatharinaHoff commented 5 years ago

BRAKER currently cannot run with a genetic code other than the standard one. There is an open issue to fix this: https://github.com/Gaius-Augustus/BRAKER/issues/3 but the timeline is still unclear. From the AUGUSTUS point of view, this could be fixed easily, but some code development is required on the GeneMark-ES/ET side.

KatharinaHoff commented 5 years ago

I am closing this issue because it will be fixed when issue 3 is solved.

mictadlo commented 4 years ago

Hi all, how can I change the TransDecoder parameters to make its output compatible with BRAKER2? I ran TransDecoder on StringTie output as described here.

Thank you in advance,

Michal

KatharinaHoff commented 4 years ago

TransDecoder finds coding regions in assembled transcripts and can output them as features on genomic loci in gff3 format.

BRAKER uses unassembled RNA-Seq reads to predict coding regions in genomic loci.

In a way, both tools do the same thing by different means. As developers of BRAKER, we believe that it is an advantage to use the unassembled RNA-Seq reads because this avoids transferring transcript assembly errors into the genome annotation. It therefore does not make sense for us to write a transdecoder-output-to-hints parser for BRAKER.

You can, however, use transdecoder derived evidence with AUGUSTUS. Have a look at our book chapter at https://math-inf.uni-greifswald.de/storages/uni-greifswald/fakultaet/mnf/mathinf/stanke/augustus_wrp.pdf to learn more about using AUGUSTUS, format of hints files and extrinsic evidence configuration files.

mictadlo commented 4 years ago

Thank you for your explanation and the link to your book chapter. If I understand it correctly, the chapter describes all the manual steps needed to get an annotation with AUGUSTUS, and BRAKER2 performs all of these steps automatically, right?

I want to continue using BRAKER2, and I wonder whether there is a parameter that allows adding TransDecoder's CDS output?

If I use TransDecoder's CDS output together with RNA-Seq data, will I still get UTRs included in the BRAKER2 annotation?

Thank you in advance,

Michal

KatharinaHoff commented 4 years ago

It will cause problems if you simply mix TransDecoder-derived hints with RNA-Seq hints from unassembled reads in BRAKER. The two kinds of hints have different properties, but in BRAKER they will most likely both be treated as "expression hints" (this of course depends on how you format the hints file, but since BRAKER excludes a number of hints sources, source key E is likely here). I therefore recommend that you run AUGUSTUS outside of BRAKER if you want to mix both hint types. You'll need to learn about the hints format and how the extrinsic configuration file works.
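As a rough illustration of what "formatting the hints file" involves, the sketch below turns a GFF3 CDS line into an AUGUSTUS-style hint line: nine tab-separated GFF columns with a `src=` attribute naming the source key. The source key `T`, the source name `transdecoder`, and the function name are hypothetical choices for this example, not something BRAKER or TransDecoder defines. Any key used this way would also have to be declared and weighted in the extrinsic configuration file that AUGUSTUS is run with.

```python
def gff3_cds_to_hint(line, src_key="T", source_name="transdecoder"):
    """Convert one GFF3 CDS line into a CDSpart hint line, else return None.

    src_key and source_name are illustrative assumptions; the key must match
    an entry in the extrinsic configuration used for the AUGUSTUS run.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9 or fields[2] != "CDS":
        return None  # ignore genes, mRNAs, exons, UTRs, comments
    seqid, _, _, start, end, _, strand, _, attrs = fields

    # Use the GFF3 Parent as the hint group so CDS parts of one
    # transcript stay together.
    parent = ""
    for item in attrs.split(";"):
        key, _, value = item.partition("=")
        if key.strip() == "Parent":
            parent = value.strip()

    hint_attrs = f"src={src_key}"
    if parent:
        hint_attrs = f"grp={parent};{hint_attrs}"

    # The frame column is left as "." to avoid assuming a phase convention.
    return "\t".join(
        [seqid, source_name, "CDSpart", start, end, ".", strand, ".", hint_attrs]
    )


# Made-up GFF3 records for illustration.
cds_line = "chr1\ttransdecoder\tCDS\t100\t250\t.\t+\t0\tID=cds.t1;Parent=t1"
exon_line = "chr1\ttransdecoder\texon\t90\t260\t.\t+\t.\tID=exon.t1;Parent=t1"

print(gff3_cds_to_hint(cds_line))
print(gff3_cds_to_hint(exon_line))  # not a CDS feature, so None
```

Keeping such hints under their own source key, rather than `E`, is what lets the extrinsic configuration weight them differently from hints derived from unassembled RNA-Seq reads.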

mictadlo commented 4 years ago

Thank you for your explanation. To be on the safe side, I will stick with BRAKER and RNA-Seq only.