Gaius-Augustus / BRAKER

BRAKER is a pipeline for fully automated prediction of protein coding gene structures with GeneMark-ES/ET/EP/ETP and AUGUSTUS in novel eukaryotic genomes

Empty filtered Augustus gff #233

Closed tkjmk closed 4 years ago

tkjmk commented 4 years ago

Hey! Thank you for developing BRAKER2.

I am trying to annotate the genome of a Drosophila species de novo. I have:

  1. A softmasked genome assembly
  2. RNA-seq evidence, aligned to the genome using STAR's 2-pass mode.
  3. OrthoDB evidence. I downloaded the arthropoda odb10 FASTA sequences.

I am essentially running test3, but with the datasets listed above.

I am using BRAKER v2.15.

Everything was running successfully up until joingenes, where I got a segmentation fault.

The input files are:

I decided to look into it further. I noticed that the augustus.Ppri5.gtf_filtered was empty. This is one of the files that goes into joingenes.
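As a minimal sketch of how such a case could be caught before it reaches joingenes, the snippet below checks whether any input file is missing or contains only blank lines. The function name `find_empty_inputs` is hypothetical and is not part of BRAKER; it only illustrates a pre-flight check on files such as augustus.Ppri5.gtf_filtered.

```python
from pathlib import Path


def find_empty_inputs(paths):
    """Return the paths that are missing or contain no non-blank lines.

    Hypothetical helper for illustration only; it is not part of BRAKER.
    """
    empty = []
    for p in map(Path, paths):
        if not p.is_file():
            # A missing file is treated the same as an empty one.
            empty.append(str(p))
            continue
        with p.open() as fh:
            # Stop reading as soon as one non-blank line is found.
            if not any(line.strip() for line in fh):
                empty.append(str(p))
    return empty
```

Calling `find_empty_inputs(["augustus.Ppri5.gtf_filtered"])` on the file described above would return that path in the list, which makes the empty input visible before a downstream tool fails on it.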

I then looked into filter_augustus_gff.pl to see why it would be empty. I looked at how braker.pl uses this script, and tested it myself. I got the same results as the script.

--in=augustus.Ppri5.gff --src=P gives an output of 12266 (line 9709 in braker.pl)

--in=augustus.Ppri5.gff --src=E gives an output of 12457 (line 9730 in braker.pl)

However, the step that generates augustus.Ppri5.gtf_filtered uses the .gtf file rather than the .gff file used above, and this gives an empty file. Running it myself also produces no output:

--in=augustus.Ppri5.gtf --src=P gives an output of 0
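One way to see how the .gff and .gtf inputs differ is to summarise each file: how many comment lines it has, how many feature lines, and which feature types appear. The sketch below does that for any iterable of lines. The function name `summarize_annotation` is hypothetical, and the idea that the filter depends on content present in only one of the two formats is an assumption, not something confirmed in this thread.

```python
def summarize_annotation(lines):
    """Count comment lines, feature lines and feature types (column 3).

    Hypothetical diagnostic helper; it does not reproduce the logic of
    filter_augustus_gff.pl.
    """
    comments = 0
    features = 0
    types = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        if line.startswith("#"):
            comments += 1
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # not a 9-column GFF/GTF feature line
        features += 1
        types[cols[2]] = types.get(cols[2], 0) + 1
    return {"comments": comments, "features": features, "types": types}
```

Passing an open file handle for augustus.Ppri5.gff and then for augustus.Ppri5.gtf and comparing the two summaries would show whether one file lacks line types that the other contains.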

I am opening this issue to ask if this is the expected behaviour of the script?

Everything seemed to run fine with the test data, so I am concerned that the problem is with my data.

Otherwise, my questions are: On lines 9761 and 9773 in braker.pl (the lines that call filter_augustus_gff.pl), is the .gtf file meant to be used, even though it gives me empty files, or should it be the .gff file, which gives me non-empty files?

When checking supported transcripts for file2, why does the script use file1 (augustus.Ppri5.gtf) rather than file2 (augustus.E.gff) itself? When I run filter_augustus_gff.pl with --in=augustus.E.gff --src=E, it gives an output of 12429.

I am currently re-running the software. I have not been able to recreate the segmentation fault myself, so I will see whether it happens again on the re-run. I am just concerned about how the commands in the braker.pl script and my data are behaving together.

tomasbruna commented 4 years ago

Hello @tkjmk,

there indeed seem to be errors, thanks for catching and reporting them.

It looks like lines 9442 and 9454 in braker.pl should be using $gff_file, in the same way as was fixed here: https://github.com/Gaius-Augustus/BRAKER/commit/859ea1f5459b89be545fc42a632f64acb8c3e51b.

The other error probably originates from an incorrect assignment on line 9409.

@KatharinaHoff, can you double-check if these are really errors? If so, I'll fix this and re-do related test files before creating the new BRAKER release.

Best, Tomas

tomasbruna commented 4 years ago

@KatharinaHoff, I fixed this issue in a new branch. Before merging it into master, I am evaluating how much it affects final prediction accuracy on 4 genomes (D. melanogaster, A. thaliana, C. elegans, and D. rerio).

tomasbruna commented 4 years ago

The effect of the fix on the results was quite small (within 0.3 percentage points in terms of gene accuracy) on the tested species (with species- and order-excluded proteins as input), but it's still good to have the error fixed. Thank you @tkjmk for reporting.