Gaius-Augustus / Augustus

Genome annotation with AUGUSTUS
http://bioinf.uni-greifswald.de/webaugustus/

Augustus PPX mode: augustInvalid block no. error #333

Open photocyte opened 2 years ago

photocyte commented 2 years ago

Hi there,

I've made a small prototype Nextflow pipeline to run Augustus in PPX mode with custom protein profiles, via Docker or Singularity: https://github.com/photocyte/luciferase-PPX-predictor-nf

(Augustus Docker image: quay.io/biocontainers/augustus:3.4.0--pl5321hd8b735c_3; see https://quay.io/repository/biocontainers/augustus?tab=tags)

I think I was able to make a good .prfl file from a custom MSA in FASTA format, but Augustus errors out when I try to use it:

genome_fasta=Ilumi1.3-grep13255.fasta 
prfl_file=elateroidea_luciferase_clade.msa.fa.prfl
augustus --species=fly --proteinprofile=${prfl_file} ${genome_fasta}
Command output:
  # This output was generated with AUGUSTUS (version 3.4.0).
  # AUGUSTUS is a gene prediction tool written by M. Stanke (mario.stanke@uni-greifswald.de),
  # O. Keller, S. König, L. Gerischer, L. Romoth and Katharina Hoff.
  # Please cite: Mario Stanke, Mark Diekhans, Robert Baertsch, David Haussler (2008),
  # Using native and syntenically mapped cDNA alignments to improve de novo gene finding
  # Bioinformatics 24: 637-644, doi 10.1093/bioinformatics/btn013
  # No extrinsic information on sequences given.
  # Sources of extrinsic information: M RM 
  # Initializing the parameters using config directory /usr/local/config/ ...
  # Using protein profile unknown
  # --[0..54]--> unknown_A (134) <--[2..4]--> unknown_B (168) <--[5..32]--> unknown_C (71) <--[0..7]--> unknown_D (91) <--[6..12]--
  # fly version. Using default transition matrix.
  # Looks like Ilumi1.3-grep13255.fasta is in fasta format.
  # We have hints for 0 sequences and for 0 of the sequences in the input set.
  #
  # ----- prediction on sequence number 1 (length = 366736, name = Ilumi1.3_Scaffold13255) -----
  #
  # Predicted genes for sequence number 1 on both strands

Command error:

  augustus: ERROR
  augustInvalid block no. in SubstateModel::blockNoOfB

The elateroidea_luciferase_clade.msa.fa.prfl file was made with this command:

msa_fasta=elateroidea_luciferase_clade.msa.fa
msa2prfl.pl --qij=/usr/local/config/profile/default.qij --prefix_from_seqnames --max_entropy=0.75 \
 ${msa_fasta} > ${msa_fasta}.prfl

Is there something wrong with the .prfl file I am creating? Relevant files attached: Archive.zip

LarsGab commented 2 years ago

Hi,

I was able to reproduce your error and your protein profile looks fine to me. The issue seems to be related to the UTR prediction of Augustus in combination with the protein profile mode. I recommend turning off UTR prediction, e.g.:

genome_fasta=Ilumi1.3-grep13255.fasta 
prfl_file=elateroidea_luciferase_clade.msa.fa.prfl
augustus --species=fly --proteinprofile=${prfl_file} --UTR=off ${genome_fasta}

If the UTRs are important to you, I can take a closer look at the code causing this bug, but this may take some time.

Best, Lars

photocyte commented 2 years ago

Thank you! I can confirm that adding --UTR=off works around the error. I imagine the UTR training might be limited to highly curated models like --species=fly, but shutting it off explicitly seems like good practice in case there are unexpected interactions.
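
For pipeline use, the --UTR=off workaround from this thread could be baked into a small helper so that PPX runs never pick up UTR prediction by accident. The following is only a minimal sketch: build_ppx_args is a hypothetical function name, not part of AUGUSTUS, and the only flags it emits are the ones already shown above (--species, --proteinprofile, --UTR=off).

```shell
# Sketch only: build_ppx_args is a hypothetical helper, not part of AUGUSTUS.
# It prints the arguments for a PPX run, one per line, and always
# includes --UTR=off (the workaround suggested in this thread).
build_ppx_args() {
  # $1 = species, $2 = protein profile (.prfl), $3 = genome FASTA
  printf '%s\n' "--species=$1" "--proteinprofile=$2" "--UTR=off" "$3"
}

# Example: show the arguments that would be passed to augustus.
build_ppx_args fly elateroidea_luciferase_clade.msa.fa.prfl Ilumi1.3-grep13255.fasta
```

If none of the file names contain whitespace, the result can be passed straight through, e.g. augustus $(build_ppx_args fly "${prfl_file}" "${genome_fasta}"). Keeping the flag in one place means it can be dropped later if the UTR/profile interaction is fixed upstream.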