Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

Error during miniprothint stage #20

Closed ASLeonard closed 1 year ago

ASLeonard commented 1 year ago

Hi, I've been unsuccessful in running GALBA (using earlier versions and v1.0.3). Steps seem to go okay until after the miniprot --aln step, at which point I get lots of errors like:

error: Invalid alignment header 
error: Unexpected number of columns in the header

This is coming from miniprot-boundary-scorer (cc @tomasbruna), because some header lines in the alignment file do not have 19 columns. If this happens so regularly and is an allowed output of miniprot, maybe the errors should be suppressed in GALBA, as it doesn't seem to be an "error"?

However, it did seem to hit a real error during miniprothint when getting start codons. I guess the output file was empty after grepping:

error: exited due to an error in command: grep start_codon .../GALBA/miniprot.gff > .../GALBA/tmp/startsAllf4annt19.gff
ERROR in file /opt/GALBA/scripts/galba.pl at line 3506
failed to execute: /opt/miniprothint/miniprothint.py ...GALBA/miniprot.gff --workdir .../GALBA --ignoreCoverage!

I replaced the sys.exit call with a pass and the command then finished, although the hc.gff file only had 323 lines. Is that expected for a 3 Gb mammalian genome?

Thanks, Alex
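For context on the failed grep step: grep exits with status 1 when no line matches, so a wrapper that treats any nonzero exit status as fatal will stop on an empty result even though nothing is technically broken. The sketch below shows one way to extract start codons without that failure mode. It is not miniprothint's code; the function name `extract_feature` is hypothetical, and unlike the original `grep start_codon` (which matches the string anywhere in a line) it compares only the GFF feature column.

```python
def extract_feature(gff_path, out_path, feature="start_codon"):
    """Copy GFF rows whose third column equals `feature`; return the row count.

    An empty result is reported as 0 instead of being raised as an error,
    so the caller can decide whether "no start codons" is acceptable.
    """
    kept = 0
    with open(gff_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # skip comment and directive lines
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 3 and cols[2] == feature:
                dst.write(line)
                kept += 1
    return kept
```

A caller could then log a warning when the returned count is 0, which would make a suspiciously small or corrupted input visible without aborting the whole run.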

tomasbruna commented 1 year ago

Hi @ASLeonard,

Thanks for reporting this issue. Most likely, all the problems with grep, etc. stem from the "Unexpected number of columns in the header" error. Can you please share a couple of proteins that produce this error? I'll patch the miniprot boundary scorer to fix this.

Thanks, Tomas

tomasbruna commented 1 year ago

In any case, the PAF specification mandates only 12 columns in the header, so I changed the code in https://github.com/tomasbruna/miniprot-boundary-scorer/commit/25b92407b0f3b8035743d8009be826c7d92c0432 to check for that.

However, the fact that your output did not have 19 fields could be a sign of something else going wrong, so I'd still be interested in seeing your input. Thanks!
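To illustrate the relaxed check, here is a minimal Python sketch of a header validator that requires the 12 mandatory PAF columns and treats anything beyond them as optional tags. It is not the actual miniprot-boundary-scorer code (which lives in the linked commit); the function name `split_paf_header` and the `##PAF` line prefix are assumptions made for this example.

```python
PAF_MANDATORY_COLUMNS = 12  # the PAF format defines 12 mandatory fields


def split_paf_header(line, prefix="##PAF\t"):
    """Return (mandatory_fields, optional_tags) for one alignment header line.

    The prefix is an assumption about how headers are marked in the
    alignment file; pass prefix="" for a plain PAF line.
    """
    if not line.startswith(prefix):
        raise ValueError("Invalid alignment header")
    fields = line[len(prefix):].rstrip("\n").split("\t")
    if len(fields) < PAF_MANDATORY_COLUMNS:
        raise ValueError("Unexpected number of columns in the header")
    return fields[:PAF_MANDATORY_COLUMNS], fields[PAF_MANDATORY_COLUMNS:]
```

With this shape, a header carrying only the 12 mandatory columns is accepted with an empty tag list, while a genuinely truncated line is still rejected.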

ASLeonard commented 1 year ago

I think it was primarily the 12-column output, which seems to be the default for unmapped proteins (since the -u flag is used). The other issue with the start codons seems to have gone away after deleting all files and downloading a clean set of proteins, so it may have just been a partially corrupted file somewhere.

tomasbruna commented 1 year ago

Right, I forgot GALBA is using the -u flag. The alignment headers without subsequent alignments might be causing issues; let me fix that in the parser. Thanks!
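The idea of tolerating headers that have no alignment after them can be sketched as follows. This is only an illustration in Python, not the parser fix itself; the function name `iter_alignments`, the `##PAF` header prefix, and the assumption that alignment rows are the other `##`-prefixed lines following a header are all hypothetical choices for this example.

```python
def iter_alignments(lines, header_prefix="##PAF"):
    """Yield (header, body_lines) pairs, skipping headers with no body.

    A header that is immediately followed by another header (or by the end
    of the input) is treated as an unmapped protein and silently dropped.
    """
    header, body = None, []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(header_prefix):
            if header is not None and body:
                yield header, body
            header, body = line, []
        elif header is not None and line.startswith("##"):
            body.append(line)  # assumed alignment row belonging to `header`
        # any other line (e.g. a GFF record) is ignored in this sketch
    if header is not None and body:
        yield header, body
```

Because the function is a generator over any iterable of lines, it can be applied directly to an open file handle without loading the whole alignment file into memory.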

tomasbruna commented 1 year ago

Fixed in https://github.com/tomasbruna/miniprot-boundary-scorer/commit/c4d300a53203ebf0aa1e8a60680a1e368f58dac9. Again, thanks for reporting.