Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

GALBA is not compatible with GFF and FAA(FASTA) from NCBI Refseq #43

Closed jyw-atgithub closed 6 months ago

jyw-atgithub commented 7 months ago

Hello, I downloaded the RefSeq annotation from NCBI (GCF_016920845.1) and ran GALBA with Singularity. At first, GALBA kept reporting the following error.

# Wed Jan 17 22:02:58 2024: Checking if input file /home/jenyuw/Fish-project/reference/GCF_016920845/genomic.gff is in gff format
# Wed Jan 17 22:02:58 2024 ERROR: in file /opt/GALBA/scripts/galba.pl at line 2857
File /home/jenyuw/Fish-project/reference/GCF_016920845/genomic.gff is not in gff format at line 1!

After some troubleshooting, I found that GALBA does not accept header lines or "#" comment lines. Is this expected? I could not find this documented in the manual.

Second, should I ignore or fix the following warnings?

# WARNING: Detected whitespace in fasta header of file /home/jenyuw/Fish-project/reference/GCF_016920845/protein.faa. This may later on cause problems! The pipeline will create a new file without spaces or "|" characters and a genome_header.map file to look up the old and new headers. This message will be suppressed from now on!
#*********
Warning: score for (g,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (g,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (g,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (g,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (c,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (c,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (v,u) is not defined in the matrix. Returning -4 instead.
Warning: score for (c,u) is not defined in the matrix. Returning -4 instead.
warning: Coverage appears to be high, --ignoreCoverage flag will be ignored

The command was:
singularity exec /home/jenyuw/Software/galba.sif galba.pl --genome=${final_genome}/C01_final.fasta --species=Phytichthys_chirus --prot_seq=${ref}/GCF_016920845/protein.faa --hints=${ref}/GCF_016920845/genomic2.gff --workingdir=${annotation} --threads 30 --crf

The environment is: singularity-ce version 3.11.0-jammy, GALBA v1.0.11, Ubuntu 22.04.2 LTS

Thank you!

KatharinaHoff commented 6 months ago

I would not expect GALBA to accept a GFF file containing an annotation as input. GALBA is supposed to produce a new annotation (in GTF or GFF3 format), not take one as input. It is true that we often do not screen for lines starting with "#", but I do not see how that is relevant here, because you should not run GALBA with GFF input from NCBI.

You may provide additional hints to AUGUSTUS in GFF format. However, you would need to postprocess the NCBI annotation for that either way, because the AUGUSTUS hints format is its own flavor of GFF: it expects exact types in the third column and specific entries in the last column, and the hints file must be compatible with the extrinsic.cfg file. I'd advise against it if you do not know what you're doing. You can read about hints files for AUGUSTUS e.g. in https://math-inf.uni-greifswald.de/storages/uni-greifswald/fakultaet/mnf/mathinf/stanke/augustus_wrp.pdf (or in the AUGUSTUS tutorials).
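To illustrate why an unmodified NCBI GFF does not work as a hints file, here is a rough Python sketch that checks one line against the hints conventions. It is not part of GALBA or AUGUSTUS. The set of hint types and the `src=`/`source=` key are taken from my reading of the AUGUSTUS documentation and should be treated as assumptions to verify against your AUGUSTUS version; `check_hint_line` is a made-up helper name. The sketch does not check compatibility with extrinsic.cfg.

```python
# Hint feature types as I recall them from the AUGUSTUS documentation.
# ASSUMPTION: verify this list against the docs of your AUGUSTUS version.
HINT_TYPES = {
    "start", "stop", "tss", "tts", "ass", "dss",
    "exonpart", "exon", "intronpart", "intron",
    "CDSpart", "CDS", "UTRpart", "UTR",
    "irpart", "nonexonpart", "genicpart",
}


def check_hint_line(line):
    """Return a list of problems found in one line of a candidate hints file.

    An empty list means the line passed these (incomplete) checks.
    """
    if line.startswith("#") or not line.strip():
        return ["comment or empty line"]

    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-separated columns, found {len(cols)}"]

    problems = []
    if cols[2] not in HINT_TYPES:
        problems.append(f"type '{cols[2]}' in column 3 is not a hint type")

    # Column 9 is a ';'-separated list of key=value pairs.
    keys = {kv.split("=", 1)[0].strip() for kv in cols[8].split(";") if "=" in kv}
    # ASSUMPTION: hints name their source via src= (or source=).
    if "src" not in keys and "source" not in keys:
        problems.append("no src= (or source=) key in column 9")
    return problems
```

A typical NCBI line such as an `mRNA` feature with `ID=...;Parent=...` attributes fails both checks, which is why postprocessing would be needed before passing such a file to `--hints`.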

The whitespace warning can be avoided by removing the whitespace from the FASTA headers beforehand, but it's not dangerous.
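If you prefer to clean the headers yourself, a minimal Python sketch could look like the following. It assumes that the first whitespace-delimited token of each header (the accession in RefSeq files) is unique, and it replaces "|" with "_" because the warning mentions both. The function names and the replacement character are my own choices, not what GALBA does internally.

```python
def clean_fasta_lines(lines):
    """Yield FASTA lines with headers cut at the first whitespace.

    ASSUMPTION: the first token of every header is unique in the file.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            # '|' is replaced as well, since the warning mentions it.
            yield ">" + name.replace("|", "_") + "\n"
        else:
            yield line


def clean_fasta(in_path, out_path):
    """Write a copy of in_path with cleaned headers to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(clean_fasta_lines(src))
```

For example, `clean_fasta("protein.faa", "protein.clean.faa")` would write a new file and leave the original untouched; the output file name here is only a placeholder.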

Accuracy will be higher if you use more protein donors. The coverage warning basically says that your input is not optimal. If you have no more proteins, use what you have and ignore the warning.

I would advise against the --crf flag unless you already have the HMM training results and know how to compare the two to make a decision. CRF training is not always better and increases training time a lot.

The matrix warning affects only a few proteins in your data set. It is probably safe to ignore.