Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

Training file does not contain any training genes #29

Closed. mabh5 closed this issue 1 year ago

mabh5 commented 1 year ago

Hello,

I am running into issues attempting to run GALBA with protein sequences. I have tried a few variations of the command line and gotten different errors, but they all seem to involve the creation of training genes. The first attempt looked like this:

galba.pl --prg=gth --genome=corrected_scaffoldsmasked.fasta --prot_seq=rna.fasta

which got me the following error:

Training gene file in genbank format /mnt/e/GTCCSCombined/ragtag_output/Annotation/GALBA/train.ff.gb does not contain any training genes. At this stage, we performed a filtering step that discarded all genes that lead to etraining errors. If you lost all training genes, now, that means you probably have an extremely fragmented assembly where all training genes are incomplete, or similar.

I know that I do not have a fragmented assembly, as I have been doing some annotation by hand and have had no issue finding the genes or identifying the introns, exons, or UTRs. In addition, I have fewer than 500 scaffolds, and the BUSCO results show over 95% completeness.

I next attempted to add in species information, which got me this error:

ERROR in file /home/mhudgell/GALBA/scripts/galba.pl at line 3926
Failed to execute: /usr/bin/etraining --species=Og --AUGUSTUS_CONFIG_PATH=/home/mh/.augustus /mnt/e/GTCCSCombined/ragtag_output/Annotation/GALBA/train.gb 1> /mnt/e/GTCCSCombined/ragtag_output/Annotation/GALBA/gbFilterEtraining.stdout 2>/mnt/e/GilaTroutCCSCombined/ragtag_outputArley/Annotation/GALBA/errors/gbFilterEtraining.stderr
(GALBA) mh@DESKTOP-6JI8A13:/mnt/e/GilaTroutCCSCombined/ragtag_outputArley/Annotation$

When I looked into the GALBA.txt file to see what the issue might be, I also found the following:

/mnt/e/GTCCSCombined/ragtag_output/Annotation/GALBA/train.gb contains 1 genes.
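One way to double-check the gene count reported in the log is to count the records in the training file directly. The following is a minimal sketch, assuming that each training gene is stored as one GenBank record beginning with a `LOCUS` line, as in the GenBank-format training files used with AUGUSTUS. The function name `count_genbank_records` is hypothetical and not part of GALBA.

```python
def count_genbank_records(lines):
    """Count GenBank records by counting lines that start with 'LOCUS'.

    Assumption: one record (one LOCUS line) per training gene.
    `lines` can be an open file handle or any iterable of strings.
    """
    return sum(1 for line in lines if line.startswith("LOCUS"))
```

Calling it on an open handle, for example `count_genbank_records(open("train.gb"))`, should agree with the number GALBA writes to its log if the assumption above holds.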

It is worth noting that the rna.fasta file I am using contains the RNA sequences downloaded from the annotated genome of a very closely related species. The file itself has over 1 million sequences. I am unsure why, when GALBA gets to this training point, it says that there is only 1 gene.

Help is appreciated!

KatharinaHoff commented 1 year ago

Please run GALBA with miniprot instead of GenomeThreader. It is so much more accurate.

(We are not fixing GenomeThreader-related bugs, and we are not testing whether all current functions work with GenomeThreader. If you have to use GenomeThreader, go back to the oldest version of GALBA; it is the one that likely works best with GenomeThreader.)

tomasbruna commented 1 year ago

Hi @mabh5,

It is worth noting that the rna.fasta file I am using is the rna sequences downloaded from the annotated genome of a very closely related species. The file itself has over 1 million sequences.

Just to make sure -- are these rna sequences translated to proteins?

mabh5 commented 1 year ago

Hi both!

Thanks for the responses. As per @KatharinaHoff's suggestion, I am attempting to run this again using miniprot. I am also running it with the OrthoDB vertebrate sequences rather than those of the closely related species, just to see if I can get this to run at all. It is currently running, so I will let you know if I continue to get errors.

@tomasbruna the rna.fasta file contains nucleotide sequences. I do have the translated amino acid sequences on hand, though. Is that what I should be using for this? I also wanted to ask: does this pipeline require more than one sequence per gene? That could be another issue I am running into with the sequences I am using.

tomasbruna commented 1 year ago

Yes, please use the amino acid protein sequences. GALBA does not work with nucleotide sequences as input.

The expected input is a proteome of one or more genomes - one sequence per gene is fine.

Please take a look at our recently published preprint, https://www.biorxiv.org/content/10.1101/2023.04.10.536199v1. It should answer your questions in detail.
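Since the input has to be protein rather than nucleotide sequence, a quick sanity check on a FASTA file before starting a run may save time. The following is a rough sketch and not part of GALBA: the function name `guess_fasta_alphabet`, the 0.9 threshold, and the sampling limit are all illustrative assumptions. It guesses the alphabet from the fraction of residues that are A, C, G, T, U, or N.

```python
from collections import Counter

# Letters expected to dominate a nucleotide FASTA file.
NUCLEOTIDE_CHARS = set("ACGTUN")


def guess_fasta_alphabet(lines, max_residues=100_000, threshold=0.9):
    """Return 'nucleotide' or 'protein' based on residue composition.

    `lines` can be an open file handle or any iterable of strings.
    The threshold is a heuristic assumption, not a GALBA setting.
    """
    counts = Counter()
    total = 0
    for line in lines:
        line = line.strip()
        # Skip blank lines and FASTA header lines.
        if not line or line.startswith(">"):
            continue
        # Ignore stop-codon markers that some protein files contain.
        seq = line.upper().replace("*", "")
        counts.update(seq)
        total += len(seq)
        # Sampling the start of the file is enough for a rough guess.
        if total >= max_residues:
            break
    if total == 0:
        raise ValueError("no sequence data found")
    nucleotide_fraction = sum(counts[c] for c in NUCLEOTIDE_CHARS) / total
    return "nucleotide" if nucleotide_fraction >= threshold else "protein"
```

For a file such as the rna.fasta discussed in this thread, `guess_fasta_alphabet(open("rna.fasta"))` would be expected to report nucleotide input, which is a sign that the translated amino acid sequences should be supplied instead. Very short or low-complexity protein files could fool the heuristic, so treat the result as a hint only.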

KatharinaHoff commented 1 year ago

As already pointed out, GALBA at this point in time does not work with transcriptome data. Please use BRAKER if you need to process transcriptome data.