Gaius-Augustus / GALBA

GALBA is a pipeline for fully automated prediction of protein coding gene structures with AUGUSTUS in novel eukaryotic genomes for the scenario where high quality proteins from one or several closely related species are available.

gbFilterEtraining error #33

Closed marcelauliano closed 11 months ago

marcelauliano commented 1 year ago

Hi all, thank you for GALBA. I've been running it with the Singularity container, and it worked perfectly on the test data.

I have a mammalian genome without RNA-seq data. I'm running GALBA with the Metazoa proteins from OrthoDB. Everything ran fine until the etraining step, where it stopped at:

/usr/bin//etraining --species=1tamandua --AUGUSTUS_CONFIG_PATH=/.augustus GALBA/train.gb 1> /GALBA/gbFilterEtraining.stdout 2>/GALBA/errors/gbFilterEtraining.stderr

with this error message in errors/gbFilterEtraining.stderr:

GBProcessor::getGeneList(): Could not read the following line in Genbank file.
gatgtctgct
Maximum line length is 
39998

Encountered error after reading 55613 annotations.
terminate called after throwing an instance of 'std::bad_alloc'
  what():  std::bad_alloc
Aborted

I doubt this is really a memory issue, as my Slurm job report shows the run did not reach the peak memory I requested. The issue is probably more related to the message "GBProcessor::getGeneList(): Could not read the following line in Genbank file. gatgtctgct". What do you think?
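One way to test this suspicion, as a minimal sketch only: scan train.gb for lines longer than the maximum reported in the error message (39998) and count the GenBank records. The helper names `find_long_lines` and `count_records` are hypothetical and not part of GALBA or AUGUSTUS, and an over-long line is only an assumption about the cause, not a confirmed diagnosis.

```python
MAX_LINE_LEN = 39998  # value taken from the etraining error message


def find_long_lines(path, max_len=MAX_LINE_LEN):
    """Return (line_number, length) for every line longer than max_len."""
    hits = []
    with open(path, "r", errors="replace") as handle:
        for number, line in enumerate(handle, start=1):
            length = len(line.rstrip("\n"))
            if length > max_len:
                hits.append((number, length))
    return hits


def count_records(path):
    """Count GenBank records by their LOCUS header lines."""
    with open(path, "r", errors="replace") as handle:
        return sum(1 for line in handle if line.startswith("LOCUS"))
```

Calling `find_long_lines("train.gb")` would list any offending lines, and `count_records("train.gb")` gives a record count to compare against the "55613 annotations" mentioned in the error.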

In my directory I have a train.gb file (3.3 GB) and other outputs such as hintsfile.gff and traingenes.gff.

What should I do now? Do you have any idea what is going on? And if I solve it, is there a way to restart the pipeline from where it stopped?

Thank you so much!

KatharinaHoff commented 1 year ago

The train.gb file sounds oddly large.

I am wondering why you would annotate a mammalian genome with the metazoa OrthoDB partition. GALBA is not made for running with OrthoDB, and metazoa is not the closest partition to mammals. You probably have weird alignments due to the large phylogenetic distance.

I recommend that you use a different protein input. For mammals, one can e.g. use the proteins of human, mouse, rat, chimp. There are plenty of high quality mammalian proteomes at NCBI Genomes (now moved to Taxonomy). Pick 4 - 10 such proteomes, and re-run the job.
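As a rough sketch of the preparation step, the downloaded proteomes could be merged into a single protein FASTA file before the re-run. The function name `merge_proteomes` and the file names in the usage note are hypothetical; the script only assumes plain FASTA input and skips entries whose sequence ID was already seen in an earlier file.

```python
def merge_proteomes(fasta_paths, out_path):
    """Concatenate FASTA files into out_path, keeping the first entry per ID.

    Returns the number of protein entries written.
    """
    seen = set()
    written = 0
    with open(out_path, "w") as out:
        for path in fasta_paths:
            keep = False
            with open(path, "r") as handle:
                for line in handle:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        seq_id = fields[0] if fields else ""
                        # Keep only the first entry seen for each ID.
                        keep = seq_id not in seen
                        if keep:
                            seen.add(seq_id)
                            written += 1
                    if keep:
                        out.write(line if line.endswith("\n") else line + "\n")
    return written
```

For example, `merge_proteomes(["human.faa", "mouse.faa", "rat.faa", "chimp.faa"], "proteins.fa")` would produce one combined file to use as the protein input.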

Best,

Katharina

KatharinaHoff commented 11 months ago

I still think that it is a bad idea to annotate a mammal with the metazoa OrthoDB partition. However, the latest release of GALBA should make this approach a little bit safer. I will close here, because the original issue was caused by an unsuitable input.