cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0

microbeannotator hangs when parsing HMM profile #40

Open aalarkin opened 2 years ago

aalarkin commented 2 years ago

Hello,

I am trying to annotate a file of protein calls (i.e., a Prodigal metagenome output file) using microbeannotator. My file is 750M with 2,000,000 reads and an average sequence length of 275 bp. I ran the Anaconda install of microbeannotator on a high performance cluster with 1 node, 20 cores, and 4G memory per core using the following command:

microbeannotator -i protein_rep_seq.fasta -d MicrobeAnnotator_DB -o microbeannotator-out -m blast -p 1 -t 10 --refine

However, I manually killed the process after microbeannotator spent 20 hours on the "Parsing HMM profile metadata" step. Here is the command line output:

"2021-12-14 16:21:56,551 [INFO]: ---- This is MicrobeAnnotator v2.0.5 ----
2021-12-14 16:21:56,551 [INFO]: Validating user inputs
2021-12-14 16:21:56,553 [INFO]: Passed
2021-12-14 16:21:56,553 [INFO]: Processing 1 files. I will run 1 files in parallel with 20 threads per file.
2021-12-14 16:22:02,046 [INFO]: Searching proteins against KOfam profiles
2021-12-14 16:22:02,047 [INFO]: Parsing HMM profile metadata
slurmstepd: error: JOB 9359923 ON hpc3-22-03 CANCELLED AT 2021-12-15T12:51:18"

Is this a typical amount of time for microbeannotator to spend on this step? And is there anything I can do to speed it up? Preferably, I would like to run the full version of microbeannotator rather than the light version, but that will not be possible at this computational speed.

Thank you for your assistance!

Alyse

rotheconrad commented 2 years ago

You can split up your larger protein fasta file and run multiple instances of MicrobeAnnotator. It may also be more efficient to use fewer threads per file and run more files in parallel, such as 4 threads per file and 5 files in parallel.
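As a rough illustration of the splitting step, here is a minimal Python sketch that distributes FASTA records round-robin across a fixed number of chunk files. The function name `split_fasta`, the `chunk_NNN.fasta` naming, and the choice of round-robin distribution are assumptions made for this example; nothing here is part of MicrobeAnnotator itself.

```python
from pathlib import Path


def split_fasta(fasta_path, out_dir, n_chunks):
    """Distribute FASTA records round-robin across n_chunks files.

    Hypothetical helper for illustration; not part of MicrobeAnnotator.
    Returns the list of chunk file paths that were written.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk_paths = [out_dir / f"chunk_{i:03d}.fasta" for i in range(n_chunks)]
    handles = [path.open("w") for path in chunk_paths]

    record_index = -1  # becomes 0 at the first header line
    try:
        with open(fasta_path) as fasta:
            for line in fasta:
                if line.startswith(">"):
                    # A header starts a new record, so move to the next chunk.
                    record_index += 1
                if record_index >= 0:
                    # Header and sequence lines of one record stay together.
                    handles[record_index % n_chunks].write(line)
    finally:
        for handle in handles:
            handle.close()
    return chunk_paths
```

For example, `split_fasta("protein_rep_seq.fasta", "chunks", 5)` would write five files of roughly equal record count, each of which could then be given to its own run. Round-robin is used here only because it needs a single pass and no prior record count; a good chunk size for this dataset is not established in this thread.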

Blast will also be very slow. I recommend using Diamond instead.

Neither the number of reads nor the read length should matter for microbeannotator because it is working with the predicted protein sequences from Prodigal.

aalarkin commented 2 years ago

Hello, thank you for getting back to me so quickly!

In terms of splitting up the protein fasta file, do you have a recommended file size either in terms of MB or read count?

rotheconrad commented 2 years ago

I've only used MicrobeAnnotator on MAGs so far, which usually have between 2000 and 5000 proteins. I think I'm only using 2 threads per genome and they take about 3 to 4 hours each. I'm getting ready to try it with some full metagenomes in the next couple of days, though I also have holiday vacation coming up. If you don't figure it out I can let you know how I managed after the holidays.
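To put those figures next to the dataset from the original post, here is a back-of-the-envelope calculation in Python. It assumes, purely for illustration, that runtime scales linearly with the number of proteins and that jobs of 2 threads each can fill all 20 cores; neither assumption is confirmed anywhere in this thread, and the result says nothing about whether the "Parsing HMM profile metadata" step itself is stuck.

```python
# Figures quoted in this thread (not new measurements).
total_proteins = 2_000_000        # sequences reported in the original post
proteins_per_mag = (2_000, 5_000)  # typical MAG size range
hours_per_mag = (3, 4)             # reported runtime per MAG
threads_per_job = 2
available_cores = 20

# ASSUMPTION: runtime scales linearly with protein count, and
# independent 2-thread jobs can occupy all available cores.
concurrent_jobs = available_cores // threads_per_job

# Most optimistic: largest MAGs finishing in the shortest time.
best_case_job_hours = total_proteins / proteins_per_mag[1] * hours_per_mag[0]
# Most pessimistic: smallest MAGs taking the longest time.
worst_case_job_hours = total_proteins / proteins_per_mag[0] * hours_per_mag[1]

best_case_wall_hours = best_case_job_hours / concurrent_jobs
worst_case_wall_hours = worst_case_job_hours / concurrent_jobs

print(f"{best_case_wall_hours:.0f} to {worst_case_wall_hours:.0f} hours")
# prints "120 to 400 hours"
```

Under these assumptions a single 20-core node would need on the order of days, which is consistent with the advice here to split the input and reduce the search cost.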

Another thought is to do some clustering prior to annotation. You can use CD-HIT or mmseqs2 to cluster your protein sequences by similarity and then you could annotate the representatives.

aalarkin commented 2 years ago

Gotcha! Good luck clustering over the holidays!

These are actually already the representative protein sequences. I used USEARCH to get non-redundant reads from the Prodigal output and then mmseqs2 to cluster the non-redundant reads into representative sequences. They also represent only 4 of the roughly 300 metagenomes that I eventually plan to process, so optimizing computation time will be key. ;)

Happy to post an update if I get it going in a reasonable amount of time.