algrgr opened this issue 1 year ago
Hi @algrgr ,
How many proteins are in your proteome.fasta file?
Maybe the issue is just that it didn't finish.
Best, Carlos
Hello @Cantalapiedra ,
Thank you for your reply! Indeed, I sampled 43 sequences from the entire proteome (~11000) and it finished in about 1 hour. I guess I have to drastically lower my expectations about the timing of these runs. Although my questions would still be:
Thanks for developing the tool and cheers, alex
Hi @algrgr ,
Performance depends on many things, including hardware and data access and manipulation by the software. Depending on your hardware, 1 hour for 43 sequences could be entirely normal, given that you are running the whole standard emapper pipeline (diamond search + annotation).
Regarding the diamond search stage, note that it won't scale linearly, because diamond needs to set things up (e.g. loading the eggnog db into memory), which is required whether you search 1 sequence or 1 million. Note, however, that diamond has some parameters to optimize searches for small queries (though that is probably not important here). So the 40 threads may not make much difference for a few queries, and will have more impact the more queries you have. Speed will also depend on the specific CPUs you have (we notice a lot of difference between servers/clusters depending on their hardware).
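To see how much of the wall time goes to the search stage alone, one option is to time a search-only run (a sketch: `--no_annot` is emapper's own flag for skipping annotation, while the input file and output prefix below are placeholders):

```shell
# Time the diamond search stage in isolation (annotation skipped).
# proteome.fasta and the "timing_test" prefix are placeholders.
/usr/bin/time -v emapper.py -m diamond --no_annot \
    -i proteome.fasta -o timing_test --cpu 40
```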
Regarding the annotation stage, this can be very slow depending on where you store the eggnog.db (a sqlite3 DB which eggnog-mapper uses to transfer eggnog annotations to your queries). Using more CPUs helps here, but not much if the bottleneck is the read speed of the DB (which by default is read from disk). That is why there is a --dbmem option to load the annotation DB into memory, which speeds up annotation a lot. However, --dbmem requires ~44GB of RAM to be able to load the eggnog.db into memory.
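Before enabling --dbmem it is worth checking that the machine actually has that much free memory. On Linux, a quick check with the standard free utility looks like:

```shell
# Report available memory in GB; --dbmem needs roughly 44 GB free.
free -g | awk '/^Mem:/ {print $7 " GB available"}'
```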
I would say that the main difference from the web server is that there we keep the DBs loaded in memory (in that case it is not done using --dbmem, but the outcome is the same).
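As a concrete sketch (the input file, output directory, prefix, and thread count are placeholders from earlier in this thread), a full run with the annotation DB held in RAM would look like:

```shell
# Standard pipeline, with eggnog.db loaded into memory (~44 GB RAM needed).
emapper.py -m diamond -i proteome.fasta --output_dir ./eggnog \
    -o strain1 --cpu 40 --dbmem
```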
I am sure there will be other factors, but I am not an expert on performance.
I hope this is of help.
Best, Carlos
@Cantalapiedra , thank you for replying so quickly and extensively! I should be able to reserve 64GB RAM for my sessions. I'll try --dbmem to see if that speeds things up.
Would it be an idea to implement somewhat more verbose output during the run? Major stages, like loading the DB into memory, and a counter of how many queries have been processed so far. That would keep users more aware of what is going on at any moment. Just a suggestion.
cheers, alex
Hi @algrgr ,
Using --dbmem will definitely help accelerate the annotation stage.
Note that you could also separate both stages:
1. `--no_annot`. It will produce the "emapper.seed_orthologs" file.
2. `-m no_search --annotate_hits_table emapper.seed_orthologs`. It will produce the "emapper.annotations" file.
More info here: https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10
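Put together, the split pipeline would look roughly like the sketch below (the input file and the "strain1" prefix are placeholders; adding --dbmem to the second stage is optional and assumes enough RAM):

```shell
# Stage 1: diamond search only; writes strain1.emapper.seed_orthologs
emapper.py -m diamond --no_annot -i proteome.fasta -o strain1 --cpu 40

# Stage 2: annotate the hits table; writes strain1.emapper.annotations
emapper.py -m no_search --annotate_hits_table strain1.emapper.seed_orthologs \
    -o strain1 --dbmem --cpu 40
```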
Thank you for the suggestion. Yes, that is an aspect we could improve. Note, however, that for stages run by external tools (like diamond) we depend on their output, rather than our own.
Best, Carlos
@Cantalapiedra Great! I actually just noticed in the output that annotation of those 43 proteins technically took just 35 seconds (out of 1 hour total):
```
## 43 queries scanned
## Total time (seconds): 34.86780405044556
## Rate: 1.23 q/s
```
So, 99% of time is handling the databases then?
cheers, alex
Hi @algrgr ,
I would say that most of the time went to the diamond search, then. I am not sure, but maybe you can also check the timestamps of the .hits and .seed_orthologs files.
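One way to check that (assuming the default output names produced with -o strain1; on Linux, GNU stat) is to compare the modification times of the intermediate files, since each one is finished at the end of its stage:

```shell
# Modification times show roughly when each stage finished writing its file.
stat -c '%y %n' strain1.emapper.hits \
                strain1.emapper.seed_orthologs \
                strain1.emapper.annotations
```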
Hello,
I seem to have a problem running emapper.py (installed via conda). I downloaded the databases using download_eggnog_data.py, accepting the default downloads, and I have a proteome FASTA file. I'm running this command:
emapper.py -m diamond -i ~/proteome.fasta --output_dir ~/eggnog/ -o strain1 --cpu 40 --override
After I hit enter, this line appears in yellow on the screen:
~/minoconda3/envs/eggnog/bin/diamond blastp -d '~/minoconda3/envs/eggnog/lib/python3.11/site-packages/data/eggnog_proteins.dmnd' -q '~/proteome.fasta' --threads 40 -o '~/eggnog/strain1.emapper.hits' --tmpdir '~/emappertmp_dmdn_mk3quh_g' --sensitive --iterate -e 0.001 --top 3 --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp
And then nothing seems to happen. The .hits file and the temp folder are indeed created, but nothing is written to them. I've been waiting for an hour or so.
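One way to tell whether diamond is actually working rather than hung (the output path below is a placeholder matching the command above) is to watch the output file and check that diamond processes are consuming CPU:

```shell
# Re-list the hits file every 30 s to see whether it grows over time.
watch -n 30 ls -lh ~/eggnog/strain1.emapper.hits

# In another terminal: a busy diamond run shows high CPU usage here.
top -b -n 1 | grep diamond
```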
emapper.py -v
emapper-2.1.10 / Expected eggNOG DB version: 5.0.2 / Installed eggNOG DB version: 5.0.2 / Diamond version found: diamond version 2.1.6 / MMseqs2 version found: 14.7e284
Am I missing something obvious?
thanks in advance, alex