Optimized bacterial database error

davidvilanova commented 7 years ago

Hi, while running a set of proteins in default mode against the optimized bacterial (HMM) database, I get an error:

...
Reading idmap /home/david/work/sources/eggnog-mapper/data/hmmdb_levels/bact_50/bact_50.hmm.idmap
159207 names loaded
Sequence mapping starts now!
Processed queries:1927 total_time:942.721111059 rate:2.04 q/s
refined hits not available for custom hmm databases.
Reading HMM matches
Functional annotation of refined hits starts now
error

It seems it is trying to refine hits, but those are not available for custom databases? The database I used is the optimized bacterial one downloaded with the download script. The annotations file does not contain any annotations. ...

davidvilanova commented 7 years ago

This could be a memory leak while unloading the database from memory (I used the --usemem flag), and I am running under an SGE queuing system.

ealdraed commented 7 years ago

Hi @davidvilanova !

It would help if you could post the exact command you used to run emapper.py. You can of course redact sensitive file names if required.

It looks like you asked for a custom db, although you say you ran the optimized bacterial db. Did you pass -d bact as the database argument? If you ran -d bact_50, I think this could have caused the error. It "slips through" line 76 (https://github.com/jhcepas/eggnog-mapper/blob/c436da5779b333531038eacfaa0a0d4255696544/emapper.py#L76) but is not recognized in line 226 (https://github.com/jhcepas/eggnog-mapper/blob/c436da5779b333531038eacfaa0a0d4255696544/emapper.py#L226).
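To illustrate the point above, here is a minimal sketch of how a short database name and a path to the same files could end up on different code paths. It does not reproduce the code at the linked lines: `KNOWN_DATABASES`, `classify_database` and `refinement_available` are hypothetical names, and the set contains only the two short names that appear in this thread.

```python
# Hypothetical set of predefined database names. Only names mentioned in this
# thread are listed; the real list lives in the eggnog-mapper sources.
KNOWN_DATABASES = {"bact", "bactNOG"}


def classify_database(db_arg):
    """Return 'predefined' for a known short name, 'custom' for anything else."""
    if db_arg in KNOWN_DATABASES:
        return "predefined"
    # A folder name or a full path is not in the set, so it is treated as custom,
    # even if it points at the files of a predefined database.
    return "custom"


def refinement_available(db_arg):
    """In this sketch, hit refinement is only offered for predefined databases."""
    return classify_database(db_arg) == "predefined"


if __name__ == "__main__":
    candidates = (
        "bact",
        "bact_50",
        "/home/david/work/sources/eggnog-mapper/data/hmmdb_levels/bact_50/bact_50.hmm",
    )
    for arg in candidates:
        print(arg, "->", classify_database(arg),
              "| refinement:", refinement_available(arg))
```

Under these assumptions, only the short name reaches the refinement step, which would match the "refined hits not available for custom hmm databases" message in the first log.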

What version of the software did you run?

davidvilanova commented 7 years ago

I'm running emapper from outside its folder; the emapper.py file is in my PATH. I'm using the absolute path to the optimized bact_50 folder, because otherwise it cannot be found.

echo "emapper.py --database /home/david/work/sources/eggnog-mapper/data/hmmdb_levels/bact_50/bact_50.hmm --cpu 10 --usemem --output_dir test -o output_dir -i seq.faa " | qsub -pe parallel_smp 10 -l h_vmem=10G

davidvilanova commented 7 years ago

I have re-run it this way, which looks much better (on the cluster with 10 CPUs allocated). It looks like this relies on Python multiprocessing (see pool.py in the error log below). I'm running Python 2.7.12. Maybe I do not understand how to run it properly. Since I'm using a Linux cluster, I use the SGE system (-pe parallel_smp 10 -l h_vmem=10G per core).

emapper.py --d bact --cpu 10 --usemem --output_dir outputdir -o out -i seq.faa

#  emapper-0.12.7-8-gc436da5
# ./emapper.py  -d bact -i seq.faa --output_dir outputdir -o out --cpu 10 --usemem
Loading server at localhost, port 51500-51501
Loading server at localhost, port 51500-51501
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Reading idmap /work/dvilanova/david/sources/eggnog-mapper/data/hmmdb_levels/bact_50/bact_50.hmm.idmap
159207 names loaded
Sequence mapping starts now!
 Processed queries:30 total_time:9.13830184937 rate:3.28 q/s
Hit refinement starts now

And in the error log file I get:

26 7.50618433952 3.46 q/s
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Traceback (most recent call last):
  File "./emapper.py", line 1080, in <module>
    main(args)
  File "./emapper.py", line 227, in main
    refine_matches(args.input, seed_orthologs_file, hmm_hits_file, args)
  File "./emapper.py", line 510, in refine_matches
    base_tempdir=args.temp_dir)):
  File "./emapper.py", line 572, in process_nog_hits_file
    for r in pool.imap(search.refine_hit, cmds):
  File "/work/dvilanova/miniconda2/lib/python2.7/multiprocessing/pool.py", line 668, in next
    raise value
ValueError: Error running PHMMER
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
thread creation failed
Fatal exception (source file ../../easel/esl_threads.c, line 129):
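The "thread creation failed" messages above come from Easel, the library underlying HMMER, and are reported when a worker thread cannot be started. One possible cause, which is an assumption and not something confirmed in this thread, is that the limits applied inside the queued job (for example through h_vmem) are stricter than those of an interactive shell. A minimal sketch, using only Python's standard `resource` module, that prints the limits most likely to matter when submitted as a job:

```python
import resource


def describe_limit(value):
    """Render an rlimit value, mapping RLIM_INFINITY to 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def collect_limits():
    """Return soft/hard limits that commonly affect thread creation on Linux."""
    names = {
        "max processes/threads (RLIMIT_NPROC)": resource.RLIMIT_NPROC,
        "virtual memory in bytes (RLIMIT_AS)": resource.RLIMIT_AS,
        "stack size in bytes (RLIMIT_STACK)": resource.RLIMIT_STACK,
    }
    report = {}
    for label, limit_id in names.items():
        soft, hard = resource.getrlimit(limit_id)
        report[label] = (describe_limit(soft), describe_limit(hard))
    return report


if __name__ == "__main__":
    for label, (soft, hard) in collect_limits().items():
        print("%-40s soft=%s hard=%s" % (label, soft, hard))
```

Running the same script on the frontend and through qsub with the same -pe and -l options, then comparing the two outputs, would show whether the job environment is more restricted.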

jhcepas commented 7 years ago

Sounds like a deeper problem with multithreading in Python. Can you run a basic multiprocessing script in your current setup?

from multiprocessing import Pool, TimeoutError
import time
import os

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)              # start 4 worker processes

    # print "[0, 1, 4,..., 81]"
    print pool.map(f, range(10))

    # print same numbers in arbitrary order
    for i in pool.imap_unordered(f, range(10)):
        print i

davidvilanova commented 7 years ago

The multiprocessing works as expected with no error.

echo "python test2.py" | qsub -o out -e err -pe parallel_smp 4
==> out <==
[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
0
1
4
9
16
25
36
49
64
81
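Note that the test above only exercises worker processes, while the failure in the error log is about thread creation. A slightly extended sketch, written for Python 3, has every pool worker also start a few threads. The worker count mirrors --cpu 10 from the earlier command; the number of threads per worker is an arbitrary illustrative value, and Python threads are only a stand-in for the threads started by the external phmmer binary.

```python
from multiprocessing import Pool
import threading


def start_threads(n_threads):
    """Start n_threads trivial threads; return how many started successfully."""
    started = []
    for _ in range(n_threads):
        t = threading.Thread(target=lambda: None)
        try:
            t.start()
        except RuntimeError:
            # Python raises RuntimeError ("can't start new thread") when the
            # operating system refuses to create another thread.
            break
        started.append(t)
    for t in started:
        t.join()
    return len(started)


if __name__ == "__main__":
    workers = 10            # mirrors --cpu 10 from the thread
    threads_per_worker = 4  # arbitrary illustrative value
    with Pool(processes=workers) as pool:
        counts = pool.map(start_threads, [threads_per_worker] * workers)
    print("threads started per worker:", counts)
```

If any worker reports fewer threads than requested when run through qsub, the limit is being hit at the operating-system level rather than inside emapper.py.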

davidvilanova commented 7 years ago

Hi guys, the same command with --cpu 1 almost worked:

emapper.py --d bact --cpu 1 --usemem --output_dir outputdir -o out -i seq.faa

STDOUT log:

# ./emapper.py  -d bact -i /home/dvilanova/work/WGS_PIPELINE/analysis_default_MOCK/seq.faa --override --output_dir outputdir -o out --cpu 1 --usemem
Loading server at localhost, port 51500-51501
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Waiting for server to become ready... localhost 51500
Reading idmap /work/dvilanova/david/sources/eggnog-mapper/data/hmmdb_levels/bact_50/bact_50.hmm.idmap
159207 names loaded
Sequence mapping starts now!
 Processed queries:30 total_time:82.1124022007 rate:0.37 q/s
Hit refinement starts now
 Processed queries:26 total_time:217.179458857 rate:0.12 q/s
Reading HMM matches
Functional annotation of refined hits starts now
 Processed queries:40 total_time:396.550137997 rate:0.10 q/s
Done
   out.emapper.hmm_hits
   out.emapper.seed_orthologs
   out.emapper.annotations
Total time: 734.632 secs

================================================================================
CITATION:
If you use this software, please cite:

[1] Fast genome-wide functional annotation through orthology assignment by
      eggNOG-mapper. Jaime Huerta-Cepas, Damian Szklarczyk, Lars Juhl Jensen,
      Christian von Mering and Peer Bork. Submitted (2016).

[2] eggNOG 4.5: a hierarchical orthology framework with improved functional
      annotations for eukaryotic, prokaryotic and viral sequences. Jaime
      Huerta-Cepas, Damian Szklarczyk, Kristoffer Forslund, Helen Cook, Davide
      Heller, Mathias C. Walter, Thomas Rattei, Daniel R. Mende, Shinichi
      Sunagawa, Michael Kuhn, Lars Juhl Jensen, Christian von Mering, and Peer
      Bork. Nucl. Acids Res. (04 January 2016) 44 (D1): D286-D293. doi:
      10.1093/nar/gkv1248

[3] Accelerated Profile HMM Searches. PLoS Comput. Biol. 7:e1002195. Eddy SR.
       2011.

(e.g. Functional annotation was performed using emapper-0.12.7-8-gc436da5 [1]
 based on eggNOG orthology data [2]. Sequence searches were performed
 using [3].)

================================================================================

STDERR (no really useful information):

26 64.2112979889 0.40 q/s
26 216.679230928 0.12 q/s (refinement)
Your job has been killed (cluster message)

....

jhcepas commented 7 years ago

It seems that your cluster queue system killed the process, probably because it ran out of memory. Could you check whether the same command runs correctly without the queue system?
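One way to check the memory hypothesis is to wrap the command and record the peak resident memory of its child processes. The following is a minimal sketch using only the Python standard library; `run_and_report_peak_rss` is a hypothetical helper name, and the default command is a placeholder, not a real emapper.py invocation. On Linux, `ru_maxrss` is reported in KiB and covers only children that have been waited for, so server processes that are still running or detached may not be counted.

```python
import resource
import subprocess
import sys


def run_and_report_peak_rss(cmd):
    """Run cmd, wait for it, and return (exit_code, peak child RSS in KiB on Linux)."""
    exit_code = subprocess.call(cmd)
    peak_kib = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    return exit_code, peak_kib


if __name__ == "__main__":
    # Placeholder command; pass the real command line as arguments instead.
    cmd = sys.argv[1:] or [sys.executable, "-c", "print('hello')"]
    code, peak = run_and_report_peak_rss(cmd)
    print("exit code: %d, peak resident memory of children: %d KiB" % (code, peak))
```

Keep in mind that a limit such as h_vmem applies to virtual memory, which can be considerably larger than the resident figure reported here.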

davidvilanova commented 7 years ago

I can't do it this way, since I'm launching the jobs from a frontend with restricted usage. I have been using it for three years with different programs, CPU counts, and memory settings. I have also adjusted the memory settings, trying 2 CPUs and 100G per core, which should be enough to load the complete bact database, but it also failed.

davidvilanova commented 7 years ago

Have you tried it with the SGE queuing system?

davidvilanova commented 7 years ago

I have run the analysis with the "-m diamond" option and it worked perfectly on the cluster. I suspect the problem is related to threads when going through the default HMM pipeline.

jhcepas commented 7 years ago

@davidvilanova It seems to work with the SGE setup in our cluster... I used the following submission command:

qsub -pe smp 10 test_sge.sh

and the following job script:

$ cat test_sge.sh

eggnog-mapper/emapper.py -i eggnog-mapper/test/testCOG0515.fa -o test_polb -d bactNOG --cpu 10 --override

output

$ cat test_sge.sh.o
#  emapper-0.12.7-8-gc436da5
# ./emapper.py  -i eggnog-mapper/test/testCOG0515.fa -o test_polb -d bactNOG --cpu 10 --override
Sequence mapping starts now!
 Processed queries:5 total_time:165.61026597 rate:0.03 q/s
Hit refinement starts now
 Processed queries:5 total_time:21.5015897751 rate:0.23 q/s
Reading HMM matches
Functional annotation of refined hits starts now
 Processed queries:14 total_time:0.268085002899 rate:52.22 q/s
Done
   test_polb.emapper.hmm_hits
   test_polb.emapper.seed_orthologs
   test_polb.emapper.annotations
Total time: 187.896 secs

In any case, I would say the diamond mode is preferred overall. All benchmarks show the same or even better results than the HMM mode, at least for genomes that are not extremely distant from the species covered in eggNOG 4.5.

davidvilanova commented 7 years ago

Thanks for replicating. I'm using the optimized bacterial database (-d bact), though. I will stay with diamond. Thanks, David

eggnogdb / eggnog-mapper

Optimized bacterial database error #33