eggnogdb / eggnog-mapper

Fast genome-wide functional annotation through orthology assignment
http://eggnog-mapper.embl.de
GNU Affero General Public License v3.0

Error running diamond: operable program or batch file. #450

Open leonhardt913 opened 1 year ago

leonhardt913 commented 1 year ago

Hi,

I am new to eggNOG-mapper and have barely any experience with Python. I previously tried to install it on a Windows computer but could not get it to run because of problems I had no idea how to fix, so I used the web version of eggNOG-mapper instead.

Now my protein FASTA file is huge, so I am considering running eggNOG-mapper on my Windows PC to do the annotation.

The OS of my PC is Microsoft Windows [Version 10.0.19045.2728]

I installed Python 3.8.8 rather than a newer version, because newer versions ran into problems installing the required biopython (v1.76) along with eggnog-mapper. Even when I installed a newer biopython manually (v1.80 or v1.81, I think), pip tried to remove it and install biopython v1.76 instead, and eventually failed again. In the end the older Python worked for me, and I installed eggnog-mapper from CMD:

C:\Users\AA>pip install eggnog-mapper
Requirement already satisfied: eggnog-mapper in c:\users\AA\appdata\local\programs\python\python38\lib\site-packages (2.1.10)
Requirement already satisfied: biopython==1.76 in c:\users\AA\appdata\local\programs\python\python38\lib\site-packages (from eggnog-mapper) (1.76)
Requirement already satisfied: psutil==5.7.0 in c:\users\AA\appdata\local\programs\python\python38\lib\site-packages (from eggnog-mapper) (5.7.0)
Requirement already satisfied: xlsxwriter==1.4.3 in c:\users\AA\appdata\local\programs\python\python38\lib\site-packages (from eggnog-mapper) (1.4.3)
Requirement already satisfied: numpy in c:\users\AA\appdata\local\programs\python\python38\lib\site-packages (from biopython==1.76->eggnog-mapper) (1.24.2)
WARNING: You are using pip version 20.2.3; however, version 23.0.1 is available.
You should consider upgrading via the 'c:\users\AA\appdata\local\programs\python\python38\python.exe -m pip install --upgrade pip' command.

After installing, because the other steps confused me a little (I am not sure whether this is what causes the error), I jumped to the eggNOG-mapper database download in the Setup section. I found a way to manually download eggnog.db, eggnog.taxa.tar and eggnog_proteins.dmnd, and put them in Python38\Lib\site-packages\data.

Then I tested the emapper.py command in CMD, but it always opens the emapper.py file with my default program for .py files, even though emapper.py does have a #! line. I then went to the Scripts folder and ran "python emapper.py", and that works.

C:\Users\AA\AppData\Local\Programs\Python\Python38\Scripts>python emapper.py
usage: emapper.py [-h] [-v] [--list_taxa] [--cpu NUM_CPU] [--mp_start_method {fork,spawn,forkserver}] [--resume]
                  [--override] [-i FASTA_FILE] [--itype {CDS,proteins,genome,metagenome}] [--translate]
                  [--annotate_hits_table SEED_ORTHOLOGS_FILE] [-c FILE] [--data_dir DIR]
                  [--genepred {search,prodigal}] [--trans_table TRANS_TABLE_CODE] [--training_genome FILE]
                  [--training_file FILE] [--allow_overlaps {none,strand,diff_frame,all}] [--overlap_tol FLOAT]
                  [-m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}] [--pident PIDENT] [--query_cover QUERY_COVER]
                  [--subject_cover SUBJECT_COVER] [--evalue EVALUE] [--score SCORE] [--dmnd_algo {auto,0,1,ctg}]
                  [--dmnd_db DMND_DB_FILE]
                  [--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}]
                  [--dmnd_iterate {yes,no}]
                  [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
                  [--dmnd_frameshift DMND_FRAMESHIFT] [--gapopen GAPOPEN] [--gapextend GAPEXTEND]
                  [--block_size BLOCK_SIZE] [--index_chunks CHUNKS] [--outfmt_short] [--dmnd_ignore_warnings]
                  [--mmseqs_db MMSEQS_DB_FILE] [--start_sens START_SENS] [--sens_steps SENS_STEPS]
                  [--final_sens FINAL_SENS] [--mmseqs_sub_mat SUBS_MATRIX] [-d HMMER_DB_PREFIX] [--servers_list FILE]
                  [--qtype {hmm,seq}] [--dbtype {hmmdb,seqdb}] [--usemem] [-p PORT] [--end_port PORT]
                  [--num_servers NUM_SERVERS] [--num_workers NUM_WORKERS] [--timeout_load_server TIMEOUT_LOAD_SERVER]
                  [--hmm_maxhits MAXHITS] [--report_no_hits] [--hmm_maxseqlen MAXSEQLEN] [--Z DB_SIZE] [--cut_ga]
                  [--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans] [--no_annot] [--dbmem]
                  [--seed_ortholog_evalue MIN_E-VALUE] [--seed_ortholog_score MIN_SCORE] [--tax_scope TAX_SCOPE]
                  [--tax_scope_mode TAX_SCOPE_MODE] [--target_orthologs {one2one,many2one,one2many,many2many,all}]
                  [--target_taxa LIST_OF_TAX_IDS] [--excluded_taxa LIST_OF_TAX_IDS] [--report_orthologs]
                  [--go_evidence {experimental,non-electronic,all}] [--pfam_realign {none,realign,denovo}] [--md5]
                  [--output FILE_PREFIX] [--output_dir DIR] [--scratch_dir DIR] [--temp_dir DIR] [--no_file_comments]
                  [--decorate_gff DECORATE_GFF] [--decorate_gff_ID_field DECORATE_GFF_ID_FIELD] [--excel]
emapper.py: error: An input fasta file is required (-i)

I simply put my test.fasta file (which is small, about 2 MB) in the "Python38\Scripts" folder and tested the command, but it quickly failed.

C:\Users\AA\AppData\Local\Programs\Python\Python38\Scripts>python emapper.py -i test.fasta -o result
#  emapper-2.1.10
# emapper.py  -i test.fasta -o result
  C:\Users\AA\AppData\Local\Programs\Python\Python38\lib\site-packages\eggnogmapper\bin\diamond blastp -d 'C:\Users\AA\AppData\Local\Programs\Python\Python38\lib\site-packages\data\eggnog_proteins.dmnd' -q 'C:\Users\AA\AppData\Local\Programs\Python\Python38\Scripts\test.fasta' --threads 1 -o 'C:\Users\AA\AppData\Local\Programs\Python\Python38\Scripts\result.emapper.hits' --tmpdir 'C:\Users\AA\AppData\Local\Programs\Python\Python38\Scripts\emappertmp_dmdn_47tyzvr_' --sensitive --iterate -e 0.001 --top 3  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp
Error running diamond: operable program or batch file.

Above are the steps I have done so far. Any help will be appreciated.

Cantalapiedra commented 1 year ago

Hi @leonhardt913 ,

As I don't use Windows for running eggnog-mapper, my only advice regarding this would be to use Linux within Windows. I use Ubuntu, and it works very well for me. It is rather easy to install, for instance:

https://ubuntu.com/tutorials/install-ubuntu-on-wsl2-on-windows-10#1-overview

The specific error that you have could be because eggnog-mapper is trying to use the bundled diamond program, which is very likely compiled for Linux. If you still want to use it from Windows, you may try installing a diamond version for Windows and adding it to your PATH environment variable.
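One way to see which diamond executable the system would pick up is to ask Python where `diamond` resolves on the PATH. This is only a minimal sketch and not part of eggnog-mapper: `describe_diamond` is a hypothetical helper name, and whether emapper.py on Windows would then use that executable instead of the bundled one is not confirmed in this thread.

```python
import platform
import shutil


def describe_diamond(executable="diamond"):
    """Report which executable (if any) the current PATH resolves to.

    `describe_diamond` is a hypothetical helper for illustration only.
    """
    resolved = shutil.which(executable)
    system = platform.system()  # e.g. "Windows" or "Linux"
    if resolved is None:
        return f"'{executable}' was not found on PATH ({system})"
    return f"'{executable}' resolves to {resolved} ({system})"


if __name__ == "__main__":
    print(describe_diamond())
```

If nothing is found, the PATH does not yet include a diamond build for the current system.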

Best, Carlos

leonhardt913 commented 1 year ago

@Cantalapiedra

Hi Carlos. Thanks for your advice. I will take a look at the tutorial you sent and figure out how to install Python and eggnog-mapper in Ubuntu.

It might be a lot of work for me to figure out how to replace the bundled diamond program with a Windows version without getting more errors. I am not sure whether anyone else in this community has done this and could give me some advice. For now I would rather try the virtual Linux system, since I have heard that many bioinformatics tools run well in Ubuntu.

Best, Leo

Cantalapiedra commented 1 year ago

I hope that you can make it work! Good luck!

Of course, if you need any advice during the installation, don't hesitate to ask. Once you are able to run Linux, I would advise you to follow this:

https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#user-content-Installation

Often, the easiest way is to do it with conda or pip. Be sure to have them updated, so that you are able to install the latest versions of the software. Once you have done that, please follow this to set up the databases, etc.:

https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#user-content-Setup

I would begin by installing only the complete diamond database. Then test whether you are able to obtain some annotations.

https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#basic-usage

Once you have it working, you may worry about other options, databases, recipes, etc.

Just my 2 cents.

Best, Carlos

leonhardt913 commented 1 year ago

Dear @Cantalapiedra ,

Thanks for your help, I have one question about the Linux OS I installed:

I updated WSL2 and installed Ubuntu using the .appx file I had downloaded before. (I changed the suffix to .zip, unzipped it, installed it using the .exe inside, and set up my UNIX account as well.)

My Ubuntu version is 20.04:

PS C:\windows\system32> wsl -l -v
  NAME            STATE           VERSION
* Ubuntu-20.04    Running         2

Then I typed "wsl" to start Ubuntu and used the following command to update the system, which took about 10 minutes:

/mnt/c/windows/system32$ sudo apt-get -y update && sudo apt-get -y upgrade

During the update I saw some lines mentioning "python3", so I guess my Linux already has python3 installed:

Preparing to unpack .../120-python3-cryptography_2.8-3ubuntu0.1_amd64.deb ...
Unpacking python3-cryptography (2.8-3ubuntu0.1) over (2.8-3) ...
Preparing to unpack .../121-python3-jwt_1.7.1-2ubuntu2.1_all.deb ...
Unpacking python3-jwt (1.7.1-2ubuntu2.1) over (1.7.1-2ubuntu2) ...
Preparing to unpack .../122-python3-urllib3_1.25.8-2ubuntu0.2_all.deb ...
Unpacking python3-urllib3 (1.25.8-2ubuntu0.2) over (1.25.8-2) ...
Preparing to unpack .../123-python3-requests_2.22.0-2ubuntu1_all.deb ...
Unpacking python3-requests (2.22.0-2ubuntu1) over (2.22.0-2build1) ...

Then I found out pip has to be installed independently:

/mnt/c/windows/system32$ pip

Command 'pip' not found, but can be installed with:

sudo apt install python3-pip

/mnt/c/windows/system32$ sudo apt install python3-pip

My question is: do I have to update Python? I'm not sure whether this python3 meets the requirements (the Wiki says Python 3.7 or higher) so that I can continue with installing eggnog-mapper, biopython, etc.

Look forward to your reply, Leo

Cantalapiedra commented 1 year ago

Hi @leonhardt913 ,

You can check the Python version with python3 --version. You may upgrade Python, for instance following https://cloudbytes.dev/snippets/upgrade-python-to-latest-version-on-ubuntu-linux

However, if you don't want to upgrade Python, you may use an environment manager (e.g. conda). You may install Miniconda, Miniforge, Minimamba, or similar, and then just conda install eggnog-mapper, which would (hopefully) install the correct Python version (and other packages) for using eggnog-mapper.
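The same check can be done from inside Python. The sketch below compares the running interpreter against the "3.7 or higher" requirement quoted from the Wiki earlier in this thread; `meets_minimum` is a hypothetical helper name, and the minimum may differ for other eggnog-mapper releases.

```python
import sys

# Requirement as quoted from the Wiki earlier in this thread (assumption:
# it applies to the eggnog-mapper version being installed).
MINIMUM = (3, 7)


def meets_minimum(version_info=None, minimum=MINIMUM):
    """Return True if a (major, minor, ...) version is at least `minimum`."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    verdict = "OK" if meets_minimum() else "too old"
    print(sys.version.split()[0], verdict)
```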

leonhardt913 commented 1 year ago

Hi @Cantalapiedra ,

I have pretty much set everything up, but I am stuck at the first test run again.

I installed Miniconda3, set up an environment named "rna", upgraded Python to 3.11, and installed eggnog-mapper.

My eggnog-mapper path is:

/home/leo913/miniconda3/envs/rna/lib/python3.11/site-packages/eggnogmapper

so I mimicked the first step of setting up the PATH (I am not sure what it is for; if the following line is wrong, could that be why my run gets stuck? Please correct me if it is wrong):

export PATH=/home/leo913/miniconda3/envs/rna/lib/python3.11/site-packages/eggnogmapper:/home/leo913/miniconda3/envs/rna/lib/python3.11/site-packages/eggnogmapper/bin:"$PATH"

Then I set up the directory for downloading the diamond database and downloaded it successfully, but I later moved the files under /python3.11/site-packages/data/

At this point I am able to call emapper.py from any folder:

leo913@DESKTOP-56F5NN6:/mnt/c/windows/system32$ cd ~
leo913@DESKTOP-56F5NN6:~$ ls
miniconda3
leo913@DESKTOP-56F5NN6:~$ mkdir eggnog-mapper-workplace
leo913@DESKTOP-56F5NN6:~$ cd eggnog-mapper-workplace
leo913@DESKTOP-56F5NN6:~/eggnog-mapper-workplace$ conda activate rna
(rna) leo913@DESKTOP-56F5NN6:~/eggnog-mapper-workplace$ emapper.py -i
usage: emapper.py [-h] [-v] [--list_taxa] [--cpu NUM_CPU] [--mp_start_method {fork,spawn,forkserver}] [--resume]
                  [--override] [-i FASTA_FILE] [--itype {CDS,proteins,genome,metagenome}] [--translate]
                  [--annotate_hits_table SEED_ORTHOLOGS_FILE] [-c FILE] [--data_dir DIR]
                  [--genepred {search,prodigal}] [--trans_table TRANS_TABLE_CODE] [--training_genome FILE]
                  [--training_file FILE] [--allow_overlaps {none,strand,diff_frame,all}] [--overlap_tol FLOAT]
                  [-m {diamond,mmseqs,hmmer,no_search,cache,novel_fams}] [--pident PIDENT] [--query_cover QUERY_COVER]
                  [--subject_cover SUBJECT_COVER] [--evalue EVALUE] [--score SCORE] [--dmnd_algo {auto,0,1,ctg}]
                  [--dmnd_db DMND_DB_FILE]
                  [--sensmode {default,fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}]
                  [--dmnd_iterate {yes,no}]
                  [--matrix {BLOSUM62,BLOSUM90,BLOSUM80,BLOSUM50,BLOSUM45,PAM250,PAM70,PAM30}]
                  [--dmnd_frameshift DMND_FRAMESHIFT] [--gapopen GAPOPEN] [--gapextend GAPEXTEND]
                  [--block_size BLOCK_SIZE] [--index_chunks CHUNKS] [--outfmt_short] [--dmnd_ignore_warnings]
                  [--mmseqs_db MMSEQS_DB_FILE] [--start_sens START_SENS] [--sens_steps SENS_STEPS]
                  [--final_sens FINAL_SENS] [--mmseqs_sub_mat SUBS_MATRIX] [-d HMMER_DB_PREFIX] [--servers_list FILE]
                  [--qtype {hmm,seq}] [--dbtype {hmmdb,seqdb}] [--usemem] [-p PORT] [--end_port PORT]
                  [--num_servers NUM_SERVERS] [--num_workers NUM_WORKERS] [--timeout_load_server TIMEOUT_LOAD_SERVER]
                  [--hmm_maxhits MAXHITS] [--report_no_hits] [--hmm_maxseqlen MAXSEQLEN] [--Z DB_SIZE] [--cut_ga]
                  [--clean_overlaps none|all|clans|hmmsearch_all|hmmsearch_clans] [--no_annot] [--dbmem]
                  [--seed_ortholog_evalue MIN_E-VALUE] [--seed_ortholog_score MIN_SCORE] [--tax_scope TAX_SCOPE]
                  [--tax_scope_mode TAX_SCOPE_MODE] [--target_orthologs {one2one,many2one,one2many,many2many,all}]
                  [--target_taxa LIST_OF_TAX_IDS] [--excluded_taxa LIST_OF_TAX_IDS] [--report_orthologs]
                  [--go_evidence {experimental,non-electronic,all}] [--pfam_realign {none,realign,denovo}] [--md5]
                  [--output FILE_PREFIX] [--output_dir DIR] [--scratch_dir DIR] [--temp_dir DIR] [--no_file_comments]
                  [--decorate_gff DECORATE_GFF] [--decorate_gff_ID_field DECORATE_GFF_ID_FIELD] [--excel]
emapper.py: error: argument -i: expected one argument

Then I used Windows File Explorer to copy test.fasta into the Linux system under /eggnog-mapper-workplace/ and tried to run emapper.py:

(rna) leo913@DESKTOP-56F5NN6:~/eggnog-mapper-workplace$ ls
test.fasta
(rna)leo913@DESKTOP-56F5NN6:~/eggnog-mapper-workplace$ emapper.py -i test.fasta -o result1
#  emapper-2.1.10
# emapper.py  -i test.fasta -o result1
  /home/leo913/miniconda3/envs/rna/bin/diamond blastp -d '/home/leo913/miniconda3/envs/rna/lib/python3.11/site-packages/data/eggnog_proteins.dmnd' -q '/home/leo913/eggnog-mapper-workplace/test.fasta' --threads 1 -o '/home/leo913/eggnog-mapper-workplace/result1.emapper.hits' --tmpdir '/home/leo913/eggnog-mapper-workplace/emappertmp_dmdn_z5vu_inu' --sensitive --iterate -e 0.001 --top 3  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp

After that there is no more response from emapper or PowerShell. Even though I can still type at the prompt, I have to force-close PowerShell. The test.fasta I used only contains about 100 proteins. Looking into /eggnog-mapper-workplace/ with Windows File Explorer, I found that emapper did create a 0 KB "result1.emapper.hits" file and a new, empty folder "emappertmp_dmdn_z5vu_inu".

I am not sure where I went wrong (maybe the PATH settings I mentioned above?).

Looking forward to your help. Leo

Cantalapiedra commented 1 year ago

Hi @leonhardt913 ,

How long was the last command running? It seems that it just didn't finish, but I am not sure. You may try with an even smaller test FASTA file (1 sequence, for instance), at least for the test.
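A one-sequence test file can be cut from the existing FASTA with a few lines of plain Python. This is a minimal sketch under the assumption that the input is a standard FASTA file in which each record starts with a ">" header line; `first_fasta_records` and `write_subset` are hypothetical helper names, and the output file name is only an example.

```python
def first_fasta_records(lines, n=1):
    """Yield the lines belonging to the first `n` FASTA records.

    Lines that appear before the first ">" header are skipped.
    """
    seen = 0
    for line in lines:
        if line.startswith(">"):
            seen += 1
            if seen > n:
                break  # the (n+1)-th header starts a record we do not want
        if seen >= 1:
            yield line


def write_subset(src_path, dst_path, n=1):
    """Copy the first `n` records of `src_path` into `dst_path`."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        dst.writelines(first_fasta_records(src, n))


if __name__ == "__main__":
    # "one_seq.fasta" is a made-up output name for illustration.
    write_subset("test.fasta", "one_seq.fasta", n=1)
```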

leonhardt913 commented 1 year ago

I will try running it again with a smaller test file next week, since I am out of the office. I will let you know the result.

leonhardt913 commented 1 year ago

Hi @Cantalapiedra ,

I used a FASTA file with 1 protein sequence and successfully got my result.

(rna) leo913@DESKTOP-56F5NN6:~/eggnog-mapper-workplace$ emapper.py -i test.fasta -o result1
#  emapper-2.1.10
# emapper.py  -i test.fasta -o result1
  /home/leo913/miniconda3/envs/rna/bin/diamond blastp -d '/home/leo913/miniconda3/envs/rna/lib/python3.11/site-packages/data/eggnog_proteins.dmnd' -q '/home/leo913/eggnog-mapper-workplace/test.fasta' --threads 1 -o '/home/leo913/eggnog-mapper-workplace/result1.emapper.hits' --tmpdir '/home/leo913/eggnog-mapper-workplace/emappertmp_dmdn_94hr8z8v' --sensitive --iterate -e 0.001 --top 3  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore qcovhsp scovhsp
Functional annotation of hits...
1 1.0553483963012695 0.95 q/s (% mem usage: 3.20, % mem avail: 96.82)
Done
Result files:
   /home/leo913/eggnog-mapper-workplace/result1.emapper.hits
   /home/leo913/eggnog-mapper-workplace/result1.emapper.seed_orthologs
   /home/leo913/eggnog-mapper-workplace/result1.emapper.annotations

================================================================================
CITATION:
If you use this software, please cite:

[1] eggNOG-mapper v2: functional annotation, orthology assignments, and domain
      prediction at the metagenomic scale. Carlos P. Cantalapiedra,
      Ana Hernandez-Plaza, Ivica Letunic, Peer Bork, Jaime Huerta-Cepas. 2021.
      Molecular Biology and Evolution, msab293, https://doi.org/10.1093/molbev/msab293

[2] eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated
      orthology resource based on 5090 organisms and 2502 viruses. Jaime
      Huerta-Cepas, Damian Szklarczyk, Davide Heller, Ana Hernandez-Plaza,
      Sofia K Forslund, Helen Cook, Daniel R Mende, Ivica Letunic, Thomas
      Rattei, Lars J Jensen, Christian von Mering and Peer Bork. Nucleic Acids
      Research, Volume 47, Issue D1, 8 January 2019, Pages D309-D314,
      https://doi.org/10.1093/nar/gky1085

[3] Sensitive protein alignments at tree-of-life scale using DIAMOND.
       Buchfink B, Reuter K, Drost HG. 2021.
       Nature Methods 18, 366–368 (2021). https://doi.org/10.1038/s41592-021-01101-x

e.g. Functional annotation was performed using eggNOG-mapper (version emapper-2.1.10) [1]
 based on eggNOG orthology data [2]. Sequence searches were performed using [3].

================================================================================

Total hits processed: 1
Total time: 2329 secs
FINISHED

I really appreciate that my first test run completed with your help.

As shown, it took about 2329 secs (almost 40 minutes) to finish the job for 1 protein. I guess the hardware of a regular desktop PC has its limits, but I still wonder whether there is any way to speed up the annotation, or whether any configuration of Ubuntu (or WSL2) should be changed, since the coming FASTA files could contain 100,000+ proteins.

What confused me is that in the middle of the output, it shows:

Functional annotation of hits...
1 1.0553483963012695 0.95 q/s (% mem usage: 3.20, % mem avail: 96.82)
Done

This is very different from the "2329 secs" shown at the bottom. It seems that my PC is able to annotate faster, yet it took 40 minutes to produce the result. Do you have any idea why?

Best regards, Leo

PS: I started another annotation run with about 100 proteins in test2.fasta; I will see how long it takes.

Cantalapiedra commented 1 year ago

Hi @leonhardt913 ,

Glad that it worked. Probably the first job, with 100 proteins, just had not finished yet.

The q/s that you see corresponds to the annotation stage only. I guess that the rest of the 2329 seconds went to the diamond search. Note that diamond scales very well for large queries, but by default it is not the fastest for small queries.
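The arithmetic behind this can be checked with the numbers from the log above. The sketch assumes that the three values on the progress line are the number of queries annotated, the elapsed seconds of the annotation stage, and the resulting rate; that reading is an interpretation of the output, not something stated in the log itself.

```python
# Values copied from the run shown above.
queries = 1
annotation_secs = 1.0553483963012695  # assumed: elapsed time of the annotation stage
total_secs = 2329                     # "Total time" reported at the end


def annotation_rate(n_queries, secs):
    """Queries per second for the annotation stage."""
    return n_queries / secs


def time_outside_annotation(total, annotation):
    """Seconds not spent in annotation (search, start-up, and anything else)."""
    return total - annotation


if __name__ == "__main__":
    rate = annotation_rate(queries, annotation_secs)
    rest = time_outside_annotation(total_secs, annotation_secs)
    share = annotation_secs / total_secs
    print(f"annotation rate: {rate:.2f} q/s")
    print(f"time outside annotation: {rest:.0f} s")
    print(f"annotation share of total time: {share:.4%}")
```

Under that reading, the annotation stage accounts for well under 0.1% of the total time, which is consistent with the guess that almost all of the run went to the search step.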

Of course, depending on your hardware there are ways to speed up things. For instance, when using diamond for small queries you may use the --dmnd_algo ctg option. See diamond options at https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#diamond-search-options

Also the number of threads that you use, --cpu, has a large impact. See https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#execution-options

Note that for a large number of queries, the stage that is usually slower is the annotation stage. If you have enough memory you may accelerate this with --dbmem. See https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#user-content-Other_Requirements and https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.10#user-content-Annotation_Options

There are many options, and depending on your data and hardware you may want to use them or not.
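To make the options above concrete, the following sketch assembles an emapper.py command line from them. `build_emapper_command` is a hypothetical helper, not part of eggnog-mapper; the flags (`--cpu`, `--dmnd_algo`, `--dbmem`) and the accepted `--dmnd_algo` values are taken from the usage text printed earlier in this thread, and the file names are the ones used in the test runs. The sketch only prints the command and does not execute it.

```python
import shlex

# Choices copied from the usage text: [--dmnd_algo {auto,0,1,ctg}]
DMND_ALGO_CHOICES = {"auto", "0", "1", "ctg"}


def build_emapper_command(fasta, output, cpu=None, dmnd_algo=None, dbmem=False):
    """Return the emapper.py argument list for the chosen speed-up options."""
    cmd = ["emapper.py", "-i", fasta, "-o", output]
    if cpu is not None:
        cmd += ["--cpu", str(cpu)]  # number of threads
    if dmnd_algo is not None:
        if dmnd_algo not in DMND_ALGO_CHOICES:
            raise ValueError(f"unsupported --dmnd_algo value: {dmnd_algo!r}")
        cmd += ["--dmnd_algo", dmnd_algo]  # "ctg" was suggested for small queries
    if dbmem:
        cmd.append("--dbmem")  # needs enough memory, as noted above
    return cmd


if __name__ == "__main__":
    small = build_emapper_command("test.fasta", "result1", cpu=8, dmnd_algo="ctg")
    print(shlex.join(small))
```

Which combination is worthwhile depends on the data and the hardware, so this is only a convenient way to try variations.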

I hope this is of help.

Best, Carlos

leonhardt913 commented 1 year ago

Hi @Cantalapiedra ,

Thanks for the tips!

The 100-protein test run was interrupted, I guess because of a Windows update at midnight. However, I then ran my 200k-protein FASTA file with the additional option "--cpu 8", and the annotation took about 6 hours, which is totally acceptable for me.

Total hits processed: 199714
Total time: 19692 secs
FINISHED
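For reference, the two numbers above can be converted into hours and an overall throughput. This is plain arithmetic on the reported values; note that "hits processed" counts hits, which is not necessarily the number of input proteins.

```python
# Values copied from the summary above.
total_hits = 199714
total_secs = 19692


def hours(secs):
    """Convert seconds to hours."""
    return secs / 3600


def hits_per_second(hits, secs):
    """Overall throughput across the whole run (search plus annotation)."""
    return hits / secs


if __name__ == "__main__":
    print(f"total time: {hours(total_secs):.2f} h")
    print(f"throughput: {hits_per_second(total_hits, total_secs):.2f} hits/s")
```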

Again thanks for your help!

Best, Leo

Cantalapiedra commented 1 year ago

Glad to be of help!