WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

database_handler.py:51: UserWarning: Database does not exist at path None #212

Closed cmkobel closed 1 year ago

cmkobel commented 1 year ago

Hello

I just installed DRAM with conda on a fresh miniconda3 install on two independent HPC systems.

After running the DRAM-setup.py prepare_databases --output_dir DRAM_data step, I get the following error on both systems.

/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:51: UserWarning: Database does not exist at path None
  warnings.warn('Database does not exist at path %s' % self.description_loc)

I'm not sure how to fix this error, as all the databases seem to exist when checking with DRAM-setup.py print_config.

Here is the full log of prepare_databases and print_config: log.txt

How do I go about fixing this?
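
For readers hitting the same message: the text "path None" is what Python's %-formatting produces when the formatted value is None, which suggests the description database location was never set in the config rather than pointing at a file that went missing. A minimal sketch of that behavior follows; `check_description_db` is a hypothetical helper written for illustration and is not part of mag_annotator.

```python
import os
import warnings


def check_description_db(description_loc):
    """Hypothetical helper: warn if the description database is unset or missing."""
    if description_loc is None or not os.path.exists(description_loc):
        # With description_loc=None, %s renders as the literal text "None".
        warnings.warn('Database does not exist at path %s' % description_loc)
        return False
    return True


if __name__ == "__main__":
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        check_description_db(None)
    print(caught[0].message)  # prints: Database does not exist at path None
```

Seeing this warning during prepare_databases is therefore consistent with a description database that has not been built yet.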

cmkobel commented 1 year ago

When I run DRAM on a simple E. faecium strain 116 isolate genome, I get the following KeyError when DRAM processes peptidase hit descriptions.

0:13:54.990374: Getting descriptions of hits from peptidase
/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:81: UserWarning: No descriptions were found for your id's. Does this MER0389353 look like an id from peptidase_description
  warnings.warn("No descriptions were found for your id's. Does this %s look like an id from %s" % (list(ids)[0],
Traceback (most recent call last):
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/bin/DRAM.py", line 189, in <module>
    args.func(**args_dict)
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1040, in annotate_bins_cmd
    annotate_bins(list(set(fasta_locs)), output_dir, min_contig_size, prodigal_mode, trans_table, bit_score_threshold,
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1079, in annotate_bins
    all_annotations = annotate_fastas(fasta_locs, output_dir, db_handler, min_contig_size, prodigal_mode, trans_table,
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1013, in annotate_fastas
    annotate_fasta(fasta_loc, fasta_name, fasta_dir, db_handler, min_contig_size, prodigal_mode, trans_table,
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 921, in annotate_fasta
    annotations = annotate_orfs(gene_faa, db_handler, tmp_dir, start_time, custom_db_locs, custom_hmm_locs,
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 821, in annotate_orfs
    annotation_list.append(do_blast_style_search(query_db, db_handler.db_locs['peptidase'], tmp_dir,
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 684, in do_blast_style_search
    hits = formater(hits, header_dict)
  File "/cluster/work/users/cmkobel/miniconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 187, in get_peptidase_description
    header = header_dict[peptidase_hit]
KeyError: 'MER0389353'

Though, I'm not sure whether this is related to the initial mag_annotator issue.
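
The traceback above ends in a plain dictionary lookup, `header_dict[peptidase_hit]`, which raises KeyError as soon as one hit id has no stored description. The sketch below shows a tolerant lookup under the assumption that an empty description is acceptable downstream. `describe_hits` is a hypothetical function, not DRAM's code, and it would only hide the symptom: the missing descriptions still point to an incomplete description database.

```python
import warnings


def describe_hits(hit_ids, header_dict):
    """Map each hit id to its header, tolerating ids missing from header_dict."""
    descriptions = {}
    missing = []
    for hit_id in hit_ids:
        header = header_dict.get(hit_id)  # .get() returns None instead of raising KeyError
        if header is None:
            missing.append(hit_id)
            descriptions[hit_id] = ""  # placeholder so downstream code still gets a value
        else:
            descriptions[hit_id] = header
    if missing:
        warnings.warn("No description found for %d id(s), e.g. %s" % (len(missing), missing[0]))
    return descriptions
```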

rmFlynn commented 1 year ago

Sometimes the setup process exits early during the update-descriptions step; running DRAM-setup.py update_description_db will complete the process. That it happened on both systems is odd, and it is something I will look into.

cmkobel commented 1 year ago

I'm having trouble allocating enough RAM to run update_description. Is it correct that more than 500 GB is needed?


Regarding the machines:

Both systems were CentOS Red Hat with GCC 4.8.5-44:

cmkobel@fe-open-01:~$ cat /proc/version
Linux version 3.10.0-1160.53.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC)
cmkobel@login-5 ~ $ cat /proc/version
Linux version 3.10.0-1160.62.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) 

So that might be a confounding factor.

rmFlynn commented 1 year ago

Yes, it's possible. It's a big problem that takes way too much memory. You might want to skip UniRef, since it gets bigger all the time; instructions for doing so are in the README. Possibly there has been an increase in size. I may need to look into it, as there have been a lot of problems like this lately.

cmkobel commented 1 year ago

OK. The machines (head nodes) are both limited to some 377 GB of RAM. I wonder what algorithm update_description uses? Does it really need to load the full UniRef into RAM at once, or could we make an implementation that works on subset chunks instead?

rmFlynn commented 1 year ago

Probably not. It's just putting the data into an SQL database, so I think loading everything at once is probably needless, although it could be a quirk of the MMseqs file format. It's been on my to-do list forever, and I don't think it will get done this month or next.
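
To illustrate the chunked approach discussed above, here is a minimal sketch that streams header descriptions into SQLite in fixed-size batches, so memory use is bounded by the batch size rather than the file size. Everything in it is an assumption made for illustration: the function names, the table schema, and the use of plain FASTA headers as input (the real step may read MMseqs files instead). It is not how DRAM implements update_description.

```python
import sqlite3
from itertools import islice


def iter_fasta_headers(lines):
    """Yield (id, description) for each FASTA header line, one at a time."""
    for line in lines:
        if line.startswith(">"):
            seq_id, _, description = line[1:].strip().partition(" ")
            yield seq_id, description


def load_descriptions(lines, conn, chunk_size=100_000):
    """Insert header descriptions in fixed-size chunks to bound memory use."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS descriptions (id TEXT PRIMARY KEY, description TEXT)"
    )
    headers = iter_fasta_headers(lines)
    total = 0
    while True:
        chunk = list(islice(headers, chunk_size))  # at most chunk_size rows in memory
        if not chunk:
            break
        conn.executemany("INSERT OR REPLACE INTO descriptions VALUES (?, ?)", chunk)
        conn.commit()
        total += len(chunk)
    return total
```

Because `lines` can be any iterable of strings, a file handle from `gzip.open(path, "rt")` can be passed directly and is read lazily, line by line.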

Sautumn-Lin commented 1 year ago

Hello, can you help me solve this problem? I have downloaded the DRAM data to /scratch/PI/boqianpy/App/DRAM_data/, but I don't have the old DRAM. Which command should I use for setup?

rmFlynn commented 1 year ago

So you want to run DRAM-setup.py prepare_databases but skip downloading the databases because they are already downloaded? You will need to run DRAM-setup.py prepare_databases --help to see the arguments, and then build a long command that points to each file with the --<name>_loc arguments.
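
One way to assemble such a long command without typos is to build it from a mapping of flag to file path and to check every path before running anything. The sketch below assumes exactly that; `build_setup_command` is a hypothetical helper, the paths are placeholders, and the flag names must be taken from the --help output of your installed DRAM version.

```python
import os
import shlex


def build_setup_command(output_dir, local_files, threads=10):
    """Build a prepare_databases argument list from {flag: path}, checking paths first."""
    missing = [flag for flag, path in local_files.items() if not os.path.exists(path)]
    if missing:
        raise FileNotFoundError("These paths do not exist: %s" % ", ".join(missing))
    argv = ["DRAM-setup.py", "prepare_databases",
            "--output_dir", output_dir, "--threads", str(threads)]
    for flag, path in local_files.items():
        argv.extend([flag, path])
    return argv


if __name__ == "__main__":
    # Placeholder path; use the flag names printed by `prepare_databases --help`.
    files = {"--kegg_loc": "/my/path/database_files/kegg.pep"}
    try:
        print(shlex.join(build_setup_command("DRAM_data", files)))
    except FileNotFoundError as err:
        print(err)
```

The returned list can be passed to `subprocess.run` as is, or printed with `shlex.join` and pasted into a job script.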

rmFlynn commented 1 year ago

Here is an example:

# The flags are collected in a bash array so that each one can carry a comment.
args=(
   --output_dir download_test
   --kegg_loc /my/path/database_files/kegg-all-orgs_unique_reheader.pep                   # KEGG protein file, should be a single .pep, please merge all KEGG pep files (default: None)
   --threads 30                                                                           # Number of threads to use building mmseqs2 databases (default: 10)
   --kofam_hmm_loc /my/path/database_files/kofam_profiles.tar.gz                          # hmm file for KOfam (profiles.tar.gz) (default: None)
   --kofam_ko_list_loc /my/path/database_files/kofam_ko_list.tsv.gz                       # KOfam ko list file (ko_list.gz) (default: None)
   --uniref_loc /my/path/database_files/uniref90.fasta.gz                                 # File path to uniref, if already downloaded (uniref90.fasta.gz) (default: None)
   --pfam_loc /my/path/database_files/Pfam-A.full.gz                                      # File path to pfam-A full file, if already downloaded (Pfam-A.full.gz) (default: None)
   --pfam_hmm_dat /my/path/Pfam-A.hmm.dat.gz                                              # pfam hmm .dat file to get PF descriptions, if already downloaded (Pfam-A.hmm.dat.gz) (default: None)
   --dbcan_loc /my/path/database_files/CAMPER_v1.0.0-beta.1.tar.gz                        # File path to dbCAN, if already downloaded (dbCAN-HMMdb-V9.txt) (default: None)
   --dbcan_fam_activities /my/path/CAZyDB.07292021.fam-activities.txt                     # CAZY family activities file, if already downloaded (CAZyDB.07302020.fam-activities.txt) (default: None)
   --dbcan_sub_fam_activities /my/path/CAZyDB.07292021.fam.subfam.ec.txt                  # CAZY subfamily activities file, if already downloaded (CAZyDB.07292021.fam.subfam.ec.txt) (default: None)
   --vogdb_loc /my/path/database_files/vog.hmm.tar.gz                                     # hmm file for vogdb, if already downloaded (vog.hmm.tar.gz) (default: None)
   --vog_annotations /my/path/vog_annotations_latest.tsv.gz                               # vogdb annotations file, if already downloaded (vog.annotations.tsv.gz) (default: None)
   --camper_tar_gz_loc /my/path/database_files/CAMPER_v1.0.0-beta.1.tar.gz
   --viral_loc /my/path/database_files/viral.merged.protein.faa.gz                        # File path to merged viral protein faa, if already downloaded (viral.x.protein.faa.gz) (default: None)
   --peptidase_loc /my/path/database_files/merops_peptidases_nr.faa                       # File path to MEROPS peptidase fasta, if already downloaded (pepunit.lib) (default: None)
   --genome_summary_form_loc /my/path/database_files/genome_summary_form.20220504.tsv     # File path to genome summary form, if already downloaded (default: None)
   --module_step_form_loc /my/path/database_files/module_step_form.20220504.tsv           # File path to module step form, if already downloaded (default: None)
   --etc_module_database_loc /my/path/database_files/etc_mdoule_database.20220504.tsv     # File path to etc module database, if already downloaded (default: None)
   --function_heatmap_form_loc /my/path/database_files/function_heatmap_form.20220504.tsv # File path to function heatmap form, if already downloaded (default: None)
   --amg_database_loc /my/path/database_files/amg_database.20220504.tsv                   # File path to amg database, if already downloaded (default: None)
)
DRAM-setup.py prepare_databases "${args[@]}"

Sautumn-Lin commented 1 year ago

Thank you for your answers. I finished the setup.

SF-Dragon commented 1 year ago

The --<name>_loc arguments didn't work well. When I tried to use an already downloaded database, I got the following errors.

DRAM-setup.py prepare_databases --output_dir DRAMdata --kegg_loc /Users/DRAMdatabases/kegg_all.pep --threads 30
/Users/opt/anaconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:103: UserWarning: Database does not exist at path None
  warnings.warn('Database does not exist at path %s' % description_loc)

How do I resolve this issue?

rmFlynn commented 1 year ago

Have you upgraded to the latest DRAM? I ask because I thought this was fixed. Let me know the output of DRAM-setup.py print_config and DRAM-setup.py version.

rmFlynn commented 1 year ago

On second thought, @SF-Dragon, the output you gave only contains a warning. Where are the errors? This warning is expected and should not have stopped the setup.

SF-Dragon commented 1 year ago

Thank you very much for your help. I used the latest version of DRAM, 1.4.0. DRAM-setup.py stopped while downloading vogdb, so I downloaded it manually.

Then I used the --<name>_loc options to skip downloading the databases I already have, but I was not able to finish the setup and got the following messages. The output of print_config is attached: setup.config1.txt

DRAM-setup.py prepare_databases --output_dir DRAMdata --kegg_loc /Users/DRAMdatabases/kegg_all.pep --threads 30 \
--kofam_hmm_loc /Users/DRAMdatabases/kofam_profiles.tar.gz \
--kofam_ko_list_loc /Users/DRAMdatabases/kofam_ko_list.tsv.gz \
--uniref_loc /Users/DRAMdatabases/uniref90.fasta.gz \
--pfam_loc /Users/DRAMdatabases/Pfam-A.full.gz \
--pfam_hmm_loc /Users/DRAMdatabases/Pfam-A.hmm.dat.gz \
--dbcan_loc /Users/DRAMdatabases/dbCAN-HMMdb-V11.txt \
--dbcan_fam_activities /Users/DRAMdatabases/CAZyDB.08062022.fam-activities.txt \
--vogdb_loc /Users/DRAMdatabases/vog.hmm.tar.gz \
--vog_annotations /Users/DRAMdatabases/vog.annotations.tsv.gz
/Users/opt/anaconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:103: UserWarning: Database does not exist at path None
  warnings.warn('Database does not exist at path %s' % description_loc)
2022-12-02 10:11:16,885 - Starting the process of downloading data
Traceback (most recent call last):
  File "/Users/opt/anaconda3/envs/DRAM/bin/DRAM-setup.py", line 184, in <module>
    args.func(**args_dict)
  File "/Users/opt/anaconda3/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 540, in prepare_databases
    raise ValueError(f"The fallowing user provided paths don't seem to exist: {missing_user_inputs}")
ValueError: The fallowing user provided paths don't seem to exist: ['kegg', 'kofam_hmm', 'kofam_ko_list', 'uniref', 'pfam', 'pfam_hmm', 'dbcan', 'vogdb']

rmFlynn commented 1 year ago

Thanks for all the details you provided. With them I was able to find a rather obnoxious bug, which has now been fixed. I haven't been able to check conda today, but the fix should have gone through their database by now, and you should be able to simply update DRAM. You will probably still get the warning, but not the error, and everything should work. Let me know if you have more problems and I'll address them quickly.

SF-Dragon commented 1 year ago

It works now! Thank you very much.