WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

Problem during database preparation #263

Open Gu2uMo opened 1 year ago

Gu2uMo commented 1 year ago

Hi, I'm trying the tool but have run into a problem. I downloaded the required files and ran the command below, but it always gets stuck at the "Populating the description db" step; the last attempt had been running for 17 hours. Do you have any suggestions? Thanks! The command was:

DRAM-setup.py prepare_databases --output_dir ./DRAM \
    --kofam_hmm_loc profiles.tar.gz \
    --kofam_ko_list_loc ko_list.gz \
    --uniref_loc uniref90.fasta.gz \
    --uniref_version 90 \
    --pfam_loc Pfam-A.full.gz \
    --pfam_hmm_loc Pfam-A.hmm.dat.gz \
    --dbcan_loc dbCAN-HMMdb-V11.txt \
    --dbcan_fam_activities CAZyDB.08062022.fam-activities.txt \
    --dbcan_version 11 \
    --vogdb_loc vog.hmm.tar.gz \
    --vog_annotations vog_annotations_latest.tsv.gz \
    --viral_loc viral.1.protein.faa.gz \
    --peptidase_loc merops_peptidases_nr.faa \
    --genome_summary_form_loc genome_summary_form.20230208.tsv \
    --module_step_form_loc module_step_form.20230208.tsv \
    --etc_module_database_loc etc_mdoule_database.20230208.tsv \
    --function_heatmap_form_loc function_heatmap_form.20230208.tsv \
    --amg_database_loc amg_database.20230208.tsv \
    --keep_database_files \
    --threads 64

The log is shown below:

2023-02-09 15:24:39,703 - Database preparation started
2023-02-09 15:24:39,703 - Copying Pfam-A.hmm.dat.gz to output_dir
2023-02-09 15:24:39,713 - Downloading dbcan_fam_activities
2023-02-09 15:24:39,713 - Downloading dbCAN family activities from : https://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fam-activities.txt
2023-02-09 15:24:42,309 - Downloading dbcan_subfam_ec
2023-02-09 15:24:42,309 - Downloading dbCAN sub-family encumber from : https://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fam.subfam.ec.txt
2023-02-09 15:24:44,117 - Downloading vog_annotations
2023-02-09 15:24:49,083 - Copying genome_summary_form.20230208.tsv to output_dir
2023-02-09 15:24:49,102 - Copying module_step_form.20230208.tsv to output_dir
2023-02-09 15:24:49,119 - Copying function_heatmap_form.20230208.tsv to output_dir
2023-02-09 15:24:49,123 - Copying amg_database.20230208.tsv to output_dir
2023-02-09 15:24:49,130 - Copying etc_mdoule_database.20230208.tsv to output_dir
2023-02-09 15:24:49,131 - All raw data files were downloaded successfully
2023-02-09 15:24:49,131 - Processing kofam_hmm
2023-02-09 15:35:08,342 - KOfam database processed
2023-02-09 15:35:08,398 - Moved kofam_hmm to final destination, configuration updated
2023-02-09 15:35:08,398 - Processing kofam_ko_list
2023-02-09 15:35:08,471 - KOfam ko list processed
2023-02-09 15:35:08,472 - Moved kofam_ko_list to final destination, configuration updated
2023-02-09 15:35:08,472 - Processing uniref
2023-02-09 16:03:18,420 - UniRef database processed
2023-02-09 16:03:18,421 - Moved uniref to final destination, configuration updated
2023-02-09 16:03:18,421 - Processing pfam
2023-02-09 16:22:38,708 - PFAM database processed
2023-02-09 16:22:38,709 - Moved pfam to final destination, configuration updated
2023-02-09 16:22:38,710 - Moved pfam_hmm to final destination, configuration updated
2023-02-09 16:22:38,710 - Processing dbcan
2023-02-09 16:22:40,227 - dbCAN database processed
2023-02-09 16:22:40,228 - Moved dbcan to final destination, configuration updated
2023-02-09 16:22:40,228 - Processing viral
2023-02-09 16:22:45,552 - RefSeq viral database processed
2023-02-09 16:22:45,553 - Moved viral to final destination, configuration updated
2023-02-09 16:22:45,553 - Processing peptidase
2023-02-09 16:22:53,257 - MEROPS database processed
2023-02-09 16:22:53,259 - Moved peptidase to final destination, configuration updated
2023-02-09 16:22:53,259 - Processing vogdb
2023-02-09 16:27:14,481 - VOGdb database processed
2023-02-09 16:27:14,495 - Moved vogdb to final destination, configuration updated
2023-02-09 16:27:14,496 - Moved genome_summary_form to final destination, configuration updated
2023-02-09 16:27:14,496 - Moved module_step_form to final destination, configuration updated
2023-02-09 16:27:14,497 - Moved etc_module_database to final destination, configuration updated
2023-02-09 16:27:14,498 - Moved function_heatmap_form to final destination, configuration updated
2023-02-09 16:27:14,499 - Moved amg_database to final destination, configuration updated
2023-02-09 16:27:14,500 - Moved dbcan_fam_activities to final destination, configuration updated
2023-02-09 16:27:14,500 - Moved dbcan_subfam_ec to final destination, configuration updated
2023-02-09 16:27:14,503 - Moved vog_annotations to final destination, configuration updated
2023-02-09 16:27:14,503 - Populating the description db, this may take some time

rfour92 commented 1 year ago

I have a similar issue: the setup script has been running for almost 4 days and is stuck at the "populating the description db" step. I am not sure whether it has finished; when I run print_config, I get:

2023-02-12 19:43:07,388 - Logging to console
Processed search databases
KEGG db: None
KOfam db: /ibex/scratch/projects/c2189/DRAM_data/kofam_profiles.hmm
KOfam KO list: /ibex/scratch/projects/c2189/DRAM_data/kofam_ko_list.tsv
UniRef db: /ibex/scratch/projects/c2189/DRAM_data/uniref90.20230210.mmsdb
Pfam db: /ibex/scratch/projects/c2189/DRAM_data/pfam.mmspro
dbCAN db: /ibex/scratch/projects/c2189/DRAM_data/dbCAN-HMMdb-V11.txt
RefSeq Viral db: /ibex/scratch/projects/c2189/DRAM_data/refseq_viral.20230210.mmsdb
MEROPS peptidase db: /ibex/scratch/projects/c2189/DRAM_data/peptidases.20230210.mmsdb
VOGDB db: /ibex/scratch/projects/c2189/DRAM_data/vog_latest_hmms.txt

Descriptions of search database entries
Pfam hmm dat: /ibex/scratch/projects/c2189/DRAM_data/Pfam-A.hmm.dat.gz
dbCAN family activities: /ibex/scratch/projects/c2189/DRAM_data/CAZyDB.08062022.fam-activities.txt
VOG annotations: /ibex/scratch/projects/c2189/DRAM_data/vog_annotations_latest.tsv.gz

Description db: /ibex/scratch/projects/c2189/DRAM_data/description_db.sqlite

DRAM distillation sheets
Genome summary form: /ibex/scratch/projects/c2189/DRAM_data/genome_summary_form.20230210.tsv
Module step form: /ibex/scratch/projects/c2189/DRAM_data/module_step_form.20230210.tsv
ETC module database: /ibex/scratch/projects/c2189/DRAM_data/etc_mdoule_database.20230210.tsv
Function heatmap form: /ibex/scratch/projects/c2189/DRAM_data/function_heatmap_form.20230210.tsv
AMG database: /ibex/scratch/projects/c2189/DRAM_data/amg_database.20230210.tsv

I am not sure whether this means the setup is complete, or whether some glitch prevented it from finishing.

Gu2uMo commented 1 year ago

@rfour92 Hello, thanks for your reply. I found that the file "description_db.sqlite" is still being updated slowly by the DRAM setup script, so I assume it has not finished. By the way, have you tried the annotation step? I have not killed the setup and the database file is locked, so I cannot try it at the moment.
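
One way to tell whether this step is still making progress, rather than hung, is to look inside the growing SQLite file and watch the row counts rise. The script below is only a rough sketch and is not part of DRAM; it makes no assumptions about the table names (which differ between DRAM versions) and simply enumerates whatever tables exist, so treat it as an illustration rather than an official check.

# check_description_db.py -- hedged sketch, not part of DRAM.
# Prints the size of description_db.sqlite and the row count of each table.
# Run it twice, a few minutes apart: rising counts mean the setup is still working.
import os
import sqlite3
import sys

db_path = sys.argv[1] if len(sys.argv) > 1 else "description_db.sqlite"
print(f"file size: {os.path.getsize(db_path) / 1e9:.2f} GB")

# Open read-only so we never interfere with the writer; if DRAM currently holds
# the write lock, the queries may wait (busy_timeout) or fail temporarily.
con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
con.execute("PRAGMA busy_timeout = 30000")
try:
    tables = [row[0] for row in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for table in tables:
        # COUNT(*) can itself take a while on the multi-gigabyte UniRef table.
        n = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        print(f"{table}: {n} rows")
finally:
    con.close()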

rfour92 commented 1 year ago

@Gu2uMo Hello, I just checked description_db.sqlite as well and it is still updating, so I guess I am in the same situation. However, my job will be killed in less than three days, so I hope it finishes well before that.

alsmadin01 commented 1 year ago

Hi @Gu2uMo @rfour92 @rmFlynn, I have a similar issue. Were you able to resolve it?

pooranis commented 1 year ago

Has anyone's build of the description db completed? Ours has been running for 48 hours, and the sqlite db is still growing slowly.

saras224 commented 1 year ago

What is the total size of the database downloaded?

mlhoggard commented 1 year ago

Just chiming in here as well to see if there has been any update or progress on this.

We've been having the same issue for a while now. Previous installs of DRAM ran fine, but over the last 6 months or so several attempts at upgrading (a fresh install of both DRAM and the databases) have all hung at this same step, where the only thing updating is description_db.sqlite. The current attempt has been running for 12 days (for the last 8 of which only the .sqlite file has been updating), and previous attempts failed after a few weeks stuck at this step.

Any information on whether anyone has managed to get this to complete, and/or whether a fix is in the works, would be much appreciated. Thanks!

pooranis commented 1 year ago

Our database preparation completed after 3 days running on our HPC, using 10 processors and a maximum of 443 GB of memory (!). The total size of the database directory is 686 GB. Hope this helps someone here.

mlhoggard commented 1 year ago

Thanks for the reply. Our run finally looks like it has worked. It required 12 days and a maximum of ~400 GB RAM, but our final database directory is 799 GB, so the databases appear to have grown substantially between your run and ours, which might explain why the description_db.sqlite update step now takes so much longer...

saras224 commented 1 year ago

Hi guys! I am happy to report that, after struggling for a long time, I am finally able to run DRAM. My database installation completed in 3 days; I ran the database setup script on our HPC with 586 GB RAM and 80 tasks per node on the smp partition. I can get the HTML version of the heatmaps, but I would like better image quality. Has anyone tried that before?

Thanks in advance! Saras

mlhoggard commented 1 year ago

Hi @saras224,

Congrats on getting it up and running.

I haven't spent that much time with the distill output yet, but I'm guessing the figures generated are meant as a first-look guide, rather than final publication quality figures.

DRAM.py distill also outputs product.tsv, which contains the data plotted in the heatmaps in product.html. You could bring that tsv file into whatever you normally use for plotting (e.g. R or Python, or Excel if you're more familiar with that) and make fresh high-resolution plots in your preferred format, as sketched below.

Cheers, Mike.
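
As a concrete illustration of that suggestion, here is a minimal Python sketch that reads product.tsv and redraws a heatmap at whatever resolution you need. It is not DRAM functionality; the assumption that the first column holds genome names and that the remaining numeric columns are module-completeness values (roughly 0 to 1) is a guess about the product.tsv layout, so adjust the column handling to match your actual file.

# replot_product.py -- hedged sketch for redrawing the distill heatmap at high resolution.
# Assumes product.tsv has genome identifiers in its first column and numeric
# module-completeness columns elsewhere; adapt as needed for your file.
import matplotlib
matplotlib.use("Agg")  # no display needed, e.g. on an HPC node
import matplotlib.pyplot as plt
import pandas as pd

product = pd.read_csv("product.tsv", sep="\t")
product = product.set_index(product.columns[0])   # first column = genome names (assumption)
numeric = product.select_dtypes("number")          # keep only the plottable columns

fig, ax = plt.subplots(figsize=(max(8, 0.25 * numeric.shape[1]),
                                max(4, 0.25 * numeric.shape[0])))
im = ax.imshow(numeric.values, aspect="auto", cmap="viridis", vmin=0, vmax=1)
ax.set_xticks(range(numeric.shape[1]))
ax.set_xticklabels(numeric.columns, rotation=90, fontsize=6)
ax.set_yticks(range(numeric.shape[0]))
ax.set_yticklabels(numeric.index, fontsize=6)
fig.colorbar(im, ax=ax, label="module completeness")
fig.tight_layout()
fig.savefig("product_heatmap.png", dpi=600)         # high-resolution output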

nologo68 commented 1 year ago

what is the total, complete size of the description_db.sqlite file?

saras224 commented 1 year ago

what is the total, complete size of the description_db.sqlite file?

38 GB

alegarritano commented 11 months ago

In case anyone is wondering, the database preparation (KEGG included) took 9 h 32 min to complete. It used a peak memory of 982.3 GB (out of 1.5 TB), and I requested 80 CPUs for the MMseqs step. The total size of the description_db file is ~39.28 GB. Good luck.

diego00012138 commented 5 months ago

@saras224 I am wondering if by chance you would know how large it will be without UniRef and KEGG.