WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

DRAM issue with setting up databases #196

Closed yugen-miyahara closed 1 year ago

yugen-miyahara commented 1 year ago

Hi there,

I originally ran DRAM-v for annotations and then tried to run the distillation, but I only got output in 1/3 of the distillation output files and I couldn't get a heatmap. To fix it, I tried updating DRAM to the new version, but then I lost the databases. After a lot of errors from trying to follow multiple other issues and telling DRAM where my database files are, I decided to just try redownloading the databases. Sorry, I feel like I'm going to be asking multiple other questions.

Now I have the same problem as in #issue189:

DRAM-setup.py prepare_databases --output_dir /Volumes/Yugen_HD/DRAM_data --pfam_loc /Volumes/Yugen_HD/DRAM_data/Pfam-A.full.gz --pfam_hmm_dat /Volumes/Yugen_HD/DRAM_data/Pfam-A.hmm.dat.gz
2022-08-04 12:02:28.699808: Database preparation started
Downloading dbCAN family activities from : https://bcb.unl.edu/dbCAN2/download/Databases/V10/CAZyDB.07292021.fam-activities.txt
Downloading dbCAN from: http://bcb.unl.edu/dbCAN2/download/dbCAN-HMMdb-V10.txt
0:00:32.365817: dbCAN database processed
7:24:02.188046: UniRef database processed
16:50:05.288329: PFAM database processed
Traceback (most recent call last):
  File "/Volumes/Yugen_HD/envs/DRAM/bin/DRAM-setup.py", line 158, in <module>
    args.func(**args_dict)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 317, in prepare_databases
    output_dbs['viral_db_loc'] = download_and_process_viral_refseq(viral_loc, temporary, threads=threads,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 163, in download_and_process_viral_refseq
    download_file(refseq_url, refseq_faa, verbose=verbose)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/utils.py", line 16, in download_file
    run_process(['wget', '-O', output_file, url], verbose=verbose)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/utils.py", line 27, in run_process
    return subprocess.run(command, check=check, shell=shell, stdout=subprocess.PIPE,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['wget', '-O', '/Volumes/Yugen_HD/DRAM_data/database_files/viral.1.protein.faa.gz', 'ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz']' died with <Signals.SIGABRT: 6>.

I tried downloading all of the database files from your links, through Google Chrome or with the wget command, but for "kofam_ko_list, from: ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz" and "viral from: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.%s.protein.faa.gz" I am asked to type in a username and password.

Is there another way to download those files?

Cheers, Yugen

rmFlynn commented 1 year ago

Hmm, it looks like you can't get the files because of FTP problems; that is common enough, although a password prompt is unheard of. Perhaps something is blocking you from within your network? In any case, I will just attach the files you need: viral.4.protein.faa.gz and ko_list.gz.
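
One workaround worth trying when plain FTP is blocked: NCBI's download host also answers over HTTPS at the same path, so the RefSeq URL from the traceback can be rewritten before fetching. The sketch below is only an illustration and not part of DRAM; `ftp_to_https` and `fetch` are hypothetical helper names, and whether any other host (for example ftp.genome.jp) serves the same path over HTTPS is an assumption to check first.

```python
import shutil
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def ftp_to_https(url: str) -> str:
    """Rewrite an ftp:// URL to https://, keeping host and path unchanged."""
    parts = urlsplit(url)
    if parts.scheme != "ftp":
        return url  # already http(s) or something else; leave it alone
    return urlunsplit(("https", parts.netloc, parts.path, parts.query, parts.fragment))


def fetch(url: str, dest: str) -> None:
    """Stream the (rewritten) URL to a local file. Hypothetical helper."""
    with urllib.request.urlopen(ftp_to_https(url)) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Example call (not executed here because it needs network access):
# fetch("ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz",
#       "viral.1.protein.faa.gz")
```

A file fetched this way can then be handed to prepare_databases with --viral_loc, as is done later in this thread.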

yugen-miyahara commented 1 year ago

Thank you!

As in #issue189, I used "DRAM-setup.py prepare_databases --help" and tried to set up the databases again, but I got this error:

(DRAM) yugenuni@iMac-4 Yugen_HD % DRAM-setup.py prepare_databases --output_dir /Volumes/Yugen_HD/DRAM_data --pfam_loc /Volumes/Yugen_HD/DRAM_data/Pfam-A.full.gz --pfam_hmm_dat /Volumes/Yugen_HD/DRAM_data/Pfam-A.hmm.dat.gz --kofam_ko_list_loc /Volumes/Yugen_HD/DRAM_data/ko_list.gz --peptidase_loc /Volumes/Yugen_HD/DRAM_data/pepunit.lib --kofam_hmm_loc /Volumes/Yugen_HD/DRAM_data/profiles.tar.gz --uniref_loc /Volumes/Yugen_HD/DRAM_data/database_files/uniref90.fasta.gz --dbcan_loc /Volumes/Yugen_HD/DRAM_data/dbCAN-HMMdb-V10.txt --dbcan_fam_activities /Volumes/Yugen_HD/DRAM_data/CAZyDB.07292021.fam-activities.txt --dbcan_version 10 --viral_loc /Volumes/Yugen_HD/DRAM_data/database_files/viral.4.protein.faa.gz
2022-08-06 14:50:27.742421: Database preparation started
Traceback (most recent call last):
  File "/Volumes/Yugen_HD/envs/DRAM/bin/DRAM-setup.py", line 158, in <module>
    args.func(**args_dict)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 288, in prepare_databases
    mkdir(temporary)
FileExistsError: [Errno 17] File exists: '/Volumes/Yugen_HD/DRAM_data/database_files'

Do I need to set up the databases again like this, or can I just import a text file with the locations of the databases? Something I have just tried is putting the two database files that couldn't be downloaded into a separate folder and specifying their locations when setting up the databases, so that I don't get the error that database_files exists. Just waiting for the rest of the databases to be downloaded.

This is my DRAM-setup.py print_config:

/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:51: UserWarning: Database does not exist at path None
  warnings.warn('Database does not exist at path %s' % self.description_loc)
Processed search databases
KEGG db: None
KOfam db: None
KOfam KO list: None
UniRef db: None
Pfam db: None
dbCAN db: None
RefSeq Viral db: None
MEROPS peptidase db: None
VOGDB db: None
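
A print_config listing like this one can be checked mechanically for entries that are still unset. The following is a minimal sketch, assuming the output has one "name: value" entry per line; `unset_databases` is a hypothetical helper, not a DRAM function, and the sample text mixes lines from this thread purely for illustration.

```python
def unset_databases(print_config_text: str) -> list[str]:
    """Return the names of entries whose configured value is exactly 'None'."""
    unset = []
    for line in print_config_text.splitlines():
        # Split on the first colon only; paths after it may contain colons.
        name, sep, value = line.partition(":")
        if sep and value.strip() == "None":
            unset.append(name.strip())
    return unset


# Illustrative sample: two unset entries and one path seen later in the thread.
sample = """Processed search databases
KEGG db: None
KOfam db: None
RefSeq Viral db: /Volumes/Yugen_HD/DRAM_databases/refseq_viral.20220824.mmsdb
"""
print(unset_databases(sample))  # -> ['KEGG db', 'KOfam db']
```

If every entry comes back as unset, as in the listing above, the config most likely was never written, which fits a setup run that stopped before its final step.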

rmFlynn commented 1 year ago

You just need to delete or move the old output from your failed setup (/Volumes/Yugen_HD/DRAM_data). DRAM will not overwrite an existing folder or file.
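
Since the traceback shows `mkdir` failing on an existing `database_files` folder, one cautious option is to rename the old folder instead of deleting it, so nothing already downloaded is lost. This is a sketch under that assumption only; `move_aside` is a hypothetical helper and the `.old-<timestamp>` suffix is an arbitrary naming choice.

```python
import time
from pathlib import Path
from typing import Optional


def move_aside(path: str) -> Optional[str]:
    """Rename an existing file or folder so a rerun can recreate it.

    Returns the new location, or None if nothing existed at `path`.
    """
    target = Path(path)
    if not target.exists():
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = target.with_name(f"{target.name}.old-{stamp}")
    target.rename(backup)  # same volume, so no data is copied
    return str(backup)


# Example call (path taken from the error message in this thread):
# move_aside("/Volumes/Yugen_HD/DRAM_data/database_files")
```

Renaming only clears the way for the rerun; whether the files in the renamed folder can be reused depends on passing them back explicitly to prepare_databases.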

yugen-miyahara commented 1 year ago

I have another error where viral.1.protein.faa.gz couldn't be downloaded properly.

(DRAM) yugenuni@iMac-4 ~ % DRAM-setup.py prepare_databases --verbose --keep_database_files --output_dir /Volumes/Yugen_HD/DRAM_data --uniref_loc /Volumes/Yugen_HD/DRAM_data/uniref90.fasta.gz --pfam_loc /Volumes/Yugen_HD/DRAM_data/Pfam-A.full.gz --pfam_hmm_dat /Volumes/Yugen_HD/DRAM_data/Pfam-A.hmm.dat.gz --kofam_hmm_loc /Volumes/Yugen_HD/DRAM_data/profiles.tar.gz --dbcan_loc /Volumes/Yugen_HD/DRAM_data/dbCAN-HMMdb-V10.txt --dbcan_fam_activities /Volumes/Yugen_HD/DRAM_data/CAZyDB.07292021.fam-activities.txt --vogdb_loc /Volumes/Yugen_HD/DRAM_data/vog.hmm.tar.gz --vog_annotations /Volumes/Yugen_HD/DRAM_data/vog.annotations.tsv.gz --peptidase_loc /Volumes/Yugen_HD/DRAM_data/pepunit.lib --genome_summary_form_loc /Volumes/Yugen_HD/DRAM_data/genome_summary_form.tsv --module_step_form_loc /Volumes/Yugen_HD/DRAM_data/module_step_form.tsv --etc_module_database_loc /Volumes/Yugen_HD/DRAM_data/etc_module_database.tsv --function_heatmap_form_loc /Volumes/Yugen_HD/DRAM_data/function_heatmap_form.tsv
2022-08-08 11:41:04.520782: Database preparation started
0:00:33.625230: dbCAN database processed
6:27:27.480869: UniRef database processed
16:51:31.044183: PFAM database processed
downloading ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz
dyld[44001]: missing symbol called
Traceback (most recent call last):
  File "/Volumes/Yugen_HD/envs/DRAM/bin/DRAM-setup.py", line 158, in <module>
    args.func(**args_dict)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 317, in prepare_databases
    output_dbs['viral_db_loc'] = download_and_process_viral_refseq(viral_loc, temporary, threads=threads,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 163, in download_and_process_viral_refseq
    download_file(refseq_url, refseq_faa, verbose=verbose)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/utils.py", line 16, in download_file
    run_process(['wget', '-O', output_file, url], verbose=verbose)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/utils.py", line 27, in run_process
    return subprocess.run(command, check=check, shell=shell, stdout=subprocess.PIPE,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['wget', '-O', '/Volumes/Yugen_HD/DRAM_data/database_files/viral.1.protein.faa.gz', 'ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.1.protein.faa.gz']' died with <Signals.SIGABRT: 6>.

I already have the viral.1.protein.faa.gz file downloaded, so I have now specified where it is with --viral_loc. I also deleted my "database_files" folder, so now I have to download the 700GB again. Is there a way to specify where all the files in the database_files folder are, rather than deleting the folder and downloading another 700GB? You did mention above that we have to delete or move the output files. There seem to be no options in "prepare_databases -h" to specify any of these files. Or is this done with set_database_locations?

rmFlynn commented 1 year ago

It should be possible to specify every file in the database folder rather than download them. Sadly, and this is very sad, there is no single file with all the data. If -h didn't show you the solution, I think you might need to type --help. Regrettably, you will end up with one very long command, but in your situation it might be the only way.
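
To keep such a long command manageable, it can be assembled from a table of flags and file names. The sketch below is not an official DRAM recipe: the flag names and file names are copied from the commands posted earlier in this thread, `build_setup_command` is a hypothetical helper, and only files that actually exist in the input folder are added to the command.

```python
import shlex
from pathlib import Path

# Flags and file names as they appear in the commands in this thread.
FLAG_TO_FILE = {
    "--uniref_loc": "uniref90.fasta.gz",
    "--pfam_loc": "Pfam-A.full.gz",
    "--pfam_hmm_dat": "Pfam-A.hmm.dat.gz",
    "--kofam_hmm_loc": "profiles.tar.gz",
    "--kofam_ko_list_loc": "ko_list.gz",
    "--dbcan_loc": "dbCAN-HMMdb-V10.txt",
    "--dbcan_fam_activities": "CAZyDB.07292021.fam-activities.txt",
    "--vogdb_loc": "vog.hmm.tar.gz",
    "--vog_annotations": "vog.annotations.tsv.gz",
    "--peptidase_loc": "pepunit.lib",
    "--viral_loc": "viral.4.protein.faa.gz",
}


def build_setup_command(input_dir: str, output_dir: str) -> list[str]:
    """Build the prepare_databases argument list from files found in input_dir."""
    command = ["DRAM-setup.py", "prepare_databases", "--output_dir", output_dir]
    for flag, filename in FLAG_TO_FILE.items():
        candidate = Path(input_dir) / filename
        if candidate.is_file():  # skip anything that was not downloaded
            command += [flag, str(candidate)]
    return command


# Placeholder folder names; prints a copy-and-paste-ready shell command.
print(shlex.join(build_setup_command("downloads", "DRAM_data")))
```

Any database whose file is missing from the input folder is simply left out, in which case prepare_databases would presumably try to download it as before.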

rmFlynn commented 1 year ago

I would honestly consider fixing this, but I have two weeks of business trips and some very big projects to do in the meantime. I'm sorry this will not get as much attention. Tomorrow morning, I'll try to get you an example command with all files specified; I have one sitting around somewhere.

yugen-miyahara commented 1 year ago

That's no problem. I really appreciate the quick responses.

yugen-miyahara commented 1 year ago

It got the furthest it's been so far.

/Volumes/Yugen_HD/DRAM_data/ko_list already exists -- do you wish to overwrite (y or n)? n
not overwriting
21:49:07.901437: KOfam ko list processed
21:49:07.901548: PFAM hmm dat processed
21:49:07.901562: dbCAN fam activities processed
21:49:07.901598: VOGdb annotations processed
downloading https://raw.githubusercontent.com/shafferm/DRAM/master/data/amg_database.tsv
--2022-08-10 12:13:14-- https://raw.githubusercontent.com/shafferm/DRAM/master/data/amg_database.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21569 (21K) [text/plain]
Saving to: ‘/Volumes/Yugen_HD/DRAM_data/database_files/amg_database.20220810.tsv’

/Volumes/Yugen_HD/DRAM_da 100%[=====================================>] 21.06K --.-KB/s in 0s

2022-08-10 12:13:50 (62.3 MB/s) - ‘/Volumes/Yugen_HD/DRAM_data/database_files/amg_database.20220810.tsv’ saved [21569/21569]

21:49:46.581508: DRAM databases and forms downloaded
21:49:47.039261: Files moved to final destination
/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:51: UserWarning: Database does not exist at path None
  warnings.warn('Database does not exist at path %s' % self.description_loc)
Traceback (most recent call last):
  File "/Volumes/Yugen_HD/envs/DRAM/bin/DRAM-setup.py", line 158, in <module>
    args.func(**args_dict)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 375, in prepare_databases
    db_handler.set_database_paths(output_dbs)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 104, in set_database_paths
    self.db_locs['kofam_ko_list'] = check_exists_and_add_to_location_dict(kofam_ko_list_loc,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 101, in check_exists_and_add_to_location_dict
    raise ValueError("Database location does not exist: %s" % loc)
ValueError: Database location does not exist: /Volumes/Yugen_HD/DRAM_data/kofam_ko_list.tsv

I should've answered yes. I'll have to try prepare_databases again. I still only get the same instructions with --help and -h. Is the database_files folder specified by "--keep_database_files"?

I was also wondering when it's preparing the databases in the database_files folder, is it downloading from the internet or extracting from the database files I have already specified?

yugen-miyahara commented 1 year ago

For the kofam_ko_list.tsv error where it can't find the database file, I tried to do what alisDRI did in Issue#157 but it didn't work.

yugen-miyahara commented 1 year ago

I figured out I just need to have the database input files in another folder. The link to the "etc_module_database.tsv" file is not working. Is it possible if someone could please upload it here so I can put it in the databases folder?

Cheers, Yugen
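
The overwrite prompt for ko_list and the missing kofam_ko_list.tsv both appeared while the input files sat inside the output directory, so a quick pre-flight check can flag that layout before a long run starts. This is a minimal sketch; `inputs_inside_output` is a hypothetical helper, and the second path in the example is an invented "inputs" folder used only for contrast.

```python
from pathlib import Path


def inputs_inside_output(input_paths: list[str], output_dir: str) -> list[str]:
    """Return the input paths that are located inside output_dir."""
    out = Path(output_dir).resolve()
    clashes = []
    for raw in input_paths:
        resolved = Path(raw).resolve()
        # An input clashes if the output folder is one of its parent folders.
        if out in resolved.parents:
            clashes.append(raw)
    return clashes


print(inputs_inside_output(
    ["/Volumes/Yugen_HD/DRAM_data/ko_list.gz",      # inside the output dir
     "/Volumes/Yugen_HD/DRAM_inputs/ko_list.gz"],   # hypothetical separate folder
    "/Volumes/Yugen_HD/DRAM_data",
))
```

An empty result means every input lives outside the output folder, which is the layout that worked in the comment above.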

yugen-miyahara commented 1 year ago

I was able to get the file. The database setup ran up to a point but then failed as in #133, where the UniRef database names are incorrect.

error:
21:12:44.261656: Files moved to final destination
/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:51: UserWarning: Database does not exist at path /Volumes/Yugen_HD/DRAM_databases/description_db.sqlite
  warnings.warn('Database does not exist at path %s' % self.description_loc)
Traceback (most recent call last):
  File "/Volumes/Yugen_HD/envs/DRAM/bin/DRAM-setup.py", line 158, in <module>
    args.func(**args_dict)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 374, in prepare_databases
    db_handler.populate_description_db(output_dbs['description_db_loc'], update_config=False)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 235, in populate_description_db
    self.add_descriptions_to_database(self.make_header_dict_from_mmseqs_db(self.db_locs['uniref']) ,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py", line 155, in make_header_dict_from_mmseqs_db
    mmseqs_headers_handle = open('%s_h' % mmseqs_db, 'rb')
FileNotFoundError: [Errno 2] No such file or directory: '/Volumes/Yugen_HD/DRAM_databases/uniref90.20220823.mmsdb_h'
(it should be 20220824)
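
The mismatch in this error (the config expects a 20220823 stamp while the file on disk carries 20220824) can be confirmed by listing the date stamps actually present in the database folder. The sketch below assumes the `<prefix>.<YYYYMMDD>.mmsdb_h` naming seen in the error message; `find_date_stamps` is a hypothetical helper, not part of DRAM.

```python
import re
from pathlib import Path

# Matches names such as 'uniref90.20220824.mmsdb_h' (pattern taken from the error).
STAMPED = re.compile(r"^(?P<prefix>.+)\.(?P<date>\d{8})\.mmsdb_h$")


def find_date_stamps(db_dir: str, prefix: str) -> list[str]:
    """List the date stamps of '<prefix>.<YYYYMMDD>.mmsdb_h' files in db_dir."""
    stamps = []
    for path in Path(db_dir).iterdir():
        match = STAMPED.match(path.name)
        if match and match.group("prefix") == prefix:
            stamps.append(match.group("date"))
    return sorted(stamps)


# Example call (folder taken from the error message in this thread):
# find_date_stamps("/Volumes/Yugen_HD/DRAM_databases", "uniref90")
```

If the stamp found on disk differs from the one in the error, the path recorded in the config is what needs to change; the files themselves are most likely fine.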

I did what you said to do in issue 133: I changed the database name to the correct date and reimported the config file. I was unsure what I needed to do next. After I reimported the config file and tried to run annotate, I got an error, likely because the databases aren't fully set up. The error is:

0:01:09.443686: Getting forward best hits from viral
Traceback (most recent call last):
  File "/Volumes/Yugen_HD/envs/DRAM/bin/DRAM-v.py", line 153, in <module>
    args.func(**args_dict)
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_vgfs.py", line 475, in annotate_vgfs
    annotations = annotate_fastas(contig_locs, output_dir, db_handler, min_contig_size, prodigal_mode, trans_table,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 1013, in annotate_fastas
    annotate_fasta(fasta_loc, fasta_name, fasta_dir, db_handler, min_contig_size, prodigal_mode, trans_table,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 921, in annotate_fasta
    annotations = annotate_orfs(gene_faa, db_handler, tmp_dir, start_time, custom_db_locs, custom_hmm_locs,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 814, in annotate_orfs
    annotation_list.append(do_blast_style_search(query_db, db_handler.db_locs['viral'], tmp_dir,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 671, in do_blast_style_search
    forward_hits = get_best_hits(query_db, target_db, working_dir, 'gene', db_name, bit_score_threshold,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/annotate_bins.py", line 67, in get_best_hits
    run_process(['mmseqs', 'search', query_db, target_db, query_target_db, tmp_dir, '--threads', str(threads)],
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/site-packages/mag_annotator/utils.py", line 27, in run_process
    return subprocess.run(command, check=check, shell=shell, stdout=subprocess.PIPE,
  File "/Volumes/Yugen_HD/envs/DRAM/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mmseqs', 'search', '/Users/yugenuni/Desktop/dramv-annotate/working_dir/final-viral-combined-for-dramv14/tmp/gene.mmsdb', '/Volumes/Yugen_HD/DRAM_databases/refseq_viral.20220824.mmsdb', '/Users/yugenuni/Desktop/dramv-annotate/working_dir/final-viral-combined-for-dramv14/tmp/gene_viral.mmsdb', '/Users/yugenuni/Desktop/dramv-annotate/working_dir/final-viral-combined-for-dramv14/tmp/tmp', '--threads', '28']' returned non-zero exit status 1.

I tried to run update_description_db but I get "zsh: killed", which according to another issue happens because I don't have enough RAM. However, my computer doesn't have any more RAM available. I was wondering how I can fix this.

Cheers, Yugen

rmFlynn commented 1 year ago

Given that you have built this database before, you may want to reduce the number of threads and, of course, skip UniRef with --skip_uniref, which is the most important way to reduce memory use. If that does not work, you may want to try the minimal data set from issue 30. Sadly, at this point there is no way around the memory issue except to have more memory.

yugen-miyahara commented 1 year ago

I am now using high-capacity storage (HCS) with unlimited storage to build the databases. I get to the point where UniRef again has the wrong dates and it can't find the correct file. I fixed the date issue by importing the updated config file. I then tried set_database_locations and update_description_db, but I still get "zsh: killed", even though I should have unlimited memory through the HCS?
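
Disk space and RAM are separate resources, so it may help to look at both before rerunning: extra storage gives the databases more room on disk but does not raise the memory available to the process. The sketch below reports the two side by side using only the standard library; the function names are hypothetical, and the `os.sysconf` lookup is an assumption that holds on Linux and may return nothing on other systems.

```python
import os
import shutil
from typing import Optional


def free_disk_gib(path: str) -> float:
    """Free space, in GiB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def physical_ram_gib() -> Optional[float]:
    """Installed physical RAM in GiB, or None if the platform cannot report it."""
    try:
        pages = os.sysconf("SC_PHYS_PAGES")
        page_size = os.sysconf("SC_PAGE_SIZE")
    except (AttributeError, ValueError, OSError):
        return None  # sysconf or these names are unavailable here
    return pages * page_size / 2**30


print(f"free disk: {free_disk_gib('.'):.1f} GiB")
ram = physical_ram_gib()
print("physical RAM:", "unknown" if ram is None else f"{ram:.1f} GiB")
```

If the RAM figure is small compared with the databases being processed, that would be consistent with the advice earlier in the thread to reduce threads and skip UniRef.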