WrightonLabCSU / DRAM

Distilled and Refined Annotation of Metabolism: A tool for the annotation and curation of function for microbial and viral genomes
GNU General Public License v3.0

URLopen error #254

Closed spencerlong1 closed 1 year ago

spencerlong1 commented 1 year ago

Hi,

I have left DRAM downloading the databases over the last few days and have run into the following error both times (which I know is common):

(DRAM) [sdl1u18@cyan51 ~]$ cd ../../scratch/sdl1u18/
(DRAM) [sdl1u18@cyan51 sdl1u18]$ DRAM-setup.py prepare_databases --output_dir ../../scratch/sdl1u18/
/home/sdl1u18/.conda/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_handler.py:123: UserWarning: Database does not exist at path None
  warnings.warn("Database does not exist at path %s" % description_loc)
2023-01-17 10:14:16,620 - Starting the process of downloading data
2023-01-17 10:14:16,620 - The kegg_loc argument was not used to specify a downloaded kegg file, and dram can not download it its self. So it is assumed that the user wants to set up DRAM without it
2023-01-17 10:14:16,620 - The gene_ko_link_loc argument was not used to specify a downloaded gene_ko_link file, and dram can not download it its self. So it is assumed that the user wants to set up DRAM without it
2023-01-17 10:14:16,620 - Database preparation started
2023-01-17 10:14:16,620 - Downloading kofam_hmm
2023-01-17 10:20:25,663 - Downloading kofam_ko_list
2023-01-17 10:20:30,338 - Downloading uniref
2023-01-17 18:47:01,406 - Downloading pfam
2023-01-17 18:48:11,888 - Downloading pfam_hmm
2023-01-17 18:48:12,088 - Downloading dbcan
2023-01-17 18:48:17,232 - Downloading dbcan_fam_activities
2023-01-17 18:48:17,232 - Downloading dbCAN family activities from : https://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fam-activities.txt
2023-01-17 18:48:17,878 - Downloading dbcan_subfam_ec
2023-01-17 18:48:17,879 - Downloading dbCAN sub-family encumber from : https://bcb.unl.edu/dbCAN2/download/Databases/V11/CAZyDB.08062022.fam.subfam.ec.txt
2023-01-17 18:48:18,887 - Downloading vogdb
2023-01-17 18:48:25,272 - Downloading vog_annotations
2023-01-17 18:48:25,593 - Downloading viral
2023-01-17 18:48:37,411 - Something went wrong with the download of the url: ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.protein.faa.gz
2023-01-17 18:48:37,411 - <urlopen error <urlopen error ftp error: error_perm('550 viral.2.protein.faa.gz: No such file or directory')>>
2023-01-17 18:48:37,840 - Something went wrong with the download of the url: https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/viral.2.protein.faa.gz
2023-01-17 18:48:37,840 - HTTP Error 404: Not Found
Traceback (most recent call last):
  File "/home/sdl1u18/.conda/envs/DRAM/bin/DRAM-setup.py", line 184, in
    args.func(**args_dict)
  File "/home/sdl1u18/.conda/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 532, in prepare_databases
    locs[i] = download_functions[i](
  File "/home/sdl1u18/.conda/envs/DRAM/lib/python3.10/site-packages/mag_annotator/database_processing.py", line 218, in download_viral
    download_file(url, output_name, logger, alt_urls=[url_http], verbose=verbose)
  File "/home/sdl1u18/.conda/envs/DRAM/lib/python3.10/site-packages/mag_annotator/utils.py", line 33, in download_file
    raise URLError("DRAM whas not able to download a key database, check the logg for details")
urllib.error.URLError: <urlopen error DRAM whas not able to download a key database, check the logg for details>

It looks like viral.2.protein.faa.gz hasn't downloaded. I can see that viral.1.protein.faa.gz is present in my database_files, so I am wondering why this might be. The people who run our HPC don't think it is the firewall (which has let everything else through so far), and FTP seems fine given that viral.1 made it through. I was also wondering how much more is required after this step, as I will just use the database_loc commands for what is already there (assuming uniref, pfam, etc. are fine at this stage?).

Apologies if this is basic; I am new to DRAM and annotation software as a whole!

Cheers! Spencer

nikolasbasler commented 1 year ago

Hello,

I am getting the same error, also with DRAM 1.4.5, which was recently made available on conda.

Looking at the FTP address (https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/), there is no viral.2.protein.faa.gz, so it's not surprising that DRAM doesn't find it. In case it helps to find out what's going on: all files in that folder are only a few days old (2023-01-13), and one day earlier (on the 12th) I could successfully download and prepare the databases, including viral (but with --skip_uniref). Now there is only a viral.1.protein.faa.gz at that FTP address.

Cheers, Nikolas
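
For anyone who wants to repeat the check described above without browsing the directory by hand, here is a minimal Python sketch. It assumes the NCBI directory is served as an HTML listing in which the file names appear literally; the function names viral_protein_files and fetch_listing are made up for this illustration and are not part of DRAM.

```python
import re
from typing import List
from urllib.request import urlopen

# Matches RefSeq viral protein FASTA names such as viral.1.protein.faa.gz
VIRAL_FILE_RE = re.compile(r"viral\.(\d+)\.protein\.faa\.gz")


def viral_protein_files(listing_text: str) -> List[str]:
    """Return the distinct viral.N.protein.faa.gz names found in a listing, sorted by N."""
    numbers = sorted({int(n) for n in VIRAL_FILE_RE.findall(listing_text)})
    return [f"viral.{n}.protein.faa.gz" for n in numbers]


def fetch_listing(url: str) -> str:
    """Download the directory listing as text (hypothetical helper, needs network access)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling viral_protein_files(fetch_listing("https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/")) should then show which numbered protein files the server currently offers, which is the quickest way to tell a moved file apart from a firewall problem.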

spencerlong1 commented 1 year ago

Hi Nikolas,

Good find, and I am seeing the same thing. I wonder whether viral.2 is no longer needed and, if so, whether there is a way to skip it, or alternatively a way to find the old versions. I will play around today.

Cheers, Spencer

rmFlynn commented 1 year ago

Thanks, guys. It looks like we might have to update the path; I will get on it.

rmFlynn commented 1 year ago

It looks like the change is real and also that it is here to stay. I am testing a fix to only pull one file now and will make a new point release when it is done. Or more likely, I will have @dmitrisvetlov do it.

JoseLopezArcondo commented 1 year ago

Hello, I am new to DRAM. I had the same issue with the latest conda version. How should we skip or solve this? If I run DRAM now, it seems it cannot find any database path, although the files are in "DRAM_data/database_files". Is that probably because the database download process did not finish correctly, since the viral.2 file is missing?

rmFlynn commented 1 year ago

First, can you post the output of DRAM-setup.py version?

rmFlynn commented 1 year ago

As in issue #236, you can download the file from https://ftp.ncbi.nlm.nih.gov/refseq/release/viral/. Put it in a folder on your server and point to it using DRAM-setup.py prepare_databases --viral_loc viral_file.faa.gz. There will only be one file, but if in the future there are separate viral files, you can cat them together to make the merged.faa.gz.
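
The "cat them together" step can also be done from Python when a shell is not convenient. The sketch below is only an illustration (concatenate_gzip_files is a made-up name, not a DRAM function). It relies on the fact that a gzip file may consist of several concatenated members, so joining the raw bytes of two .gz files yields a valid .gz file.

```python
import shutil
from pathlib import Path
from typing import Iterable


def concatenate_gzip_files(parts: Iterable[Path], merged: Path) -> Path:
    """Join gzip files byte for byte, like `cat a.gz b.gz > merged.gz`."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy the compressed bytes unchanged; no decompression is needed.
                shutil.copyfileobj(src, out)
    return merged
```

The merged file could then be passed to --viral_loc in the same way as a single downloaded file.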

JoseLopezArcondo commented 1 year ago

Thanks. The output of DRAM-setup.py version: 1.4.5

All right, I downloaded the viral.1.protein.faa.gz file and put it in the DRAM_data folder, but when running DRAM-setup.py prepare_databases --viral_loc path_to_viral.1.protein.faa.gz this happens:

FileExistsError: [Errno 17] File exists: './database_files'

I guess that, as I already ran the DRAM-setup.py prepare_databases --output_dir DRAM_data step, it collides with the already downloaded files...

rmFlynn commented 1 year ago

The latest version of DRAM in conda is 1.4.6 (https://anaconda.org/bioconda/dram). You will see in the release notes that it is the point release for single viral files. You may want to upgrade for future stability.

Yes, you must put it in a new location or delete the failed folder. For now, you must set up all the databases at the same time, at least if you want a reliable setup.
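
To make the "new location" advice concrete, here is a small Python sketch that refuses to build the setup command when the output directory already holds a database_files folder from an earlier run. It uses only the two options that appear in this thread (--output_dir and --viral_loc). The function name build_prepare_command is invented for this example, and the assumption that DRAM writes into output_dir/database_files is inferred from the paths and the FileExistsError quoted above.

```python
from pathlib import Path
from typing import List


def build_prepare_command(viral_loc: Path, output_dir: Path) -> List[str]:
    """Return the DRAM-setup.py argument list, or raise if the inputs look unsafe."""
    viral_loc = Path(viral_loc)
    output_dir = Path(output_dir)
    if not viral_loc.is_file():
        raise FileNotFoundError(f"Viral protein file not found: {viral_loc}")
    # Assumption from this thread: a leftover database_files folder triggers
    # FileExistsError, so demand a fresh location up front.
    if (output_dir / "database_files").exists():
        raise FileExistsError(
            f"{output_dir / 'database_files'} already exists; "
            "choose a new output directory or delete the failed one"
        )
    return [
        "DRAM-setup.py", "prepare_databases",
        "--output_dir", str(output_dir),
        "--viral_loc", str(viral_loc),
    ]
```

The returned list can be handed to subprocess.run(command, check=True) inside the DRAM conda environment.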

JoseLopezArcondo commented 1 year ago

All right, but is there any way to use the already downloaded files, or must I remove the DRAM_data folder and download everything again with the new version installed? Thanks.

rmFlynn commented 1 year ago

You can use already downloaded files by passing the matching _loc argument for each one (for example --viral_loc). Use DRAM-setup.py prepare_databases --help to see the many arguments. Then use a new location for the output. In my opinion it would be more work than it is worth, since downloading is typically the fast part of the setup process.

wuhuiyun07 commented 1 year ago

I am using DRAM version 1.4.6 and also have the same error:

database_handler.py:123: UserWarning: Database does not exist at path None
  warnings.warn("Database does not exist at path %s" % description_loc)