cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0

Downloading protein fasta files stopped at step 9 #33

Open ywangbioinfo opened 3 years ago

ywangbioinfo commented 3 years ago

Dear MicrobeAnnotator developer,

Today, while building the MicrobeAnnotator database, I ran into trouble at step 9. The download always stops as shown below. I look forward to your advice.

    $ microbeannotator_db_builder -d MicrobeAnnotator_DB -m diamond --step 9 --no_aspera
    2021-08-05 16:05:18,015 [INFO]: This is MicrobeAnnotator v2.0.4
    2021-08-05 16:05:18,016 [INFO]: I will download and format the databases I use.
    2021-08-05 16:05:18,016 [INFO]: Creating database folders
    2021-08-05 16:05:18,016 [INFO]: Step 9
    2021-08-05 16:05:18,016 [INFO]: Downloading protein fasta files using wget.
    100% [........................................................] 18619784 / 18619784
    multiprocessing.pool.RemoteTraceback:
    """
    Traceback (most recent call last):
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 1573, in ftp_open
        fp, retrlen = fw.retrfile(file, type)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 2437, in retrfile
        conn, retrlen = self.ftp.ntransfercmd(cmd)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/ftplib.py", line 361, in ntransfercmd
        source_address=self.source_address)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/socket.py", line 728, in create_connection
        raise err
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/socket.py", line 716, in create_connection
        sock.connect(sa)
    TimeoutError: [Errno 110] Connection timed out

During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/multiprocessing/pool.py", line 121, in worker
        result = (True, func(*args, **kwds))
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/multiprocessing/pool.py", line 44, in mapstar
        return list(map(*args))
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/site-packages/microbeannotator/database/refseq_data_downloader.py", line 267, in refseq_multiprocess_downloader
        wget.download(file_url, out=output)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/site-packages/wget.py", line 526, in download
        (tmpfile, headers) = ulib.urlretrieve(binurl, tmpfile, callback)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 247, in urlretrieve
        with contextlib.closing(urlopen(url, data)) as fp:
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 525, in open
        response = self._open(req, data)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 543, in _open
        '_open', req)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 503, in _call_chain
        result = func(*args)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 1584, in ftp_open
        raise exc.with_traceback(sys.exc_info()[2])
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 1573, in ftp_open
        fp, retrlen = fw.retrfile(file, type)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/urllib/request.py", line 2437, in retrfile
        conn, retrlen = self.ftp.ntransfercmd(cmd)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/ftplib.py", line 361, in ntransfercmd
        source_address=self.source_address)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/socket.py", line 728, in create_connection
        raise err
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/socket.py", line 716, in create_connection
        sock.connect(sa)
    urllib.error.URLError: <urlopen error ftp error: TimeoutError(110, 'Connection timed out')>
    """

The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 445, in <module>
        main()
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 437, in main
        single_step, aspera, keep_temp, bin_path)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 178, in database_duilder
        database_directory, threads)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/site-packages/microbeannotator/database/refseq_data_downloader.py", line 151, in refseq_fasta_downloader_wget
        pool.map(refseq_multiprocess_downloader, file_list)
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/multiprocessing/pool.py", line 268, in map
        return self._map_async(func, iterable, mapstar, chunksize).get()
      File "/home/ubuntu20/miniconda3/envs/microbeannotator/lib/python3.7/multiprocessing/pool.py", line 657, in get
        raise self._value
    urllib.error.URLError: <urlopen error ftp error: TimeoutError(110, 'Connection timed out')>

sheaster commented 3 years ago

I had a similar problem with step 9 until I used Aspera. It appears that the Aspera download sorts the step 9 downloads into folders (bacteria and viral), while the wget download (--no_aspera) does not. I'm not sure whether this is causing the problem.

silvtal commented 3 years ago

I had the same problem so I'm trying something else. I hope it helps.

Basically, you can edit the scripts and comment out the steps giving you problems / that you don't need. So for step 9:

1) find the source file

whereis microbeannotator_db_builder
>> /usr/local/bin/microbeannotator_db_builder
gedit /usr/local/bin/microbeannotator_db_builder

2) comment out the refseq step at line 171

    # Download RefSeq Proteins
    if step == 9:
        logger.info(f"Step 9")
#        if aspera:
#            refseq_prot = refseq.refseq_fasta_downloader(database_directory)
#        else:
#            refseq_prot = refseq.refseq_fasta_downloader_wget(
#                database_directory, threads)
#        database_files['RefSeq_Fasta'] = str(refseq_prot)
        if single_step:
            step = 15
        else:
            step += 1

3) save the following code as a short .py script and run it separately

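from microbeannotator.database import refseq_data_downloader as r

# Database directory: the value passed to microbeannotator_db_builder with -d
db = "MicrobeAnnotator_DB"
# Run only the RefSeq protein download, using a single download process
r.refseq_fasta_downloader_wget(output_file_folder=db, threads=1)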
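Since the failure reported in this thread is a connection timeout, one option is to wrap the call in a small retry loop. This is only a sketch: `retry` is a hypothetical helper, not part of MicrobeAnnotator, and the attempt count and delay are arbitrary. I have not checked whether refseq_fasta_downloader_wget resumes or starts over when called again, so each retry may repeat the whole download.

```python
import time


def retry(func, attempts=3, delay_seconds=30.0):
    """Call func() until it succeeds or the attempts run out.

    Only OSError is retried; urllib.error.URLError and TimeoutError,
    the exceptions in the traceback above, are both subclasses of it.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except OSError as error:
            if attempt == attempts:
                raise  # give up and show the last error
            print(f"Attempt {attempt} failed ({error}); retrying in {delay_seconds}s")
            time.sleep(delay_seconds)
```

In the script above, the last line would then become `retry(lambda: r.refseq_fasta_downloader_wget(output_file_folder=db, threads=1))`.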

If that doesn't work either, you can do it manually: download the "protein.faa.gz" files from https://ftp.ncbi.nlm.nih.gov/refseq/release/{domain}/, where {domain} is viral, bacteria, and archaea, merge all of them together like the refseq_fasta_downloader_wget function does, and store the merged "refseq_protein.fasta" file in the "protein_db" folder.
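A rough sketch of that manual route in Python, using only the standard library. Several things here are assumptions on my part: that the HTTPS directory listing is an HTML page with plain `href` links, that every file ending in ".protein.faa.gz" should be included, that the merged file should be uncompressed, and that "protein_db" sits directly inside the database directory. The function names are made up for this example and are not MicrobeAnnotator's.

```python
import gzip
import re
import shutil
import urllib.request
from pathlib import Path

BASE_URL = "https://ftp.ncbi.nlm.nih.gov/refseq/release"
DOMAINS = ("viral", "bacteria", "archaea")


def list_protein_files(index_html):
    """Return the sorted, de-duplicated *.protein.faa.gz names linked in a listing page."""
    return sorted(set(re.findall(r'href="([^"/]+\.protein\.faa\.gz)"', index_html)))


def merge_gzipped_fastas(gz_paths, merged_path):
    """Decompress each .gz file in order and append it to one plain FASTA file."""
    with open(merged_path, "wb") as merged:
        for gz_path in gz_paths:
            with gzip.open(gz_path, "rb") as handle:
                shutil.copyfileobj(handle, merged)


def build_refseq_fasta(db_dir, timeout=60):
    """Download the protein files for every domain, then merge them."""
    protein_dir = Path(db_dir) / "protein_db"  # assumed location
    protein_dir.mkdir(parents=True, exist_ok=True)
    downloaded = []
    for domain in DOMAINS:
        with urllib.request.urlopen(f"{BASE_URL}/{domain}/", timeout=timeout) as page:
            names = list_protein_files(page.read().decode("utf-8", "replace"))
        for name in names:
            target = protein_dir / name
            if not target.exists():  # skip files finished in an earlier run
                partial = protein_dir / (name + ".part")
                url = f"{BASE_URL}/{domain}/{name}"
                with urllib.request.urlopen(url, timeout=timeout) as src, open(partial, "wb") as dst:
                    shutil.copyfileobj(src, dst)
                partial.replace(target)  # rename only after a complete download
            downloaded.append(target)
    merged_path = protein_dir / "refseq_protein.fasta"
    merge_gzipped_fastas(downloaded, merged_path)
    return merged_path
```

Calling `build_refseq_fasta("MicrobeAnnotator_DB")` would start the download. Files are written to a ".part" name first, so a run that times out can be restarted without re-fetching the files that already completed. Expect the bacteria set to take a long time and a lot of disk space.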

EricDeveaud commented 2 years ago

wrong window