cruizperez / MicrobeAnnotator

Pipeline for metabolic annotation of microbial genomes
Artistic License 2.0

CRC check failed when building the RefSeq database #48

Open thkuo opened 2 years ago

thkuo commented 2 years ago

Dear MicrobeAnnotator team,

I tried to fix the database with this command:

microbeannotator_db_builder --step 9  -t 12 -m diamond --bin_path /net/sgi/metagenomics/thkuo/bin/lib/diamond/ -d /net/sgi/metagenomics/thkuo/MicrobeAnnotator_DB/ --no_aspera --single_step --keep_temp

However, it failed as below:

2022-03-29 14:36:44,374 [INFO]: This is MicrobeAnnotator v2.0.5
2022-03-29 14:36:44,375 [INFO]: I will download and format the databases I use.
2022-03-29 14:36:44,375 [INFO]: Creating database folders
2022-03-29 14:36:44,375 [INFO]: Step 9
2022-03-29 14:36:44,375 [INFO]: Downloading protein fasta files using wget.
100% [........................................................................] 20057842 / 20057842
2022-03-29 15:11:28,974 [INFO]: Merging protein files
Traceback (most recent call last):
  File "/home/thkuo/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 445, in <module>
    main()
  File "/home/thkuo/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 437, in main
    single_step, aspera, keep_temp, bin_path)
  File "/home/thkuo/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 178, in database_duilder
    database_directory, threads)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/site-packages/microbeannotator/database/refseq_data_downloader.py", line 162, in refseq_fasta_downloader_wget
    copyfileobj(temp_file,merged_db)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/shutil.py", line 79, in copyfileobj
    buf = fsrc.read(length)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 465, in read
    self._read_eof()
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 512, in _read_eof
    hex(self._crc)))
OSError: CRC check failed 0xaa6010e4 != 0x61343b5e

When I tried it one more time, the error message became:

2022-03-29 17:10:25,448 [INFO]: This is MicrobeAnnotator v2.0.5
2022-03-29 17:10:25,448 [INFO]: I will download and format the databases I use.
2022-03-29 17:10:25,448 [INFO]: Creating database folders
2022-03-29 17:10:25,450 [INFO]: Step 9
2022-03-29 17:10:25,450 [INFO]: Downloading protein fasta files using wget.
100% [........................................................................] 20057842 / 20057842
2022-03-29 17:46:42,853 [INFO]: Merging protein files
Traceback (most recent call last):
  File "/home/thkuo/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 445, in <module>
    main()
  File "/home/thkuo/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 437, in main
    single_step, aspera, keep_temp, bin_path)
  File "/home/thkuo/miniconda3/envs/microbeannotator/bin/microbeannotator_db_builder", line 178, in database_duilder
    database_directory, threads)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/site-packages/microbeannotator/database/refseq_data_downloader.py", line 162, in refseq_fasta_downloader_wget
    copyfileobj(temp_file,merged_db)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/shutil.py", line 79, in copyfileobj
    buf = fsrc.read(length)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 482, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid distance code

It looks like something goes wrong when the compressed files are read. What could be the cause?

silvtal commented 2 years ago

Errors like this happen because the original software does not account for database files that get corrupted when a download stops halfway.

I made minor edits to some files to avoid this issue, and I think they might help you: https://github.com/cruizperez/MicrobeAnnotator/pull/38 (you would have to download it from my fork).
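
To illustrate the kind of check described above, here is a minimal sketch that reads every `.gz` file in a folder to the end and reports the ones that fail. It is not the code from the pull request; `is_valid_gzip` and `find_corrupted` are hypothetical helper names, and the folder to scan is whatever directory holds the downloaded files.

```python
import gzip
import zlib
from pathlib import Path


def is_valid_gzip(path, chunk_size=1024 * 1024):
    """Return True if the file decompresses to the end without an error.

    Reading the whole stream forces the gzip module to verify the CRC
    stored in the trailer, which is where the errors in the tracebacks
    above are raised.
    """
    try:
        with gzip.open(path, "rb") as handle:
            while handle.read(chunk_size):
                pass
        return True
    except (OSError, EOFError, zlib.error):
        # OSError covers "CRC check failed" and "Not a gzipped file",
        # EOFError covers truncated files, and zlib.error covers
        # "invalid distance code" / "invalid block type".
        return False


def find_corrupted(folder):
    """Return a sorted list of the .gz files in `folder` that fail the check."""
    return sorted(p for p in Path(folder).glob("*.gz") if not is_valid_gzip(p))
```

Files returned by `find_corrupted` could then be deleted and downloaded again before the merge step is retried.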

thkuo commented 2 years ago

Thank you for the suggestion. However, I tried your fork and it did not work in my environment:

(microbeannotator) thkuo@titan-compute-01:/net/sgi/metagenomics/thkuo/bin/test_microbeannotator$ ~/bin/MicrobeAnnotator.beta/bin/microbeannotator_db_builder --step 9  -t 12 -m diamond --bin_path /net/sgi/metagenomics/thkuo/bin/lib/diamond/ -d /net/sgi/metagenomics/thkuo/MicrobeAnnotator_DB/ --no_aspera
2022-04-04 14:21:54,197 [INFO]: This is MicrobeAnnotator v2.0.5
2022-04-04 14:21:54,197 [INFO]: I will download and format the databases I use.
2022-04-04 14:21:54,197 [INFO]: Creating database folders
2022-04-04 14:21:54,198 [INFO]: Step 9
2022-04-04 14:21:54,198 [INFO]: Downloading protein fasta files using wget.
100% [........................................................................] 20057842 / 20057842
2022-04-04 14:58:01,655 [INFO]: Merging protein files
Traceback (most recent call last):
  File "/home/thkuo/bin/MicrobeAnnotator.beta/bin/microbeannotator_db_builder", line 459, in <module>
    main()
  File "/home/thkuo/bin/MicrobeAnnotator.beta/bin/microbeannotator_db_builder", line 451, in main
    single_step, aspera, keep_temp, excludetrembl, bin_path)
  File "/home/thkuo/bin/MicrobeAnnotator.beta/bin/microbeannotator_db_builder", line 184, in database_builder
    database_directory, threads)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/site-packages/microbeannotator/database/refseq_data_downloader.py", line 162, in refseq_fasta_downloader_wget
    copyfileobj(temp_file,merged_db)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/shutil.py", line 79, in copyfileobj
    buf = fsrc.read(length)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 300, in read1
    return self._buffer.read1(size)
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/home/thkuo/miniconda3/envs/microbeannotator/lib/python3.7/gzip.py", line 482, in read
    uncompress = self._decompressor.decompress(buf, size)
zlib.error: Error -3 while decompressing data: invalid block type

In case you want to check the version, here is the information:

(microbeannotator) thkuo@titan-compute-01:~/bin/MicrobeAnnotator.beta$ git remote show origin
* remote origin
  Fetch URL: https://github.com/silvtal/MicrobeAnnotator.git
  Push  URL: https://github.com/silvtal/MicrobeAnnotator.git
  HEAD branch: master
  Remote branches:
    add-license-1 tracked
    development   tracked
    master        tracked
  Local branch configured for 'git pull':
    master merges with remote master
  Local ref configured for 'git push':
    master pushes to master (up to date)
* master
* 9b3620b silvtal, Wed Dec 22 13:00:25 2021 +0100: db_builder: re-download corrupted genbank downloads at step 10, merge steps 10 and 11, fix db_builder sqlite step; microbeannotator: fix --method_bin option
* c29275c silvtal, Tue Dec 14 16:47:48 2021 +0100: added corrupted RefSeq file correcting step
* d7220c1 silvtal, Thu Dec 9 20:42:07 2021 +0100: added --excludetrembl option to db builder
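
One detail visible in the traceback above: the `microbeannotator_db_builder` script runs from the fork checkout, but the `refseq_data_downloader.py` frame still points into the conda environment's `site-packages`. That suggests the fork's changes to that module may not be the code that ran. A minimal sketch for checking which file Python resolves for a module follows; `module_origin` is a hypothetical helper name.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would load for `name`, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except ImportError:
        # Raised when a parent package of a dotted name is not installed.
        return None
    return spec.origin if spec is not None else None


# Module name taken from the traceback above.
print(module_origin("microbeannotator.database.refseq_data_downloader"))
```

If the printed path is inside `site-packages`, the installed package is being used instead of the fork's copy.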

AhmedElsherbini commented 1 year ago

Hi guys, I have the same issue. Did you find a solution?

volcanihpc commented 2 months ago

Hello, it has something to do with NCBI's FTP. The workaround is to change 'ftp://' to 'https://' in lines 144 and 241 of .../database/refseq_data_downloader.py.

EDIT: Also, delete any previously downloaded files from the 'temp_refseq_proteins' folder.
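
The two steps of this workaround can be sketched in Python as below. This is only an illustration and not MicrobeAnnotator code: `to_https` and `clear_folder` are hypothetical helper names, and the URL in the comment is just an example of an NCBI address.

```python
import shutil
from pathlib import Path
from urllib.parse import urlsplit, urlunsplit


def to_https(url):
    """Return `url` with an ftp:// scheme swapped for https://.

    URLs with any other scheme are returned unchanged.
    """
    parts = urlsplit(url)
    if parts.scheme == "ftp":
        parts = parts._replace(scheme="https")
    return urlunsplit(parts)


def clear_folder(folder):
    """Delete everything inside `folder`, keep the folder, return the count."""
    removed = 0
    for entry in Path(folder).iterdir():
        if entry.is_dir():
            shutil.rmtree(entry)
        else:
            entry.unlink()
        removed += 1
    return removed


# Example (illustrative URL):
# to_https("ftp://ftp.ncbi.nlm.nih.gov/refseq/release/")
# gives "https://ftp.ncbi.nlm.nih.gov/refseq/release/"
```

Calling `clear_folder` on the 'temp_refseq_proteins' folder before rerunning the step means no partial file from an earlier attempt gets merged.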