DerrickWood / kraken2

The second version of the Kraken taxonomic sequence classification system
MIT License

Kraken2 database downloading error -invalid compressed data--format violated? #545

Open mathavanpu opened 2 years ago

mathavanpu commented 2 years ago

Hello, I am trying to download the Kraken2 database, but I get an error. Please help me solve this issue.

Error message:

(base) nucleome@nucleome-mathavan:~/Downloads/apps/kraken/kraken2$ ./kraken-build --standard --db test
Found jellyfish v1.1.12
--2021-12-20 17:05:06-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
=> ‘nucl_gb.accession2taxid.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.228, 130.14.250.12, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.228|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy/accession2taxid ... done.
==> SIZE nucl_gb.accession2taxid.gz ... 2140313799
==> PASV ... done. ==> RETR nucl_gb.accession2taxid.gz ... done.
Length: 2140313799 (2.0G) (unauthoritative)

nucl_gb.accession2taxid.gz 100%[================================================================================================>] 2.02G 1.60MB/s in 22m 2s

2021-12-20 17:27:13 (1.56 MB/s) - ‘nucl_gb.accession2taxid.gz’ saved [2167559647]

--2021-12-20 17:27:13-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_wgs.accession2taxid.gz
=> ‘nucl_wgs.accession2taxid.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.7, 165.112.9.230, 2607:f220:41e:250::12, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.7|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy/accession2taxid ... done.
==> SIZE nucl_wgs.accession2taxid.gz ... 3819279313
==> PASV ... done. ==> RETR nucl_wgs.accession2taxid.gz ... done.
Length: 3819279313 (3.6G) (unauthoritative)

nucl_wgs.accession2taxid.gz 100%[================================================================================================>] 3.60G 1.42MB/s in 40m 5s

2021-12-20 18:07:21 (1.53 MB/s) - ‘nucl_wgs.accession2taxid.gz’ saved [3869600249]

Downloaded accession to taxon map(s)
--2021-12-20 18:07:21-- ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
=> ‘taxdump.tar.gz’
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.10, 130.14.250.12, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.10|:21... connected.
Logging in as anonymous ... Logged in!
==> SYST ... done. ==> PWD ... done.
==> TYPE I ... done. ==> CWD (1) /pub/taxonomy ... done.
==> SIZE taxdump.tar.gz ... 56919250
==> PASV ... done. ==> RETR taxdump.tar.gz ... done.
Length: 56919250 (54M) (unauthoritative)

taxdump.tar.gz 100%[================================================================================================>] 54.28M 1.40MB/s in 35s

2021-12-20 18:08:00 (1.54 MB/s) - ‘taxdump.tar.gz’ saved [56919250]

Downloaded taxonomy tree data
Uncompressing taxonomy data...
gzip: nucl_gb.accession2taxid.gz: invalid compressed data--format violated

mattheatley commented 2 years ago

I'm having the exact same issue! I found that downloading those files separately via wget (e.g. wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz) and running gunzip manually did the trick.
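The manual workaround can be made a little safer by testing the archive before decompressing it. This is a minimal sketch, assuming `wget` and `gzip` are on the PATH; `gz_is_valid` is a hypothetical helper name and not part of Kraken2.

```shell
# Hypothetical helper (not part of Kraken2).
# Returns 0 if the gzip archive decompresses cleanly, non-zero otherwise.
gz_is_valid() {
    # gzip -t reads the whole stream and checks its integrity
    # without writing any output file.
    gzip -t "$1" 2>/dev/null
}

# Example use (needs network access, so it is left commented out):
# wget ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz
# gz_is_valid nucl_gb.accession2taxid.gz && gunzip nucl_gb.accession2taxid.gz
```

If the check fails, the file can be deleted and downloaded again instead of being handed to gunzip.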

cmsolari commented 2 years ago

Hi! I'm also having that issue. I tried to start building my custom database using:

kraken2-build --download-taxonomy --use-ftp --db kraken2DB2

And I got the same error:

gzip: nucl_gb.accession2taxid.gz: invalid compressed data--format violated

I've also tried downloading the files via wget, and when I gunzip them manually I get the same error message. Is there any other advice for this issue?

Thanks.
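When a manual download also fails to decompress, comparing checksums can show whether the file was damaged in transit. This is a minimal sketch under two assumptions: that the server publishes a companion checksum file next to the archive (for example a `.md5` file whose first field is the expected hash), and that GNU `md5sum` is installed. `md5_matches` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Kraken2).
# $1 = downloaded file
# $2 = checksum file whose first whitespace-separated field is the expected MD5
md5_matches() {
    expected=$(awk '{print $1; exit}' "$2")
    actual=$(md5sum "$1" | awk '{print $1}')
    # Fail if the checksum file was empty or the hashes differ.
    [ -n "$expected" ] && [ "$expected" = "$actual" ]
}

# Example use (assumes both files were downloaded into the current directory):
# md5_matches nucl_gb.accession2taxid.gz nucl_gb.accession2taxid.gz.md5 \
#     || echo "checksum mismatch, download again"
```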

rsdmse commented 2 years ago

Same issue here as we are trying to download the bacteria library. (Note that if I drop --use-ftp, I get the rsync timeout error reported in other issues.) Despite the gzip errors, the run still finished with "library complete". Can we trust this or should we download again?

$ kraken2-build --download-library bacteria --db /project/apps_data/kraken2 --use-ftp
Step 1/2: Performing ftp file transfer of requested files
Step 2/2: Assigning taxonomic IDs to sequences

gzip: all/GCF_011463755.1_ASM1146375v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_013371745.1_ASM1337174v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_900636895.1_44927_E01_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_000968115.1_ASM96811v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_016925355.1_ASM1692535v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_004295645.1_ASM429564v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_004798765.1_ASM479876v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_015291645.1_ASM1529164v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_002811325.3_ASM281132v3_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_001693615.1_ASM169361v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_015910185.1_ASM1591018v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_900636795.1_44087_H02_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_019285175.1_ASM1928517v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_023735015.1_ASM2373501v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_002250965.2_ASM225096v2_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_011405515.1_ASM1140551v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_000145845.2_ASM14584v2_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_019334365.1_ASM1933436v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_006384875.1_ASM638487v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_016495865.1_ASM1649586v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_003790505.1_ASM379050v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_002074075.1_ASM207407v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_002073795.1_ASM207379v2_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_022700895.1_ASM2270089v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_024532055.1_ASM2453205v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_000625335.2_ASM62533v2_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_022318485.1_ASM2231848v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_900167985.1_IMG-taxon_2667527229_annotated_assembly_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_001605135.1_ASM160513v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_014854635.1_ASM1485463v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_009650195.1_ASM965019v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_017094605.1_ASM1709460v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_024206735.1_ASM2420673v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_023895995.1_ASM2389599v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_000160775.2_ASM16077v2_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_003432365.1_ASM343236v1_genomic.fna.gz: invalid compressed data--format violated

gzip: all/GCF_016811995.1_ASM1681199v1_genomic.fna.gz: invalid compressed data--format violated
All files processed, cleaning up extra sequence files... done, library complete.
Masking low-complexity regions of downloaded library...
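
One way to see which downloads were affected is to test every archive independently of the build. This is a minimal sketch, assuming the downloaded `.gz` files are still on disk in some directory (the log above reports a cleanup step, so this may only apply to a copy kept manually). `list_corrupt_gz` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of Kraken2).
# Prints the path of every *.gz file under directory $1 that fails gzip -t.
list_corrupt_gz() {
    find "$1" -type f -name '*.gz' -exec sh -c '
        for f in "$@"; do
            # gzip -t exits non-zero for truncated or corrupt archives.
            gzip -t "$f" 2>/dev/null || printf "%s\n" "$f"
        done
    ' sh {} +
}

# Example use (the directory name is a placeholder):
# list_corrupt_gz /path/to/downloaded/genomes > corrupt_files.txt
```

An empty result means every archive passed the integrity test; any listed file is a candidate for re-download.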

mbhall88 commented 1 year ago

Still seeing this issue with the tip of master (https://github.com/DerrickWood/kraken2/commit/84e851b40f65b12cc7ba35aaed46b094b21beaf4)

Downloading nucleotide gb accession to taxon map... done.
Downloading nucleotide wgs accession to taxon map... done.
Downloaded accession to taxon map(s)
Downloading taxonomy tree data... done.
Uncompressing taxonomy data...
gzip: nucl_gb.accession2taxid.gz: invalid compressed data--format violated

Seems to be completely non-deterministic though...

I am building multiple databases with different parameters. Some pass, some fail with this error.

For example, these two commands work

$ kraken2-build --standard --kmer-len 27 --minimizer-len 18 --use-ftp --max-db-size 10000000000 --minimizer-spaces 4 --threads 16 --db db

$ kraken2-build --standard --kmer-len 21 --minimizer-len 14 --use-ftp  --minimizer-spaces 3 --threads 16 --db db

But these two fail

$ kraken2-build --standard --kmer-len 35 --minimizer-len 31 --use-ftp --max-db-size 10000000000 --minimizer-spaces 7 --threads 16 --db db

$ kraken2-build --standard --kmer-len 21 --minimizer-len 14 --use-ftp --max-db-size 10000000000 --minimizer-spaces 3 --threads 16 --db db

There doesn't seem to be a pattern in the parameters used...

(Note: in all of these cases the --db value is actually a long, unique directory; I just shortened it here for brevity.)
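Because the failure appears intermittent, one workaround consistent with the reports in this thread is to repeat a download until the archive passes an integrity test. This is a minimal sketch; `download_with_retry` is a hypothetical helper, and the downloader is passed in as a command so nothing about Kraken2's internals is assumed.

```shell
# Hypothetical helper (not part of Kraken2).
# Usage: download_with_retry FILE MAX_TRIES CMD [ARGS...]
# Runs CMD up to MAX_TRIES times until FILE passes gzip -t.
download_with_retry() {
    file=$1
    tries=$2
    shift 2
    n=1
    while [ "$n" -le "$tries" ]; do
        # Accept the file only if the download succeeds and the archive is intact.
        if "$@" && gzip -t "$file" 2>/dev/null; then
            return 0
        fi
        rm -f "$file"   # discard the corrupt or partial file before retrying
        n=$((n + 1))
    done
    return 1
}

# Example use (needs network access, so it is left commented out):
# download_with_retry taxdump.tar.gz 3 \
#     wget -q ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
```

Deleting the file between attempts is a deliberate choice here: it avoids resuming on top of a download that may already contain bad bytes.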