AstrobioMike / bit

Bioinformatics Tools
GNU General Public License v3.0
gzip downloads are malformatted #6

Closed bkille closed 3 years ago

bkille commented 3 years ago

I have a file containing accession IDs named mouse_acc.txt.

>$ bit-dl-ncbi-assemblies -w mouse_acc.txt -f fasta -j 10

    Targeting 16 genomes in fasta format.

              DONE!

>$ ls -lth
total 12G
-rw-r--r-- 1 blk6 tgen 709M Apr 15 15:00 GCA_001632615.1.fa.gz
-rw-r--r-- 1 blk6 tgen 716M Apr 15 15:00 GCA_001632555.1.fa.gz
-rw-r--r-- 1 blk6 tgen 715M Apr 15 15:00 GCA_001632525.1.fa.gz
-rw-r--r-- 1 blk6 tgen 714M Apr 15 15:00 GCA_001632575.1.fa.gz
-rw-r--r-- 1 blk6 tgen 711M Apr 15 15:00 GCA_001624775.1.fa.gz
-rw-r--r-- 1 blk6 tgen 700M Apr 15 15:00 GCA_001624835.1.fa.gz
-rw-r--r-- 1 blk6 tgen 708M Apr 15 15:00 GCA_001624535.1.fa.gz
-rw-r--r-- 1 blk6 tgen 716M Apr 15 15:00 GCA_001624745.1.fa.gz
-rw-r--r-- 1 blk6 tgen 720M Apr 15 15:00 GCA_001624475.1.fa.gz
-rw-r--r-- 1 blk6 tgen 796M Apr 15 15:00 GCA_000001635.8.fa.gz
-rw-r--r-- 1 blk6 tgen 722M Apr 15 15:00 GCA_001624295.1.fa.gz
-rw-r--r-- 1 blk6 tgen 709M Apr 15 15:00 GCA_001624505.1.fa.gz
-rw-r--r-- 1 blk6 tgen 715M Apr 15 15:00 GCA_001624185.1.fa.gz
-rw-r--r-- 1 blk6 tgen 718M Apr 15 15:00 GCA_001624215.1.fa.gz
-rw-r--r-- 1 blk6 tgen 714M Apr 15 15:00 GCA_001624675.1.fa.gz
-rw-r--r-- 1 blk6 tgen 705M Apr 15 15:00 GCA_001624445.1.fa.gz
-rw-r--r-- 1 blk6 tgen 361M Apr 15 14:55 ncbi_assembly_info.tsv
-rw-r--r-- 1 blk6 tgen  256 Apr 15 14:49 mouse_acc.txt

>$ gunzip *.gz

gzip: GCA_000001635.8.fa.gz: invalid compressed data--format violated

Note that I get this "invalid compressed data--format violated" error for all downloaded .gz files. I've also tried running with the accessions and commands from the docs, with the same result.

Am I doing something wrong, or is this a classic case of NCBI changing things up and breaking people's code? :slightly_smiling_face:

AstrobioMike commented 3 years ago

Hey there, @bkille :)

I can't seem to recreate the trouble. I wonder if the downloads are failing to finish properly for some reason. I wrote this a while ago, when I was even worse at coding than I am now, ha, so the program doesn't properly check that they downloaded without error. Right now it just checks that there's something in the downloaded file, so as currently written it wouldn't tell us if a download started and then failed.
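For illustration, a stricter post-download check could look roughly like the sketch below. It is only a sketch and not how bit currently works: the function name `gzip_is_intact` and its `chunk_size` parameter are hypothetical. It reads the whole gzip stream so that a truncated or corrupted file is caught, instead of only checking that the file is non-empty.

```python
import gzip
import os
import zlib


def gzip_is_intact(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses cleanly to the end.

    Hypothetical helper, not part of bit.
    """
    # An empty file is what a "something is there" check would catch.
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as handle:
            # Read to the end so the trailing CRC32 and length get verified.
            while handle.read(chunk_size):
                pass
    except (OSError, EOFError, zlib.error):
        # OSError covers bad headers and CRC mismatches, EOFError covers
        # truncated streams, zlib.error covers corrupt compressed data.
        return False
    return True
```

On the command line, `gzip -t GCA_006538345.1.fa.gz` performs a comparable integrity test without writing the decompressed output.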

Can you try with just this accession in the target accessions file and let me know some things (it's a smaller one so it'll be quicker for testing):

GCA_006538345.1

curl -o GCA_006538345.1.fa.gz ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/006/538/345/GCA_006538345.1_ASM653834v1/GCA_006538345.1_ASM653834v1_genomic.fna.gz

bkille commented 3 years ago

Yup, same error w/ just that file...

bkille commented 3 years ago

Seems like something specific to our server... :thinking: I was able to get the commands to work on a separate device. Still not sure what's going on though lol. Sorry for the false alarm :slightly_smiling_face: