fhs / ZipFile.jl

Read/Write ZIP archives in Julia

`ERROR: invalid file header` with NCBI genomic data, file works in python. #89

Open pcjentsch opened 2 years ago

pcjentsch commented 2 years ago

This archive opens fine with Python's zipfile module.

It is 39 GB, so I cannot easily include a working example, but if you are inclined to try it:

conda create -n ncbi_datasets
conda activate ncbi_datasets
conda install -c conda-forge ncbi-datasets-cli
datasets download virus genome taxon sars-cov-2 --host human

The output from infolist() on the opened archive in Python is

zipf.infolist()
[<ZipInfo filename='README.md' compress_type=deflate filemode='?rw-------' file_size=1604 compress_size=769>,
 <ZipInfo filename='ncbi_dataset/data/data_report.jsonl' compress_type=deflate filemode='?rw-------' file_size=81889507642 compress_size=4292597995>,
 <ZipInfo filename='ncbi_dataset/data/biosample.jsonl' compress_type=deflate filemode='?rw-------' file_size=7826671566 compress_size=205379661>,
 <ZipInfo filename='ncbi_dataset/data/cds.fna' compress_type=deflate filemode='?rw-------' file_size=177621771946 compress_size=11180195822>,
 <ZipInfo filename='ncbi_dataset/data/genomic.fna' compress_type=deflate filemode='?rw-------' file_size=167811715365 compress_size=13523743233>,
 <ZipInfo filename='ncbi_dataset/data/protein.faa' compress_type=deflate filemode='?rw-------' file_size=82837067420 compress_size=3110887927>,
 <ZipInfo filename='ncbi_dataset/data/virus_dataset.md' compress_type=deflate filemode='?rw-------' file_size=2431 compress_size=1057>,
 <ZipInfo filename='ncbi_dataset/data/dataset_catalog.json' compress_type=deflate filemode='?rw-------' file_size=845 compress_size=321>]

if that is helpful.
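
For anyone reproducing this, here is a minimal sketch of how a listing like the one above can be collected with Python's standard zipfile module. The helper name list_entries is hypothetical, and the archive path is left to the caller because the issue does not state the name of the downloaded file.

```python
import zipfile


def list_entries(path):
    """Return (filename, file_size, compress_size) for every member of the archive.

    Only the central directory is read, so this stays cheap even for
    multi-gigabyte archives.
    """
    with zipfile.ZipFile(path) as zipf:
        return [
            (info.filename, info.file_size, info.compress_size)
            for info in zipf.infolist()
        ]
```

Calling list_entries on the downloaded archive should return one tuple per member, with the same sizes that infolist() prints.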

fhs commented 2 years ago

You may want to try version 0.10.0, which I just released. It adds support for reading zip64 files.
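
For context, the classic ZIP format stores sizes in 32-bit fields, and the value 0xFFFFFFFF is reserved as a marker meaning "look in the zip64 extra field". The sketch below checks the sizes reported earlier in this issue against that limit. It is an illustration in Python, not ZipFile.jl code: the names ZIP32_MAX, ENTRIES, and needs_zip64 are made up here, and that oversized entries were what triggered the `invalid file header` error is an assumption based on the reply above.

```python
# Largest value a classic (non-zip64) 32-bit ZIP size field can hold.
# The value itself is reserved as the "see zip64 extra field" marker.
ZIP32_MAX = 0xFFFFFFFF

# (file_size, compress_size) pairs copied from the infolist() output in this issue.
ENTRIES = {
    "ncbi_dataset/data/data_report.jsonl": (81889507642, 4292597995),
    "ncbi_dataset/data/biosample.jsonl": (7826671566, 205379661),
    "ncbi_dataset/data/cds.fna": (177621771946, 11180195822),
    "ncbi_dataset/data/genomic.fna": (167811715365, 13523743233),
    "ncbi_dataset/data/protein.faa": (82837067420, 3110887927),
    "ncbi_dataset/data/virus_dataset.md": (2431, 1057),
}


def needs_zip64(file_size, compress_size):
    """True if either size cannot be stored in a classic 32-bit ZIP field."""
    return file_size >= ZIP32_MAX or compress_size >= ZIP32_MAX


# Names of the entries whose sizes only fit in zip64 fields.
oversized = [name for name, sizes in ENTRIES.items() if needs_zip64(*sizes)]
```

With the sizes above, every entry except virus_dataset.md ends up in oversized, so a reader without zip64 support cannot represent most of this archive's members.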