Here is an explanation of an error that we sometimes get when using ZipInputStream:
java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
ZIP structure
So ZIPs contain 2 parts:
Entries - each entry is the content of the file and some related info
Central directory - the meta information about all the files in the archive
ZIPs are usually read from the end of the file, where the Central directory is located. In Java this is what ZipFile does. Alternatively, we can read the file from the top with ZipInputStream and start from the entries right away. This lets us inspect the archive (e.g. enforce size constraints) while it is still arriving, without having to store it all first.
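To make the "inspect without storing" idea concrete, here is a minimal sketch of a size check done while streaming. The class name, the helper names and the limit are made up for illustration; only ZipInputStream and ZipOutputStream come from the JDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class ZipStreamInspector {

    // Hypothetical helper: sums the uncompressed bytes of all entries while streaming
    // and aborts as soon as the running total exceeds maxBytes.
    static long uncompressedSize(InputStream source, long maxBytes) throws IOException {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(source)) {
            while (zip.getNextEntry() != null) {
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    total += read;
                    if (total > maxBytes) {
                        throw new IOException("Archive exceeds " + maxBytes + " uncompressed bytes");
                    }
                }
            }
        }
        return total;
    }

    // Builds a small in-memory archive so that the example is self-contained.
    static byte[] sampleArchive() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        InputStream source = new ByteArrayInputStream(sampleArchive());
        System.out.println(uncompressedSize(source, 1024));
    }
}
```

Nothing is written to disk here: the check runs on the bytes as they arrive, which is exactly what ZipFile cannot do.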
ZIP Entry structure & the problem
However, when reading from the top we don't always have all the information about an entry. E.g. ZIP allows the size info to be stored either at the beginning or at the end of the entry. So an entry can look like this:
Entry:
...
Flags=00000000
Compressed size=N
Uncompressed size=M
...
Content. Could be compressed (compression=DEFLATE) or stored as is (compression=STORE)
Or like this:
Entry:
...
Flags=00001000
Compressed size=0
Uncompressed size=0
...
Content. Could be compressed (compression=DEFLATE) or stored as is (compression=STORE)
Data descriptor: here go Compressed size=N & Uncompressed size=M
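The Flags field is what tells the two layouts apart: bit 3 (00001000) means "the sizes follow in a Data descriptor". Below is a small sketch that reads this bit from the first entry of an archive. The class and method names are hypothetical, and it assumes that the first entry's header starts at offset 0, which holds for archives written by the JDK's ZipOutputStream.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class LocalHeaderFlags {

    // The flags are a 2-byte little-endian field at offset 6 of the entry header.
    static final int FLAGS_OFFSET = 6;
    static final int DATA_DESCRIPTOR_BIT = 0x08; // bit 3

    // Assumption: the first entry header starts at offset 0 of the archive.
    static boolean firstEntryUsesDataDescriptor(byte[] zip) {
        boolean hasSignature = zip.length > FLAGS_OFFSET + 1
                && zip[0] == 'P' && zip[1] == 'K' && zip[2] == 3 && zip[3] == 4;
        if (!hasSignature) {
            throw new IllegalArgumentException("No entry header at offset 0");
        }
        int flags = (zip[FLAGS_OFFSET] & 0xFF) | ((zip[FLAGS_OFFSET + 1] & 0xFF) << 8);
        return (flags & DATA_DESCRIPTOR_BIT) != 0;
    }

    // Writes a one-entry archive. For STORE the sizes and CRC must be set up front,
    // for DEFLATE we leave them unset so the writer does not know them in advance.
    static byte[] archive(boolean stored) throws IOException {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        ZipEntry entry = new ZipEntry("hello.txt");
        if (stored) {
            CRC32 crc = new CRC32();
            crc.update(content);
            entry.setMethod(ZipEntry.STORED);
            entry.setSize(content.length);
            entry.setCompressedSize(content.length);
            entry.setCrc(crc.getValue());
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("DEFLATE, sizes not set: " + firstEntryUsesDataDescriptor(archive(false)));
        System.out.println("STORE, sizes set: " + firstEntryUsesDataDescriptor(archive(true)));
    }
}
```

With the JDK's ZipOutputStream the DEFLATE entry comes out with the bit set and the STORE entry without it, so the two archives correspond to the two layouts shown above.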
Now, suppose that we have the 2nd variation (sizes at the bottom, in the Data descriptor). How do we know where the Content ends and where the Data descriptor starts? It's not always possible to figure it out:
If compression=DEFLATE, the compressed data is self-terminating: its structure tells the reader where it ends
But if compression=STORE, the content is just raw bytes with no end marker, so a sequential reader cannot reliably tell where it ends!
In the latter situation ZipInputStream gives up with the error:
java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
However, if we read the file from the end (Central directory), the Compressed size is stored there too. So we have no problem knowing where the Content ends and where the Data descriptor starts.
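The difference between the two readers can be reproduced with a hand-patched archive. This is only a sketch under a loud assumption: instead of producing a real STORE entry with a Data descriptor, we write a normal STORE entry and then set bit 3 in its header flags by hand. There is no actual descriptor in the file, but the flag alone is enough to make ZipInputStream refuse the entry. All class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipException;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

final class StoredDescriptorDemo {

    // Writes one STORE entry, then flips bit 3 of the flags in the entry header.
    // Assumption: the header of the first entry starts at offset 0 (flags at offset 6).
    static byte[] storedArchiveWithDescriptorFlag() throws IOException {
        byte[] content = "hello".getBytes(StandardCharsets.UTF_8);
        CRC32 crc = new CRC32();
        crc.update(content);
        ZipEntry entry = new ZipEntry("hello.txt");
        entry.setMethod(ZipEntry.STORED);
        entry.setSize(content.length);
        entry.setCompressedSize(content.length);
        entry.setCrc(crc.getValue());
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(entry);
            zip.write(content);
            zip.closeEntry();
        }
        byte[] zip = bytes.toByteArray();
        zip[6] |= 0x08; // only the entry header is patched, the Central directory is untouched
        return zip;
    }

    // Reads from the top: the entry header is all the reader has.
    static String readWithZipInputStream(byte[] zip) throws IOException {
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zip))) {
            in.getNextEntry();
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // Reads from the end: ZipFile needs a real file, so we go through a temp file.
    static String readWithZipFile(byte[] zip) throws IOException {
        Path tmp = Files.createTempFile("stored-descriptor", ".zip");
        try {
            Files.write(tmp, zip);
            try (ZipFile file = new ZipFile(tmp.toFile());
                 InputStream in = file.getInputStream(file.getEntry("hello.txt"))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] zip = storedArchiveWithDescriptorFlag();
        try {
            readWithZipInputStream(zip);
        } catch (ZipException e) {
            System.out.println("ZipInputStream: " + e.getMessage());
        }
        System.out.println("ZipFile: " + readWithZipFile(zip));
    }
}
```

The same bytes fail when read from the top and succeed when read via the Central directory, because ZipFile takes the sizes from there.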
Use cases
So ZIP can be used in 2 scenarios:
Compressing files. The archiver knows everything about the files up front, so it can store the sizes at the beginning of the entries
Streaming. We read data from the network or some other sequential source, compress it on the fly and write it into the ZIP. We don't yet know the size of the data while we are compressing it, so we put the sizes after the file content, in the Data descriptor.
To make matters more confusing, some archiving tools put the sizes in Data descriptors even if they compress files that they know everything about.
So if we control both the archiving and the unarchiving side, we can control the format and choose streaming if needed. But if the source of the ZIP is unknown (e.g. a user uploads it), then using ZipFile is the only option that guarantees successful unzipping.
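For the unknown-source case, one possible approach is to spool the upload into a temporary file first and then open it with ZipFile. The sketch below only lists the entry names; the class and method names are hypothetical, and a real upload handler would probably also want to cap the number of bytes it copies.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

final class UploadUnzipper {

    // Hypothetical helper: stores the upload in a temp file so that ZipFile
    // can read the Central directory at the end of the archive.
    static List<String> entryNames(InputStream upload) throws IOException {
        Path tmp = Files.createTempFile("upload", ".zip");
        try {
            Files.copy(upload, tmp, StandardCopyOption.REPLACE_EXISTING);
            try (ZipFile zip = new ZipFile(tmp.toFile())) {
                List<String> names = new ArrayList<>();
                Enumeration<? extends ZipEntry> entries = zip.entries();
                while (entries.hasMoreElements()) {
                    names.add(entries.nextElement().getName());
                }
                return names;
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    // Stand-in for a user upload so that the example is self-contained.
    static byte[] sampleUpload() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry("hello.txt"));
            zip.write("hello".getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(entryNames(new ByteArrayInputStream(sampleUpload())));
    }
}
```

The price is that the whole archive has to be stored before we can look inside it, which is the trade-off described above.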