ctapobep / blog

My personal blog on IT topics in the form of GitHub issues.

Why we can't always unpack ZIP files with ZipInputStream #22

Open ctapobep opened 2 months ago


Here is an explanation of an error that we sometimes get when using ZipInputStream:

java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor

ZIP structure

So ZIPs contain 2 parts:

  1. Entries - each entry is the content of the file and some related info
  2. Central directory - the meta information about all the files in the archive

Usually, ZIPs are read from the end of the file, where the Central directory is located. In Java this is what ZipFile does, and it needs the whole archive available as a file. We can, on the other hand, read the archive from the top: we start from the entries right away using ZipInputStream. This allows us to inspect the archive (e.g. check the size constraints) while it is still arriving, without having to store it all first.
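To make the "inspect without storing" idea concrete, here is a minimal sketch that counts the unpacked bytes of an archive while reading it from the top. The class and method names (StreamingZipInspector, totalUncompressedSize, sampleZip) are made up for illustration, and the in-memory archive only stands in for a network stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class StreamingZipInspector {
    /** Reads the archive top-down and fails as soon as it unpacks to more than maxBytes. */
    static long totalUncompressedSize(InputStream source, long maxBytes) {
        long total = 0;
        byte[] buffer = new byte[8192];
        try (ZipInputStream zip = new ZipInputStream(source)) {
            while (zip.getNextEntry() != null) {
                int read;
                while ((read = zip.read(buffer)) != -1) {
                    total += read;
                    if (total > maxBytes) {
                        throw new IllegalStateException("archive unpacks to more than " + maxBytes + " bytes");
                    }
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Builds a one-entry archive in memory (hypothetical helper, stands in for an upload). */
    static byte[] sampleZip(String name, String content) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            zip.putNextEntry(new ZipEntry(name));
            zip.write(content.getBytes(StandardCharsets.UTF_8));
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = sampleZip("hello.txt", "hello zip");
        // "hello zip" is 9 bytes long, so this prints 9
        System.out.println(totalUncompressedSize(new ByteArrayInputStream(archive), 1024));
    }
}
```

Nothing is written to disk here: the limit is enforced while the bytes are still flowing, which is the main attraction of ZipInputStream.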

ZIP Entry structure & the problem

However, when reading from the top, we don't always have all the information about the entries. E.g. ZIP allows the size info to be stored either at the beginning or at the end of the entry. So an entry can look like this:

Entry:
  ...
  Flags=00000000
  Compressed size=N
  Uncompressed size=M
  ...
  Content. Could be compressed (compression=DEFLATE) or stored as is (compression=STORE)

Or like this:

Entry:
  ...
  Flags=00001000
  Compressed size=0
  Uncompressed size=0
  ...
  Content. Could be compressed (compression=DEFLATE) or stored as is (compression=STORE)
  Data descriptor: here goes Compressed size=N & Uncompressed size=M
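Both variations can be observed in archives produced by Java itself. The sketch below builds two one-entry archives with ZipOutputStream and decodes the first entry header by hand. The class LocalHeaderPeek and its helpers are hypothetical names; the field offsets (flags at 6, method at 8, sizes at 18 and 22) are the ones from the ZIP local file header layout. It relies on ZipOutputStream setting the data-descriptor flag for DEFLATED entries whose sizes were not given up front.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class LocalHeaderPeek {
    /** The header fields discussed above, taken from the first entry of an archive. */
    static final class Header {
        final int flags;
        final int method;
        final long compressedSize;
        final long uncompressedSize;

        Header(int flags, int method, long compressedSize, long uncompressedSize) {
            this.flags = flags;
            this.method = method;
            this.compressedSize = compressedSize;
            this.uncompressedSize = uncompressedSize;
        }

        /** Bit 3 of the flags means "sizes follow the content in a Data descriptor". */
        boolean hasDataDescriptor() {
            return (flags & 0x08) != 0;
        }
    }

    static Header firstHeader(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        if (buf.getInt(0) != 0x04034b50) {
            throw new IllegalArgumentException("does not start with a local file header");
        }
        return new Header(buf.getShort(6) & 0xFFFF, buf.getShort(8) & 0xFFFF,
                buf.getInt(18) & 0xFFFFFFFFL, buf.getInt(22) & 0xFFFFFFFFL);
    }

    /** stored=true: sizes known up front (1st variation); stored=false: default DEFLATE (2nd variation). */
    static byte[] zipOf(byte[] data, boolean stored) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            ZipEntry entry = new ZipEntry("data.txt");
            if (stored) {
                CRC32 crc = new CRC32();
                crc.update(data);
                entry.setMethod(ZipEntry.STORED);
                entry.setSize(data.length);
                entry.setCompressedSize(data.length);
                entry.setCrc(crc.getValue());
            }
            zip.putNextEntry(entry);
            zip.write(data);
            zip.closeEntry();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "hello zip".getBytes(StandardCharsets.UTF_8);
        for (boolean stored : new boolean[] {true, false}) {
            Header h = firstHeader(zipOf(data, stored));
            System.out.println("descriptor=" + h.hasDataDescriptor() + " method=" + h.method
                    + " compressed=" + h.compressedSize + " uncompressed=" + h.uncompressedSize);
        }
    }
}
```

The STORED archive should report both sizes in the header and no descriptor flag, while the default DEFLATED one should report the flag and zero sizes. Other flag bits (for example the UTF-8 name bit) may be set too, which is why the sketch tests bit 3 instead of comparing the whole flags value.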

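Now, suppose that we have the 2nd variation (sizes at the bottom, in the Data descriptor). How do we know where the Content ends and where the Data descriptor starts? It's not always possible to figure it out:

  1. compression=DEFLATE: the compressed stream marks its own end, so the reader knows where the Content stops and the Data descriptor begins
  2. compression=STORE: the raw bytes have no end marker and the size in the header is 0, so the reader can't tell where the Content stops

In the last situation we get an error: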

java.util.zip.ZipException: only DEFLATED entries can have EXT descriptor
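The error can be reproduced without any archiving tool. The sketch below hand-writes a single entry header with the data-descriptor flag set and compression=STORE, then gives it to ZipInputStream. StoredWithDescriptorDemo and its methods are illustrative names, and the behaviour is what I would expect from the JDK versions that produce the message above; other versions may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipException;
import java.util.zip.ZipInputStream;

class StoredWithDescriptorDemo {
    /** Entry header for a STORED entry whose sizes are deferred to the Data descriptor. */
    static byte[] localHeader(String name) {
        byte[] nameBytes = name.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(30 + nameBytes.length).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(0x04034b50);                  // local file header signature
        buf.putShort((short) 20);                // version needed to extract
        buf.putShort((short) 0x0008);            // flags: bit 3 = sizes live in the Data descriptor
        buf.putShort((short) 0);                 // compression method 0 = STORE
        buf.putShort((short) 0);                 // modification time
        buf.putShort((short) 0x21);              // modification date (1980-01-01)
        buf.putInt(0);                           // CRC-32, unknown yet
        buf.putInt(0);                           // compressed size, unknown yet
        buf.putInt(0);                           // uncompressed size, unknown yet
        buf.putShort((short) nameBytes.length);  // file name length
        buf.putShort((short) 0);                 // extra field length
        buf.put(nameBytes);
        return buf.array();
    }

    /** Returns the ZipException message, or null if the entry header was accepted. */
    static String tryReadFirstEntry(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            zip.getNextEntry();
            return null;
        } catch (ZipException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Expected to print the ZipException message quoted above
        System.out.println(tryReadFirstEntry(localHeader("data.txt")));
    }
}
```

Note that the header alone is enough: ZipInputStream gives up as soon as it sees the flag and method combination, before looking at any content.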

However, if we read the file from the end (Central directory), the Compressed size is stored there too. So we won't have problems knowing where the Content ends and where Data descriptor starts.
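As a rough check of this claim, the sketch below hand-builds a complete one-entry archive (entry header, STORED content, Data descriptor, Central directory, end record) and reads it both ways. All names (CentralDirectoryDemo and its methods) are invented for the example, the byte layout follows the ZIP format as I understand it, and readAllBytes assumes Java 9 or newer.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;
import java.util.zip.ZipFile;
import java.util.zip.ZipInputStream;

class CentralDirectoryDemo {
    /** One STORED entry with sizes only in the Data descriptor and in the Central directory. */
    static byte[] storedWithDescriptor(String name, byte[] data) {
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        CRC32 crc32 = new CRC32();
        crc32.update(data);
        int crc = (int) crc32.getValue();
        int cenOffset = 30 + n.length + data.length + 16; // header + name + content + descriptor
        int cenSize = 46 + n.length;
        ByteBuffer b = ByteBuffer.allocate(cenOffset + cenSize + 22).order(ByteOrder.LITTLE_ENDIAN);
        // Entry header: flag bit 3 set, method STORE, CRC and sizes left as 0
        b.putInt(0x04034b50).putShort((short) 20).putShort((short) 8).putShort((short) 0);
        b.putShort((short) 0).putShort((short) 0x21);     // time, date (1980-01-01)
        b.putInt(0).putInt(0).putInt(0);
        b.putShort((short) n.length).putShort((short) 0).put(n);
        b.put(data);                                      // Content, stored as is
        // Data descriptor: signature, CRC, compressed size, uncompressed size
        b.putInt(0x08074b50).putInt(crc).putInt(data.length).putInt(data.length);
        // Central directory record: same info as the header, but with the real sizes
        b.putInt(0x02014b50).putShort((short) 20).putShort((short) 20);
        b.putShort((short) 8).putShort((short) 0);        // flags, method
        b.putShort((short) 0).putShort((short) 0x21);     // time, date
        b.putInt(crc).putInt(data.length).putInt(data.length);
        b.putShort((short) n.length).putShort((short) 0).putShort((short) 0); // name, extra, comment lengths
        b.putShort((short) 0).putShort((short) 0).putInt(0);                  // disk, internal and external attrs
        b.putInt(0).put(n);                               // offset of the entry header, then the name
        // End of central directory record
        b.putInt(0x06054b50).putShort((short) 0).putShort((short) 0);
        b.putShort((short) 1).putShort((short) 1);        // entries on this disk, entries in total
        b.putInt(cenSize).putInt(cenOffset).putShort((short) 0);
        return b.array();
    }

    /** Reads the entry through the Central directory, which requires a real file. */
    static String readWithZipFile(byte[] zipBytes, String name) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("stored-dd", ".zip");
            Files.write(tmp, zipBytes);
            try (ZipFile zip = new ZipFile(tmp.toFile());
                 InputStream in = zip.getInputStream(zip.getEntry(name))) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                tmp.toFile().delete();
            }
        }
    }

    /** Tries to read the same bytes from the top. */
    static boolean readableWithZipInputStream(byte[] zipBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            return zip.getNextEntry() != null;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] zip = storedWithDescriptor("data.txt", "hello zip".getBytes(StandardCharsets.UTF_8));
        System.out.println("ZipFile read: " + readWithZipFile(zip, "data.txt"));
        System.out.println("ZipInputStream can read: " + readableWithZipInputStream(zip));
    }
}
```

The price of the ZipFile route is visible in readWithZipFile: the bytes have to land in a file before anything can be read.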

Use cases

So ZIP can be used in 2 scenarios:

  1. Compressing files. The archiver knows everything about the files in advance, so it can store the sizes at the beginning of the entries
  2. Streaming. We read data from the network or some other serial source, compress it on the fly and put it into the ZIP. We don't yet know the size of the data that we compress, so we put the sizes after the file content, in the Data descriptor.
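The streaming scenario can be sketched with ZipOutputStream, which is never told how long the source is. The code below compresses an InputStream of unknown length and then locates the Data descriptor that was appended after the content. DataDescriptorPeek and its methods are hypothetical names, and the lookup assumes a single-entry archive without an archive comment and without ZIP64 records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

class DataDescriptorPeek {
    /** Compresses a stream of unknown length into a one-entry archive. */
    static byte[] zipFromStream(String name, InputStream source) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(name)); // the sizes are not known at this point
            int read;
            while ((read = source.read(buffer)) != -1) {
                zip.write(buffer, 0, read);
            }
            zip.closeEntry();                     // the sizes are known only now
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns {signature, compressed size, uncompressed size} of the Data descriptor. */
    static long[] dataDescriptor(byte[] zip) {
        ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
        int endRecord = zip.length - 22;                   // end record is last when there is no comment
        int centralDirOffset = buf.getInt(endRecord + 16); // where the Central directory starts
        int descriptor = centralDirOffset - 16;            // descriptor of the only entry sits right before it
        return new long[] {
            buf.getInt(descriptor) & 0xFFFFFFFFL,
            buf.getInt(descriptor + 8) & 0xFFFFFFFFL,
            buf.getInt(descriptor + 12) & 0xFFFFFFFFL
        };
    }

    public static void main(String[] args) {
        InputStream source = new ByteArrayInputStream("hello zip".getBytes(StandardCharsets.UTF_8));
        long[] d = dataDescriptor(zipFromStream("data.txt", source));
        System.out.println("signature=0x" + Long.toHexString(d[0])
                + " compressed=" + d[1] + " uncompressed=" + d[2]);
    }
}
```

Because the entry is DEFLATED, a top-down reader can still find the end of the content and then pick up the sizes from the descriptor.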

To make matters more confusing, some archiving tools put the sizes in Data descriptors even if they compress files that they know everything about.

So if we control both the party that archives and the party that unarchives, we can control the format and choose streaming if needed. But if the source of the ZIP is unknown (e.g. a user uploads it), then using ZipFile is the only option that guarantees successful unzipping.