ZJONSSON / node-unzipper

node.js cross-platform unzip using streams
Other
435 stars 116 forks source link

unzipper 0.11.5 completely breaks reading of docx files #311

Closed jribbens closed 4 months ago

jribbens commented 4 months ago

Summary

unzipper 0.11.4 and earlier works fine for reading DOCX files (which are zip files). 0.11.5 however is completely broken and crashes with Error: FILE_ENDED.

Sample code

import { readFile } from 'node:fs/promises'
import * as unzipper from 'unzipper'
const ARCHIVE = 'example0.docx'
const FILENAME = 'word/_rels/document.xml.rels'
console.log('opening archive')
const archive = await unzipper.Open.buffer(await readFile(ARCHIVE))
console.log('finding file')
const file = archive.files.find(file => file.path === FILENAME)
console.log('reading file')
const contents = await file.buffer()
console.log('file length', contents.length)

Output under 0.11.4

opening archive
finding file
reading file
file length 817

Output under 0.11.5

opening archive
finding file
reading file
node:internal/process/esm_loader:97
    internalBinding('errors').triggerUncaughtException(
                              ^

Error: FILE_ENDED
    at PullStream.pull (/home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:78:28)
    at PullStream.emit (node:events:514:28)
    at PullStream.<anonymous> (/home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:15:10)
    at PullStream.emit (node:events:526:35)
    at finish (node:internal/streams/writable:748:10)
    at finishMaybe (node:internal/streams/writable:733:9)
    at afterWrite (node:internal/streams/writable:507:3)
    at onwrite (node:internal/streams/writable:480:7)
    at cb (/home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:38:14)
    at /home/jon/src/unzipper-test/node_modules/unzipper/lib/PullStream.js:71:91
    at afterWrite (node:internal/streams/writable:500:5)
    at afterWriteTick (node:internal/streams/writable:487:10)
    at process.processTicksAndRejections (node:internal/process/task_queues:81:21)

Node.js v18.17.1

Example input file

The example input file is downloadable here: https://unequivocal.eu/dl/example0.docx

jpambrun commented 4 months ago

May be because of this case image

The way I read it, there could be a file footer if the crc and length are not known ahead of time. This footer could be of length between 12 to 24 bytes.

ZJONSSON commented 4 months ago

So should we just bump the buffer to something like 100 bytes? Will check out on .docx file

ZJONSSON commented 4 months ago

So I created a simple docx file and noticed a very annoying discrepancy. It seems that the central directory shows an extraFieldLength of 0 but the local file header is showing a positive length. This is a violation of the spec, as the central directory is supposed to be the source of truth. There are two ways to approach this:

The second option is a little bit annoying to implement since we are still using .then and we can't simply return a new call to unzip. It might be worthwhile changing function to async/await to make it easier

ZJONSSON commented 4 months ago

@jribbens @jpambrun can you please checkout https://github.com/ZJONSSON/node-unzipper/blob/c80fd62cd8d387da247ea6b1802c34ae14ee838e/lib/Open/unzip.js#L37-L46 (part of https://github.com/ZJONSSON/node-unzipper/pull/312)

Here, I compare the size calculated from the local header to the size requested using the central directory and retry the unzip method if the local header points to a larger stream.

jpambrun commented 4 months ago

It's an interesting approach, but I am thinking in a file where all entries are like that it would stream the whole file twice and the performance will be cut in half?

Another idea is to have configurable extra padding and error and when it's not enough with an helpful message (e.g. Local header for entry XYZ is 76 bytes larger than specidied in the central directory, conside using a padding of >76 or more.)? And let people tweak this for their application use case?

Yet another idea is to continiue with the short stream, issue a new range request only for the missing bytes and switch the downloadding streams at the end. That feels a bit complex and error prone.

ZJONSSON commented 4 months ago

In this PR, if you look more closely, I have made the padding configurable but default to 1k. https://github.com/ZJONSSON/node-unzipper/blob/ca37f3b3c9a6ac4b5e33a4d70e4d9e341851a470/lib/Open/directory.js#L202-L208

So checking for localSize is just a fallback for when padding is not enough (and it also emits a streamRetry event that you can listen to).

jribbens commented 4 months ago

@ZJONSSON can confirm this fixes the problem in my tests

Downchuck commented 1 day ago

While a minor point -- padding can't be set to zero in this case as it'll return to 1000.