ZJONSSON / node-unzipper

node.js cross-platform unzip using streams
Other
436 stars 116 forks source link

PullStream.js causes Uncaught, unspecified "error" event. (FILE_ENDED) #18

Closed guillaumerxl closed 7 years ago

guillaumerxl commented 7 years ago

Hello,

I made a script that downloads the sirene_201612_L_M.zip from http://files.data.gouv.fr/sirene/ that contains a .csv file (file size: 8.5G). The script throws the following error:

events.js:165
      throw err;
      ^

Error: Uncaught, unspecified "error" event. (FILE_ENDED)
    at PassThrough.emit (events.js:163:17)
    at Parse.pull (/home/guillaume/dev/projects/clients/siren-api/node_modules/unzipper/lib/PullStream.js:80:11)
    at emitOne (events.js:96:13)
    at Parse.emit (events.js:188:7)
    at Parse.<anonymous> (/home/guillaume/dev/projects/clients/siren-api/node_modules/unzipper/lib/PullStream.js:19:10)
    at emitNone (events.js:91:20)
    at Parse.emit (events.js:185:7)
    at finishMaybe (_stream_writable.js:514:14)
    at afterWrite (_stream_writable.js:388:3)
    at onwrite (_stream_writable.js:378:7)
    at Immediate.WritableState.onwrite (_stream_writable.js:89:5)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

If I refer to the code, I see that the part throwing the error is here: https://github.com/ZJONSSON/node-unzipper/blob/master/lib/PullStream.js#L78.

But the file seems to be entirely downloaded so I don't understand what could be the problem.

Here is the part of my code:

const unzipper = require('unzipper')
const request  = require('superagent')
const fs       = require('fs')

module.exports = function (url, filename) {
  return new Promise((resolve, reject) => {
    request.get(url)
      .pipe(unzipper.Parse())
      .on('entry', entry => {
        entry.pipe(fs.createWriteStream(`data/${filename}.csv`))
      })
      .promise()
      .then(
        () => {
          console.log('done')
          resolve({
           url: url,
           path: `data/${filename}.csv`
         })
        },
        e => {
          reject(err)
        }
      )
  })
}

Can you help me to understand if it's my code, a bug, a real problem with the file or another reason please? Thanks!

ZJONSSON commented 7 years ago

Thanks for the report - this seems to be some sort of race condition (definitely a bug that needs fixing). Will investigate

ZJONSSON commented 7 years ago

I have addressed the race condition here: https://github.com/ZJONSSON/node-unzipper/pull/19 - published as unzipper@0.7.4. Can you please verify that your code works now?

Here are other comments:

Here is an example of how the resolve is triggered only when the file-write has finished (not just the only when the file-write has ended:

module.exports = function (url, filename) {
  return new Promise((resolve, reject) => {
    request.get(url)
      .pipe(unzipper.ParseOne())
      .pipe(fs.createWriteStream(`data/${filename}.csv`))
      .on('error',reject)
      .on('finish',resolve({
           url: url,
           path: `data/${filename}.csv`
        });
  })
}

or using the etl library:


module.exports = function (url, filename) {
   return request.get(url)
      .pipe(unzipper.ParseOne())
      .pipe(etl.toFile(`data/${filename}.csv`))
      .promise()
      .then( () => ({
        url: url,
        path: `data/${filename}.csv`
      });
}
guillaumerxl commented 7 years ago

I pulled the repo and it works with my first code. However, your examples doesn't work, they don't finish (no errors). I will investigate further.

Thanks for your help and your work!

ZJONSSON commented 7 years ago

It seems that the original fix was wrong. The content length according to the entry header is greater than the actual length of the data -> we read past the end of the file and never finish (although we still manage to pipe all content for this single file downstream). Needs further debugging

ZJONSSON commented 7 years ago

The issue here is that the size information was stored in the 64bit extra information. Here is a fix - further debugging might be required: https://github.com/ZJONSSON/node-unzipper/commit/22ad1092aa6e1b1200e7de766e17c1dbdd2ce53b See also: https://github.com/cthackers/adm-zip/commit/5e4e4f8fcac5e9d84edd09f1ce5c76d74016b8bd

ZJONSSON commented 7 years ago

I have resolved the 64bit case but still ended up with a bad signature after the first EndOfDirectory record for your example file. I added a small hack to ignore the EndOfDirectory, so the examples above should work with: unzipper.ParseOne({bypassDirectory:true}); or unzipper.Parse({bypassDirectory:true}). The finish events should emit properly now (published as unzipper@0.7.5)

Please let me know if this works and reopen if the issues persist.

ZJONSSON commented 7 years ago

This should be a more general solution https://github.com/ZJONSSON/node-unzipper/pull/22/commits/d5db375339d5bac6f854a2f2b73f18c87e5371b8

No need to bypassDirectory - we will simply bypass any signature issues if-and-only-if we are in the central directory and there is really no more data to extract anyway.

included in unzipper@0.7.6

guillaumerxl commented 7 years ago

Okay I'll try that tomorrow and give you a feedback. I didn't find the time to check your updates.

Thanks!

guillaumerxl commented 7 years ago

Hello, I tried with the two solutions you gave me and it works well!

Thanks again!