PullStream.js causes Uncaught, unspecified "error" event. (FILE_ENDED)

guillaumerxl commented 7 years ago

Hello,

I made a script that downloads the sirene_201612_L_M.zip from http://files.data.gouv.fr/sirene/ that contains a .csv file (file size: 8.5G). The script throws the following error:

events.js:165
      throw err;
      ^

Error: Uncaught, unspecified "error" event. (FILE_ENDED)
    at PassThrough.emit (events.js:163:17)
    at Parse.pull (/home/guillaume/dev/projects/clients/siren-api/node_modules/unzipper/lib/PullStream.js:80:11)
    at emitOne (events.js:96:13)
    at Parse.emit (events.js:188:7)
    at Parse.<anonymous> (/home/guillaume/dev/projects/clients/siren-api/node_modules/unzipper/lib/PullStream.js:19:10)
    at emitNone (events.js:91:20)
    at Parse.emit (events.js:185:7)
    at finishMaybe (_stream_writable.js:514:14)
    at afterWrite (_stream_writable.js:388:3)
    at onwrite (_stream_writable.js:378:7)
    at Immediate.WritableState.onwrite (_stream_writable.js:89:5)
    at runCallback (timers.js:637:20)
    at tryOnImmediate (timers.js:610:5)
    at processImmediate [as _immediateCallback] (timers.js:582:5)

If I refer to the code, I see that the part throwing the error is here: https://github.com/ZJONSSON/node-unzipper/blob/master/lib/PullStream.js#L78.

But the file seems to be entirely downloaded so I don't understand what could be the problem.

Here is the part of my code:

const unzipper = require('unzipper')
const request  = require('superagent')
const fs       = require('fs')

module.exports = function (url, filename) {
  return new Promise((resolve, reject) => {
    request.get(url)
      .pipe(unzipper.Parse())
      .on('entry', entry => {
        entry.pipe(fs.createWriteStream(`data/${filename}.csv`))
      })
      .promise()
      .then(
        () => {
          console.log('done')
          resolve({
           url: url,
           path: `data/${filename}.csv`
         })
        },
        e => {
          reject(err)
        }
      )
  })
}

Can you help me to understand if it's my code, a bug, a real problem with the file or another reason please? Thanks!

ZJONSSON commented 7 years ago

Thanks for the report - this seems to be some sort of race condition (definitely a bug that needs fixing). Will investigate

ZJONSSON commented 7 years ago

I have addressed the race condition here: https://github.com/ZJONSSON/node-unzipper/pull/19 - published as unzipper@0.7.4. Can you please verify that your code works now?

Here are other comments:

You are not guaranteed that the writeStream has finished when the promise resolved (only that the zip file has been read).
If you are only reading one file you could use parseOne.
You get the underlying filename in entry.path if that is helpful for the output filename

Here is an example of how the resolve is triggered only when the file-write has finished (not just the only when the file-write has ended:

module.exports = function (url, filename) {
  return new Promise((resolve, reject) => {
    request.get(url)
      .pipe(unzipper.ParseOne())
      .pipe(fs.createWriteStream(`data/${filename}.csv`))
      .on('error',reject)
      .on('finish',resolve({
           url: url,
           path: `data/${filename}.csv`
        });
  })
}

or using the etl library:


module.exports = function (url, filename) {
   return request.get(url)
      .pipe(unzipper.ParseOne())
      .pipe(etl.toFile(`data/${filename}.csv`))
      .promise()
      .then( () => ({
        url: url,
        path: `data/${filename}.csv`
      });
}

guillaumerxl commented 7 years ago

I pulled the repo and it works with my first code. However, your examples doesn't work, they don't finish (no errors). I will investigate further.

Thanks for your help and your work!

ZJONSSON commented 7 years ago

It seems that the original fix was wrong. The content length according to the entry header is greater than the actual length of the data -> we read past the end of the file and never finish (although we still manage to pipe all content for this single file downstream). Needs further debugging

ZJONSSON commented 7 years ago

The issue here is that the size information was stored in the 64bit extra information. Here is a fix - further debugging might be required: https://github.com/ZJONSSON/node-unzipper/commit/22ad1092aa6e1b1200e7de766e17c1dbdd2ce53b See also: https://github.com/cthackers/adm-zip/commit/5e4e4f8fcac5e9d84edd09f1ce5c76d74016b8bd

ZJONSSON commented 7 years ago

I have resolved the 64bit case but still ended up with a bad signature after the first EndOfDirectory record for your example file. I added a small hack to ignore the EndOfDirectory, so the examples above should work with: unzipper.ParseOne({bypassDirectory:true}); or unzipper.Parse({bypassDirectory:true}). The finish events should emit properly now (published as unzipper@0.7.5)

Please let me know if this works and reopen if the issues persist.

ZJONSSON commented 7 years ago

This should be a more general solution https://github.com/ZJONSSON/node-unzipper/pull/22/commits/d5db375339d5bac6f854a2f2b73f18c87e5371b8

No need to bypassDirectory - we will simply bypass any signature issues if-and-only-if we are in the central directory and there is really no more data to extract anyway.

included in unzipper@0.7.6

guillaumerxl commented 7 years ago

Okay I'll try that tomorrow and give you a feedback. I didn't find the time to check your updates.

Thanks!

guillaumerxl commented 7 years ago

Hello, I tried with the two solutions you gave me and it works well!

Thanks again!

ZJONSSON / node-unzipper

PullStream.js causes Uncaught, unspecified "error" event. (FILE_ENDED) #18