ZJONSSON / node-unzipper

node.js cross-platform unzip using streams

Unexpected end of file Zlib zlibOnError 5 Z_BUF_ERROR #149

Open franky-continu opened 5 years ago

franky-continu commented 5 years ago

Hi guys, I have code that has been working for some time now, but it seems we are now receiving a larger-than-usual file, about 140 MB.

We unzip SCORM packages (zip files) before uploading them to AWS.

var new_file_path = files.file.path.substr(0, files.file.path.lastIndexOf("/")) + '/' + files.file.name;
fs.renameSync(files.file.path, new_file_path);
var extract_to_directory = files.file.path.substr(0, files.file.path.lastIndexOf("/")) + '/' + scormContent._id.toString();
fs.createReadStream(new_file_path).pipe(unzipper.Extract({path: extract_to_directory, concurrency: 5})).on('error', function(error) {
  console.log('Scorm (self hosted) error unzipping files:', error);
  return reject({status: 422, message: 'Scorm (self hosted) error unzipping files:', error: error});
}).on('finish', function() {
  //..... Code to upload to AWS
});

I keep getting:

{ Error: unexpected end of file
    at Zlib.zlibOnError [as onerror] (zlib.js:134:17) errno: -5, code: 'Z_BUF_ERROR' }
ZJONSSON commented 5 years ago

@franky-continu thanks for the report. There are two approaches to unzipping: either streaming (which reads the zip file sequentially, relying on the local file headers without looking at the central directory) or Open (which starts by reading the central directory and ignores the local file headers). The Open methods are generally safer, as the central directory is considered the true metadata for the zip file.
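
For illustration, a minimal sketch of the two styles (the paths here are placeholders, not from the report above):

const fs = require('fs');
const unzipper = require('unzipper');

// Streaming: reads the archive sequentially, trusting the local file headers
fs.createReadStream('archive.zip')
  .pipe(unzipper.Extract({path: 'output'}))
  .on('close', () => console.log('stream extract done'));

// Open: reads the central directory first, then extracts from it
unzipper.Open.file('archive.zip')
  .then(directory => directory.extract({path: 'output'}))
  .then(() => console.log('Open extract done'));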

Also, please make sure you are using the latest version, unzipper@0.10.4, in case any file data is stored as zip64 (see https://github.com/ZJONSSON/node-unzipper/pull/146).

The extract method under Open returns a promise and can be found here

If you have an example zip file that is not sensitive, please feel free to share a link to the file with me (ziggy.jonsson.nyc@gmail.com) and I can help debug it manually.

franky-continu commented 5 years ago

First of all, thank you so much for the rapid and thorough response.

I am using the latest version, and I'm going to try the Open example.

I'll send you a link to the file if this doesn't work, and I'll add a report on the results as well in case others are also looking at an issue like this one.

Thank you so so much.

franky-continu commented 5 years ago

Same error:


var new_file_path = files.file.path.substr(0, files.file.path.lastIndexOf("/")) + '/' + files.file.name;
fs.renameSync(files.file.path, new_file_path);
var extract_to_directory = files.file.path.substr(0, files.file.path.lastIndexOf("/")) + '/' + scormContent._id.toString();

//fs.createReadStream(new_file_path).pipe(unzipper.Extract({path: extract_to_directory})).on('error', function(error) {
unzipper.Open.file(new_file_path).then(directory => directory.extract({path: extract_to_directory, concurrency: 5})).then(function() {
  //.........
  //.........
}).catch(function(error) {
  console.log('Scorm (self hosted) error unzipping files:', error);
  return reject({status: 422, message: 'Scorm (self hosted) error unzipping files:', error: error});
});

I'll send you a link.

LOG:

Scorm (self hosted) error unzipping files: { Error: unexpected end of file
    at Zlib.zlibOnError [as onerror] (zlib.js:134:17) errno: -5, code: 'Z_BUF_ERROR' }
20:33:46.978Z  INFO continu-api: Outcoming Response
  HTTP/1.1 422 Unprocessable Entity
  Server: continu-api
  access-control-allow-origin: https://local.continu.co
  vary: origin
  access-control-expose-headers: Authorization, api-version, content-length, content-md5, content-type, date, request-id, response-time
  Content-Encoding: gzip
  Content-Type: application/json
ZJONSSON commented 5 years ago

Thanks for all the detail here @franky-continu. It appears that some files had the fileSizeKnown flag set to false but a non-zero compressedSize. Can you please check out https://github.com/ZJONSSON/node-unzipper/pull/150 (npm install zjonsson/node-unzipper#filesizeknown) and see if the problem is resolved?

franky-continu commented 5 years ago

OMG I was just about to post this:

alias npm='node --max_old_space_size=8000 /usr/bin/npm'

It's about giving the process more memory.

I'll check out this version right now.

franky-continu commented 5 years ago

Hmm, so far it hasn't. My package.json entry (I also manually installed it with npm install ....... --save) is "unzipper": "github:zjonsson/node-unzipper#filesizeknown".

I'll try the promise-based Open approach from before with this new library.

I ran a cache verify just in case:

Cache verified and compressed (~/.npm/_cacache):
Content verified: 3096 (184932613 bytes)
Content garbage-collected: 1530 (37022149 bytes)
Index entries: 5172
Finished in 10.615s
ZJONSSON commented 5 years ago

For reference, here is my test code that fails on master but works with zjonsson/node-unzipper#filesizeknown

const unzipper = require('unzipper');

async function main() {
  const directory = await unzipper.Open.file('./scorm-file-test.zip');
  await directory.extract({path: '/tmp/scorm-file'});
  return 'done';
}

main().then(console.log, console.log);

In your code you could also try dropping the concurrency option.

franky-continu commented 5 years ago

Nice, looks awesome.

In my code with the .Open.file(...) promise, I get a constant output of this:

19:54:02.581Z ERROR continu-api:
  HiddenException: { Error: EMFILE: too many open files, open '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip'
    errno: -24,
    code: 'EMFILE',
    syscall: 'open',
    path: '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip' }

I'm going to add the function as you have it, with two arguments (file, directory)

franky-continu commented 5 years ago

So here is my code:

async function extract(file, directory) {
  const zipDirectory = await unzipper.Open.file(file);
  await zipDirectory.extract({path: directory});
  return 'done';
}

extract(new_file_path, extract_to_directory).then(function () {
          // Checking old DOS library constructed ZIP files and (in this case) SCORM Files
          // Rename the files, moving them to the previously created directories
          fs.readdirSync(extract_to_directory).forEach(function(file) {
...... etc
...many...many...many
19:59:45.547Z ERROR continu-api:
  HiddenException: { Error: EMFILE: too many open files, open '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip'
    errno: -24,
    code: 'EMFILE',
    syscall: 'open',
    path: '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip' }
19:59:45.547Z ERROR continu-api:
  HiddenException: { Error: EMFILE: too many open files, open '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip'
    errno: -24,
    code: 'EMFILE',
    syscall: 'open',
    path: '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip' }
19:59:45.547Z ERROR continu-api:
  HiddenException: { Error: EMFILE: too many open files, open '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip'
    errno: -24,
    code: 'EMFILE',
    syscall: 'open',
    path: '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/HIPAA_-_Privacy_Rules_for_Business_Associates 2.zip' }
ZJONSSON commented 5 years ago

Yeah, this is another issue which we should address. We open a new file descriptor every time we read from the zip file and thus can run out of file descriptors. The (potentially) right solution here is to open only as many descriptors as our concurrency allows and reuse them for subsequent reads.

In the meantime you can try increasing the number of file descriptors available to the process, or check out the graceful-fs branch: npm install zjonsson/node-unzipper#graceful-fs

franky-continu commented 5 years ago

Hmmm, no luck. I thought we had it, but I found a hidden exception being thrown.

 HiddenException: { Error: EMFILE: too many open files, open '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/5d6f8d89d8803b84902454e5/HIPAA_-_Privacy_Rules_for_Business_Associates/imsmanifest.xml'
      at Object.fs.openSync (fs.js:663:18)
      at Object.fs.readFileSync (fs.js:568:33)
      at EventEmitter.<anonymous> (/Users/franky/Workspace/continu/continu-api/lib/modules/content/scorm-manager.js:416:38)
      at EventEmitter.emit (events.js:159:13)
      at startUpload (/Users/franky/Workspace/continu/continu-api/node_modules/s3-node-client/lib/index.js:1188:12)
      at uploadLocalFile (/Users/franky/Workspace/continu/continu-api/node_modules/s3-node-client/lib/index.js:1163:9)
      at checkDoMoreWork (/Users/franky/Workspace/continu/continu-api/node_modules/s3-node-client/lib/index.js:1041:9)
      at MultipartETag.<anonymous> (/Users/franky/Workspace/continu/continu-api/node_modules/s3-node-client/lib/index.js:1304:9)
      at MultipartETag.emit (events.js:164:20)
      at endReadableNT (_stream_readable.js:1062:12)
      at process._tickCallback (internal/process/next_tick.js:152:19)
    errno: -24,
    code: 'EMFILE',
    syscall: 'open',
    path: '/var/folders/s9/clj8d12n1vv9c5x2wddm3fsc0000gn/T/5d6f8d89d8803b84902454e5/HIPAA_-_Privacy_Rules_for_Business_Associates/imsmanifest.xml' }

I'm going to try to increase the number of file descriptors available as soon as I can.

I wonder if it could be the Node version; I probably have an old one:

node -v
v9.3.0

ZJONSSON commented 5 years ago

Looking at the stacktrace, it looks like this error is not related to unzipper. Check continu-api/lib/modules/content/scorm-manager.js:416. You might want to use fs = require('graceful-fs') instead of fs = require('fs') in your code to see if you can avoid the "too many open files" error.

Unrelated: readFileSync is not the best way to read files, as it blocks the main thread. Better to do const readFileAsync = util.promisify(fs.readFile); and then let body = await readFileAsync(filename) inside an async function.
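
A minimal sketch combining both suggestions (the filename argument is just a placeholder):

const fs = require('graceful-fs'); // drop-in fs replacement that queues retries on EMFILE instead of throwing
const util = require('util');
const readFileAsync = util.promisify(fs.readFile);

async function readManifest(filename) {
  // non-blocking read; the event loop stays free while the file is read
  return readFileAsync(filename, 'utf8');
}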

Also: if you are uploading all the files separately to S3 you should not have to extract to disk first. You could simply:


const AWS = require('aws-sdk');

const s3 = new AWS.S3({
  accessKeyId: process.env.AWS_ACCESS_KEY,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});

// inside an async function:
for (let i in directory.files) {
  let file = directory.files[i];
  await s3.upload({
    Bucket: 'bucket',
    Key: file.path,
    Body: file.stream()
  }).promise();
}

Or use Bluebird.map with concurrency to upload multiple files at the same time.
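
For example, a sketch of that Bluebird variant (the bucket name and concurrency value are placeholders; it assumes the s3 client and directory from the snippet above, inside an async function):

const Bluebird = require('bluebird');

// upload at most 5 entries at a time, streaming each directly from the archive
await Bluebird.map(directory.files, file =>
  s3.upload({
    Bucket: 'bucket',
    Key: file.path,
    Body: file.stream()
  }).promise(),
  {concurrency: 5}
);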

ZJONSSON commented 5 years ago

Hey @franky-continu, just wanted to follow up: were you able to confirm whether the errors were outside of unzipper or not?

franky-continu commented 5 years ago

On it, sorry I was out.

> Looking at the stacktrace, it looks like this error is not related to unzipper. Check continu-api/lib/modules/content/scorm-manager.js:416. You might want to use fs = require('graceful-fs') instead of fs = require('fs') in your code to see if you can avoid the "too many open files" error.

You are right, but before I was using this code without fs:

async function extract(file, directory) {
  const zipDirectory = await unzipper.Open.file(file);
  await zipDirectory.extract({path: directory});
  return 'done';
}

extract(new_file_path, extract_to_directory).then(function () {
          // Checking old DOS library constructed ZIP files and (in this case) SCORM Files
          // Rename the files, moving them to the previously created directories
          fs.readdirSync(extract_to_directory).forEach(function(file) {

But I'll try it out nonetheless.

> Unrelated: readFileSync is not the best way to read files, as it blocks the main thread. Better to do const readFileAsync = util.promisify(fs.readFile); and then let body = await readFileAsync(filename) inside an async function.

True, but I'm in the middle of a response, so I need to check a few things in the directory. Mainly a DOS-style file-naming issue that happens with SCORM packages: some content providers end up creating filenames with \ in them, as in content\files\image.jpg, where the entry name of image.jpg is literally content\files\image.jpg. I need to transform that into content/files/image.jpg, where content/ is a directory, files/ is a directory, and image.jpg is a file named just image.jpg. Kind of a bummer.
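
For what it's worth, a small sketch of that transformation (the entry name is just an example):

const path = require('path');

const entryName = 'content\\files\\image.jpg';        // DOS-style entry name from the zip
const normalized = entryName.split('\\').join('/');   // 'content/files/image.jpg'
const targetDir = path.dirname(normalized);           // 'content/files', the directories to create first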

> Also: if you are uploading all the files separately to S3 you should not have to extract to disk first. You could simply:

const AWS = require('aws-sdk');

const s3 = new AWS.S3({
  accessKeyId: process.env.AWS_ACCESS_KEY,
  secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY
});

// inside an async function:
for (let i in directory.files) {
  let file = directory.files[i];
  await s3.upload({
    Bucket: 'bucket',
    Key: file.path,
    Body: file.stream()
  }).promise();
}

> Or use Bluebird.map with concurrency to upload multiple files at the same time.

I'm pretty interested in this last one, if I can manage (and I think I can) to upload and rename objects from content\files\image.jpg, which with this last example seems like it will be even easier.

ZJONSSON commented 5 years ago

Hey @franky-continu, the fixes are now in master. But I still want to understand whether you are getting any errors from unzipper itself with those fixes in place. Can you please confirm and give more detail?

franky-continu commented 5 years ago

Hey @ZJONSSON, I haven't been able to come back to this issue; needless to say, I'm eager to do so soon.

But I wanted to comment on something that happened today. I left the version pointing to the latest, "unzipper": "^0.10.4", and we started getting reports today that SCORM packages were being uploaded but couldn't be viewed.

When we looked at it closely, we realized the unzipping had happened only partially but the end event was still emitted, so the requested file wasn't in S3.

I ended up downgrading to our previous version, ^0.9.10, as it was a bit of an urgent item and I didn't have time right then to do a proper test and follow up more.

Anyway, I just wanted to let you know, in case you hear from others on the matter and it is perhaps related to the latest change in master.

I will get back to this issue, to you, and to this thread as soon as possible, but I have two major priority items consuming all the time I have. Maybe on the weekend I'll have a moment. I'm really interested in solving this, or at least contributing to it, and I particularly like the unzip-and-upload-directly approach you so kindly showed me above.

trujaytim commented 4 years ago

Also hit the "unexpected end of file" Zlib error on v0.10.5 when using unzipper to extract files from streams; switching to Open() etc., as described above, resolved the issue. I am happy to test the fix once it's pushed to npm, just let me know (I can't share the zip file as it is confidential).

franky-continu commented 4 years ago

Hi @ZJONSSON, sorry for taking so long. We had a big feature request and I had to move away from this issue.

I'm trying to upload directly to S3: I upload a file to a restify endpoint that uses multer or something like that, which leaves a files.file property on the request with the zip file.

Then I'm trying to do this (a rough draft):


unzipper.Open.file(files.file.path).then(directory => {
        directory.files.forEach(async (file) => {
          try {
            console.log(file.path);
           // get the title from manifest file
            if (file.type === 'File' && file.path.indexOf('manifest.xml') !== -1) {
              var manifest_file = fs.readFileSync(file.path, 'utf8');
              var title_results = (manifest_file.match(/<title>(.*?)<\/title>/g) || []).map(function(val) {
                return val.replace(/<\/?title>/g, '');
              });
              if (title_results && title_results.length > 0) {
                scormContent.title = title_results[0];
              }
            }
            let uploaded = await s3.upload({Bucket: 'c-scorm', Key: file.path, Prefix: company + '/' + scormContent._id.toString() + '/', Body: file.stream(), ACL: 'public-read'});
            console.log(`Uploaded... ${uploaded}`);
          } catch (e) {
            console.log('Error uploading', e);
            return reject({status: 422, message: 'Error in creating scormContent.', error: e});
          }
        });
        if (scormContent.isModified()) {
          scormContent.save(function(err) {
            if (err) {
              console.log('Could not update scorm-content with manifest.xml.', err);
              return reject(err);
            }
            return resolve(scormContent);
          });
        } else {
          return resolve(scormContent);
        }
      });

But I'm afraid I'm getting the "too many open files" error:

Error uploading { Error: EMFILE: too many open files, open 'HIPAA_-_Privacy_Rules_for_Business_Associates/Content/cca/lchp_01_a02_lc_enus/imsmanifest.xml'
    at Object.fs.openSync (fs.js:663:18)
    at Object.fs.readFileSync (fs.js:568:33)
    at directory.files.forEach (/Users/franky/Workspace/continu/continu-api/lib/modules/content/scorm-manager.js:362:38)
    at Array.forEach (<anonymous>)
    at unzipper.Open.file.then.directory (/Users/franky/Workspace/continu/continu-api/lib/modules/content/scorm-manager.js:358:25)

Also, where I'm trying to get the title for the SCORM package, I would need to inflate that first, right?

            if (file.type === 'File' && file.path.indexOf('manifest.xml') !== -1) {
             /// here some kind of inflate() extract()
              var manifest_file = fs.readFileSync(file.path, 'utf8');
              var title_results = (manifest_file.match(/<title>(.*?)<\/title>/g) || []).map(function(val) {
                return val.replace(/<\/?title>/g, '');
              });
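
A sketch of what that could look like using the entry's buffer() method from the Open API instead of fs.readFileSync (assuming file is an entry from directory.files and this runs inside an async function):

if (file.type === 'File' && file.path.indexOf('manifest.xml') !== -1) {
  // inflate the entry from the archive into memory; nothing needs to exist on disk
  const manifest_file = (await file.buffer()).toString('utf8');
  const title_results = (manifest_file.match(/<title>(.*?)<\/title>/g) || []).map(function(val) {
    return val.replace(/<\/?title>/g, '');
  });
  if (title_results.length > 0) {
    scormContent.title = title_results[0];
  }
}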
franky-continu commented 4 years ago

It is quite late. I've tried adm-zip and was able to unzip the file.

It is most likely not as fast, given that it is a pure JavaScript implementation, I believe.

franky-continu commented 4 years ago

I believe we would really benefit from having an unzipper.extractToS3(.....,....);

This is the final (for now) version; it is not very optimized, but it does work with every conflicting zip file so far:

ScormContent.create(body, function(err, scormContent) {
      if (err) {
        try_directory_cleanup(files.file.path);
        return reject({status: 422, message: 'Error in creating scormContent.', error: err});
      }
      if (!process.env.AWS_ACCESS_KEY_ID || !process.env.AWS_SECRET_ACCESS_KEY) {
        console.log('AWS credentials missing.');
        try_directory_cleanup(files.file.path);
        return reject({status: 422, message: 'AWS credentials missing.'});
      }

      // Rename path to account for certain IE uploads that have caused issues
      var new_file_path = files.file.path.substr(0, files.file.path.lastIndexOf('/')) + '/' + files.file.name;
      fs.renameSync(files.file.path, new_file_path);
      var extract_to_directory = files.file.path.substr(0, files.file.path.lastIndexOf('/')) + '/' + scormContent._id.toString();
      // Obtain Scorm Title from any of the possible manifest files
      var zip = new AdmZip(new_file_path);
      var zipEntries = zip.getEntries();

      zipEntries.forEach(function(zipEntry) {
        if (zipEntry.entryName.indexOf('manifest.xml') !== -1) {
          var manifest_file = zipEntry.getData().toString('utf8');
          var title_results = (manifest_file.match(/<title>(.*?)<\/title>/g) || []).map(function(val) {
            return val.replace(/<\/?title>/g, '');
          });
          if (title_results && title_results.length > 0) {
            scormContent.title = title_results[0];
          }
        }
      });

      zip.extractAllTo(extract_to_directory, true);

      // Checking old DOS library constructed ZIP files and (in this case) SCORM Files 
      // which creates an issue by which Zip Entries have backward slashes \ in the file name that
      // need to be transformed into inexistent directories
      // So, we rename the files, moving them to the previously created directories
      fs.readdirSync(extract_to_directory).forEach(function(file) {
        // Verify DOS path item \
        if (fs.lstatSync(extract_to_directory + '/' + file).isFile() && file.includes('\\')) {
          let file_dirs_and_names = file.split('\\');
          if (file_dirs_and_names.length > 1) {
            let new_directory = extract_to_directory + '/' + file_dirs_and_names.slice(0, -1).join('/');
            let normalize_file = extract_to_directory + '/' + file.replace(/\\/g, '/');
            try {
              // try creating directories
              if (!fs.existsSync(new_directory)) {
                fs.mkdirSync(new_directory, {recursive: true});
              }
              fs.renameSync(extract_to_directory + '/' + file, normalize_file);
            } catch (error) {
              console.log('Error normalizing file', normalize_file, error);
              try_directory_cleanup(extract_to_directory);
              return reject({status: 422, message: 'Scorm (self hosted) error normalizing file:', error: error});
            }
          }
        }
      });
      console.log('finished unzipping');
      var client = s3.createClient({
        maxAsyncS3: 20, // this is the default
        s3RetryCount: 3, // this is the default
        s3RetryDelay: 1000, // this is the default
        multipartUploadThreshold: 2097152000, // ~2 GB
        multipartUploadSize: 1572864000, // ~1.5 GB
        s3Options: {
          accessKeyId: process.env.AWS_ACCESS_KEY_ID,
          secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
          region: 'us-west-2'
        }
      });
      var params = {
        localDir: extract_to_directory,
        deleteRemoved: false, // default false, whether to remove s3 objects
        // that have no corresponding local file.
        s3Params: {
          Bucket: 'c-scorm',
          Prefix: company + '/' + scormContent._id.toString() + '/',
          ACL: 'public-read'
        }
      };
      var uploader = client.uploadDir(params);
      uploader.on('error', function(error) {
        console.log('Scorm (self hosted) error uploading files:', error);
        try_directory_cleanup(extract_to_directory);
        return reject({status: 422, message: 'Scorm (self hosted) error uploading files:', error: error});
      });
      uploader.on('end', function() {
        console.log('Successfully uploaded scorm package');
        try_directory_cleanup(extract_to_directory);
        if (scormContent.isModified()) {
          scormContent.save(function(err) {
            if (err) {
              console.log('Could not update scorm-content with manifest.xml.', err);
            }
          });
          segmentationUtils.remove_all_label_from_privacy_fields(scormContent);
          if (body.privacy_collaborators && body.privacy_collaborators.length > 0) {
            collaborators.setAndNotify(null, scormContent);
          }
          return resolve(scormContent);
        } else {
          segmentationUtils.remove_all_label_from_privacy_fields(scormContent);
          if (body.privacy_collaborators && body.privacy_collaborators.length > 0) {
            collaborators.setAndNotify(null, scormContent);
          }
          return resolve(scormContent);
        }
      });
    }); // end

And the directory-removal function:

// in my code this array is much much larger
// but for the purpose of adding it here I removed most of the entries.
var sensitive_paths = [
  '/usr/lib', //    Library files searched by the linker when programs are compiled
  '/usr/local/bin', //  Common executable application files local to this system
  '/usr/sbin', //   Commands used by the super user for system administrative functions
  '/var'//  Administrative files such as log files, locks, spool files, and temporary files used by various utilities
];
var try_directory_cleanup = function(path) {
  try {
    if (!path || path === '' || path === '/' || sensitive_paths.includes(path)) {
      console.log('Invalid path sent for removal.', path);
      return;
    }
    if (!fs.existsSync(path)) {
      console.log('Inexistent path sent for removal.', path);
      return;
    }
    fs.readdirSync(path).forEach(function(file) {
      let curPath = path.endsWith('/') ? path + file : (path + '/' + file);
      if (fs.lstatSync(curPath).isDirectory()) {
        try_directory_cleanup(curPath);
      } else {
        fs.unlinkSync(curPath);
      }
    });
    fs.rmdirSync(path);
  } catch (error) {
    console.log('Error removing path:', path, error);
  }
};
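
As an aside, on newer Node versions (14.14+) the manual recursive walk can be replaced by the built-in recursive removal; a sketch keeping the same guard logic (the helper name is arbitrary):

var try_directory_cleanup_rm = function(path) {
  if (!path || path === '' || path === '/' || sensitive_paths.includes(path)) {
    console.log('Invalid path sent for removal.', path);
    return;
  }
  // recursive: true removes contents first; force: true ignores missing paths
  fs.rmSync(path, {recursive: true, force: true});
};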
sumedh-nimkarde-qatalog commented 1 year ago

@ZJONSSON I believe that since this is already merged in master, we should not get the error. But I am trying to do something like:

  const directory = await unzipper.Open.file(absoluteFilePath);
  await directory.extract({ path: absoluteFolderPath });

and I'm getting "unexpected end of file". The docs also mention that this is how we should use it. Is there anything else I am missing?

The zip file size in my case is 120 MB.

Update: ignore my comment above; it seems my zip file was corrupted.

aprilandjan commented 4 weeks ago

Thanks to all the contributors for this handy library. We've used it for several months. However, we encountered this problem in our recent daily build artifact without knowing the cause, since the artifact contains only some regular changes 🤔