101arrowz / fflate

High performance (de)compression in an 8kB package
https://101arrowz.github.io/fflate
MIT License

Add "extra fields" support #82

Closed mStirner closed 3 years ago

mStirner commented 3 years ago

It would be awesome to add support for reading & writing the "extra fields" from the specification: https://datatracker.ietf.org/doc/html/rfc1952#page-8

 2.2. File format

      A gzip file consists of a series of "members" (compressed data
      sets).  The format of each member is specified in the following
      section.  The members simply appear one after another in the file,
      with no additional information before, between, or after them.

   2.3. Member format

      Each member has the following structure:

         +---+---+---+---+---+---+---+---+---+---+
         |ID1|ID2|CM |FLG|     MTIME     |XFL|OS | (more-->)
         +---+---+---+---+---+---+---+---+---+---+

      (if FLG.FEXTRA set)

         +---+---+=================================+
         | XLEN  |...XLEN bytes of "extra field"...| (more-->)
         +---+---+=================================+

      (if FLG.FNAME set)

         +=========================================+
         |...original file name, zero-terminated...| (more-->)
         +=========================================+

      (if FLG.FCOMMENT set)

         +===================================+
         |...file comment, zero-terminated...| (more-->)
         +===================================+

      (if FLG.FHCRC set)

         +---+---+
         | CRC16 |
         +---+---+

         +=======================+
         |...compressed blocks...| (more-->)
         +=======================+

           0   1   2   3   4   5   6   7
         +---+---+---+---+---+---+---+---+
         |     CRC32     |     ISIZE     |
         +---+---+---+---+---+---+---+---+
2.3.1.1. Extra field

         If the FLG.FEXTRA bit is set, an "extra field" is present in
         the header, with total length XLEN bytes.  It consists of a
         series of subfields, each of the form:

            +---+---+---+---+==================================+
            |SI1|SI2|  LEN  |... LEN bytes of subfield data ...|
            +---+---+---+---+==================================+

         SI1 and SI2 provide a subfield ID, typically two ASCII letters
         with some mnemonic value.  Jean-Loup Gailly
         <gzip@prep.ai.mit.edu> is maintaining a registry of subfield
         IDs; please send him any subfield ID you wish to use.  Subfield
         IDs with SI2 = 0 are reserved for future use.  The following
         IDs are currently defined:
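
For illustration, here is how a single subfield maps to bytes (just a rough sketch in plain JavaScript, not an fflate API; the ID "AP" and the payload are made-up values):

// Sketch only: one extra-field subfield as described in section 2.3.1.1 above.
const buildSubfield = (id, data) => {
  // id: two ASCII characters (SI1, SI2); data: Uint8Array with the subfield payload
  const out = new Uint8Array(4 + data.length);
  out[0] = id.charCodeAt(0);    // SI1
  out[1] = id.charCodeAt(1);    // SI2
  out[2] = data.length & 255;   // LEN, little-endian
  out[3] = data.length >> 8;
  out.set(data, 4);             // ...LEN bytes of subfield data...
  return out;
};

console.log(buildSubfield('AP', new Uint8Array([1, 2, 3, 4])));
// Uint8Array [65, 80, 4, 0, 1, 2, 3, 4]; XLEN would be the combined length of all such subfields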

Thanks in advance!

101arrowz commented 3 years ago

I've seen this issue but have been quite busy of late; I will look into implementing it when I have spare time.

mStirner commented 3 years ago

Hey @101arrowz, no problem. I hope things calm down for you soon. :)

mStirner commented 3 years ago

@101arrowz How are you doing? Is everything fine? Anything new on this issue/feature?

101arrowz commented 3 years ago

Do you need the ability to read this extra field from fflate as well? If so, unfortunately that's almost certainly not going to happen; it would require adding too much code for a rarely used feature. Instead, I will give you code to read/write extra fields on an existing GZIP file. If not, let me know and I'll implement it within the next 2-3 days.

mStirner commented 3 years ago

In my opinion, reading is not that hard. You just have to parse the header a little differently. But if you don't want to implement that, I'll do it on my own if necessary.

101arrowz commented 3 years ago

I was more concerned about having to rewrite the code to support returning both the decompressed data and the headers, but I think I've come up with a solution. It should be implemented in 2-3 days; if you don't hear from me by then, feel free to ping.

101arrowz commented 3 years ago

I've gotten a basic implementation working but it's added a pretty big chunk to bundle size for GZIP support (about 1kB, which is a lot when the total size for GZIP support was previously 5kB). If this were a compiled language, I could just add a feature flag and avoid adding the bloat unless it's needed, but GZIP is one of the most popular parts of the library and minimizing its bundle size is very important to me, so unfortunately I think it's not going to be released.

However, I just took another look at your initial message to me and it seems you want a CRC32 checksum of the uncompressed contents of the file, which the gzip format already includes by default. If so, you can just read the 4 bytes starting 8 bytes from the end of the file. For example:

import { gzipSync } from 'fflate';

// Data doesn't have to be from fflate, it just needs to be gzipped
const gzipData = gzipSync(someData);

// When you want to calculate the checksum:
const len = gzipData.length;
const checksum = (
  gzipData[len - 8]
  | (gzipData[len - 7] << 8)
  | (gzipData[len - 6] << 16)
  | (gzipData[len - 5] << 24)
) >>> 0;
console.log(checksum); // CRC32 of the uncompressed contents
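
The same trick works for the ISIZE field (the size of the uncompressed data mod 2^32), which sits in the last 4 bytes of the member per the footer layout quoted above; this is just a sketch reusing the variables from the snippet above:

// ISIZE: uncompressed length mod 2^32, little-endian, in the last 4 bytes
const isize = (
  gzipData[len - 4]
  | (gzipData[len - 3] << 8)
  | (gzipData[len - 2] << 16)
  | (gzipData[len - 1] << 24)
) >>> 0;
console.log(isize); // equals someData.length for inputs under 4GB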

By the way, most untar utilities that support .tar.gz files will call the zlib library to decompress the GZIP data, and thereby automatically verify that the CRC32 of the uncompressed data matches the CRC32 in the GZIP footer. So if you're doing this to avoid corruption, you don't need to worry; it's already handled for you.

So sorry for all the confusion and for taking so long on this. Let me know if you have any questions.

mStirner commented 3 years ago

First, thanks for your effort and your time; I really appreciate it.

For me, the most important part was writing/reading the extra header field. Not being familiar with the code, I can't fully understand why such a "simple" task needs this much extra code. I know it's not trivial, but it's not rocket science either. Again, I'm really happy that you tried and were willing to implement it.

What is so hard/difficult about returning the parsed header? What is so hard/difficult about creating/writing the header with the extra (or all) fields? If it's possible to return/read the complete header, then everything else should not be hard at all. Some time ago I played with gzip, its header, and how to read it: https://stackoverflow.com/q/60790408/5781499

I'm sorry if this sounds rude or impolite. English is not my native language; I try to make my wishes/questions as clear/simple as possible.

I tried to do it by myself, but it's hard for me to get into the source code.

101arrowz commented 3 years ago

Just to clarify: the implementation was very easy; it's the bundle size cost that isn't really acceptable. A lot of code is involved in adding support for this due to the heavy optimizations in fflate.

I can give you some code to inject your extra field into an existing gzip file if you'd like. It will be a tiny bit slower due to the need to copy the entire buffer but at least it will work.

mStirner commented 3 years ago

It would be awesome if you could provide me some code.

In terms of size, don't you think it's a trade-off between size and full compatibility? I know/understand that it only matters for a few people, but it's part of the specification, which should in my opinion be fully implemented, no matter who uses it or how much :)

101arrowz commented 3 years ago

import { gzipSync } from 'fflate';
// This can only be called once on the gzipped buffer
const insertExtraFields = (data, fields) => {
  let fullLength = 2;
  for (const field in fields) {
    const si2 = field >> 8;
    if (!si2 || si2 > 255) {
      throw new Error('The extra IDs must be greater than 255 and less than 65536');
    }
    fullLength += fields[field].length + 4;
  }
  if (fullLength > 65535) {
    throw new Error(
      'The total size of the extra header is '
      + fullLength
      + ' bytes, but it must be at most 65535 bytes'
    );
  }
  const afterExtra = data.subarray(10);
  const out = new Uint8Array(10 + fullLength + afterExtra.length);
  out.set(data.subarray(0, 10));
  out.set(afterExtra, out.length-afterExtra.length);
  // Enable FEXTRA
  out[3] |= 4;
  out[10] = fullLength & 255;
  out[11] = fullLength >> 8;
  let b = 12;
  for (const field in fields) {
    out[b] = field & 255;
    out[b + 1] = field >> 8;
    const ext = fields[field];
    out[b + 2] = ext.length & 255;
    out[b + 3] = ext.length >> 8;
    out.set(ext, b + 4);
    b += ext.length + 4;
  }
  return out;
}

// The fields option is a mapping of numerical field IDs to buffers
const gzipWithExtraFields = insertExtraFields(gzipSync(yourData), {
  [0xABCD]: new Uint8Array([1,2,3,4]),
  [0xDEF0]: new TextEncoder().encode("Hello world")
})

If you need help with the reader as well let me know.

101arrowz commented 3 years ago

And by the way, the fact that extra fields can't even be read by the gzip command line tool, or the Python standard library, should indicate that it's a very obscure feature :)

mStirner commented 3 years ago

Sorry for taking so long to respond. I tried your code, but I can't get it working.

marc@Workstation:~/projects/playground/test-gzip$ ll
total 40
drwxrwxr-x  4 marc marc 4096 Sep 16 18:37 ./
drwxr-xr-x 41 marc marc 4096 Sep 16 18:32 ../
-rw-rw-r--  1 marc marc  360 Sep 16 18:37 extra-header.tgz
-rw-rw-r--  1 marc marc 1615 Sep 16 18:35 index.js
drwxrwxr-x  3 marc marc 4096 Sep 16 18:35 input/
-rw-rw-r--  1 marc marc 7168 Sep 16 18:37 input.tar
drwxrwxr-x  3 marc marc 4096 Sep 16 18:37 node_modules/
-rw-rw-r--  1 marc marc  278 Sep 16 18:37 package.json
-rw-rw-r--  1 marc marc  356 Sep 16 18:37 package-lock.json
marc@Workstation:~/projects/playground/test-gzip$ tar xfvz extra-header.tgz 

gzip: stdin: invalid compressed data--format violated
tar: Child returned status 1
tar: Error is not recoverable: exiting now


I can't extract the modified gzip-compressed tar file.

Code I tried (index.js):

const { gzipSync } = require('fflate');

const fs = require("fs");

// This can only be called once on the gzipped buffer
const insertExtraFields = (data, fields) => {
    let fullLength = 2;
    for (const field in fields) {
        const si2 = field >> 8;
        if (!si2 || si2 > 255) {
            throw new Error('The extra IDs must be greater than 255 and less than 65536');
        }
        fullLength += fields[field].length + 4;
    }
    if (fullLength > 65535) {
        throw new Error(
            'The total size of the extra header is '
            + fullLength
            + ' bytes, but it must be at most 65535 bytes'
        );
    }
    const afterExtra = data.subarray(10);
    const out = new Uint8Array(10 + fullLength + afterExtra.length);
    out.set(data.subarray(0, 10));
    out.set(afterExtra, out.length - afterExtra.length);
    // Enable FEXTRA
    out[3] |= 4;
    out[10] = fullLength & 255;
    out[11] = fullLength >> 8;
    let b = 12;
    for (const field in fields) {
        out[b] = field & 255;
        out[b + 1] = field >> 8;
        const ext = fields[field];
        out[b + 2] = ext.length & 255;
        out[b + 3] = ext.length >> 8;
        out.set(ext, b + 4);
        b += ext.length + 4;
    }
    return out;
}

const yourData = fs.readFileSync("input.tar");

// The fields option is a mapping of numerical field IDs to buffers
const gzipWithExtraFields = insertExtraFields(gzipSync(yourData), {
    [0xABCD]: new Uint8Array([1, 2, 3, 4]),
    [0xDEF0]: new TextEncoder().encode("Hello world")
});

fs.writeFileSync("./extra-header.tgz", gzipWithExtraFields);

Rename input.gz to input.tar: I needed to change the file extension to upload it here; in fact it's just a *.tar file: input.gz

Output file created from the code above: extra-header.tar.gz

101arrowz commented 3 years ago

I made a mistake in the code. This should work:

const insertExtraFields = (data, fields) => {
    let fullLength = 0;
    for (const field in fields) {
        const si2 = field >> 8;
        if (!si2 || si2 > 255) {
            throw new Error('The extra IDs must be greater than 255 and less than 65536');
        }
        fullLength += fields[field].length + 4;
    }
    if (fullLength > 65535) {
        throw new Error(
            'The total size of the extra header is '
            + fullLength
            + ' bytes, but it must be at most 65535 bytes'
        );
    }
    const afterExtra = data.subarray(10);
    const out = new Uint8Array(12 + fullLength + afterExtra.length);
    out.set(data.subarray(0, 10));
    out.set(afterExtra, out.length - afterExtra.length);
    // Enable FEXTRA
    out[3] |= 4;
    out[10] = fullLength & 255;
    out[11] = fullLength >> 8;
    let b = 12;
    for (const field in fields) {
        out[b] = field & 255;
        out[b + 1] = field >> 8;
        const ext = fields[field];
        out[b + 2] = ext.length & 255;
        out[b + 3] = ext.length >> 8;
        out.set(ext, b + 4);
        b += ext.length + 4;
    }
    return out;
}
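
As a quick sanity check (just a sketch; yourData and fields stand in for your own input), you can round-trip the patched buffer through gunzipSync, since a spec-compliant gzip reader should skip the extra field when FEXTRA is set:

import { gzipSync, gunzipSync } from 'fflate';

// Sketch only: `yourData` (Uint8Array) and `fields` are placeholders for your own input.
const patched = insertExtraFields(gzipSync(yourData), fields);
const roundTripped = gunzipSync(patched);
console.log(roundTripped.length === yourData.length); // should print true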

mStirner commented 3 years ago

Seems to work. I can open the file just like any other gzip-compressed file. But I think there is something wrong with the codes (IDs):

As output I get garbage:

{ 'Í«': '\x01\x02\x03\x04', 'ðÞ': 'Hello world' }

Code to read the extra header/fields:

const fs = require("fs");

fs.readFile("./extra-header.tgz", (err, bytes) => {

    if (err) {
        console.log(err);
        process.exit(100);
    }

    console.log("bytes: %d", bytes.length);

    let header = bytes.slice(0, 10);
    let flags = header[3];
    let eFlags = header[8];
    let OS = header[9];

    console.log("Is gzip file:", header[0] === 31 && header[1] === 139);
    console.log("compress method:", header[2] === 8 ? "deflate" : "other");
    console.log("M-Date: %d%d%d%d", bytes[4], bytes[5], bytes[6], bytes[7]);
    console.log("OS", OS);
    console.log("flags", flags);
    console.log();

    // bitwise operation on header flags
    const FLAG_RESERVED_3 = (bytes[3] >> 7) & 1;
    const FLAG_RESERVED_2 = (bytes[3] >> 6) & 1;
    const FLAG_RESERVED_1 = (bytes[3] >> 5) & 1;
    const FLAG_COMMENT = (bytes[3] >> 4) & 1;
    const FLAG_NAME = (bytes[3] >> 3) & 1;
    const FLAG_EXTRA = (bytes[3] >> 2) & 1;
    const FLAG_CRC = (bytes[3] >> 1) & 1;
    const FLAG_TEXT = (bytes[3] >> 0) & 1;

    console.log("FLAG_RESERVED_3", FLAG_RESERVED_3);
    console.log("FLAG_RESERVED_2", FLAG_RESERVED_2);
    console.log("FLAG_RESERVED_1", FLAG_RESERVED_1);
    console.log("FLAG_COMMENT", FLAG_COMMENT);
    console.log("FLAG_NAME", FLAG_NAME);
    console.log("FLAG_EXTRA", FLAG_EXTRA);
    console.log("FLAG_CRC", FLAG_CRC);
    console.log("FLAG_TEXT", FLAG_TEXT);
    console.log();

    if (FLAG_EXTRA) {

        let len1 = bytes[10];
        let len2 = bytes[11];
        let length = len1 + (len2 * 256);

        console.log("Extra header length", length);

        // ------------------------

        let obj = {};

        let offset = 0;
        let extra = bytes.slice(12, 12 + length);

        while (length > offset) {

            let s1 = extra[offset + 0];
            let s2 = extra[offset + 1];
            let l1 = extra[offset + 2];
            let l2 = extra[offset + 3];

            let length = l1 + (l2 * 256);
            let end = offset + 4 + length;

            // read & store sub-header in object
            obj[`${String.fromCharCode(s1)}${String.fromCharCode(s2)}`] = extra.slice(offset + 4, end).toString();

            // set offset
            offset = end;

        }

        console.log(obj)

    } else {

        console.log("No extra header found")

    }

});

The code works; I tested it with this Perl script I found: https://stackoverflow.com/questions/47894136/how-to-create-a-gzip-file-w-fextra-fcomment-fields

use IO::Compress::Gzip qw(gzip $GzipError);

gzip \"payload" => "./extra-fields.tgz",
    Comment    => "This is a comment",
    ExtraField => [
        "cf" => "1",
        "cd" => "mySuperString",
        "ol" => "HelloWorld"
    ]
    or die "Cannot create gzip file: $GzipError";

If I try to use strings as IDs, I get this error from your code:

Error: The extra IDs must be greater than 255 and less than 65536

Why must these be integer IDs, and not "typically two ASCII letters with some mnemonic value" as described in the specification?

101arrowz commented 3 years ago

The reason it looks like garbage is that the IDs are not ASCII values. You actually can use ASCII letters with my code, but the specification does not limit the IDs to ASCII, so I made it more generic by letting them be numbers. Also, the second "character" (SI2, the most significant byte in the ID) cannot be zero because that value is reserved. So it's just easier from a coding point of view to take a number, but you can obviously use ASCII as well:

const asciiToID = str => {
  if (str.length != 2) throw new TypeError('extra ID must be two characters long');
  return str.charCodeAt(0) | (str.charCodeAt(1) << 8);
}

const gzipWithExtraFields = insertExtraFields(gzipSync(yourData), {
    [asciiToID('ab')]: new Uint8Array([1, 2, 3, 4]),
    [asciiToID('cd')]: new TextEncoder().encode("Hello world")
});
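
And for reading them back, the inverse conversion makes the numeric IDs readable again (a small sketch; idToAscii is just a helper name, not part of fflate):

const idToAscii = id =>
  String.fromCharCode(id & 255) + String.fromCharCode(id >> 8);

// e.g. when printing fields parsed out of the extra header:
console.log(idToAscii(asciiToID('ab'))); // 'ab'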

mStirner commented 3 years ago

Thank you very much :) The issue can be closed if you want.