Closed mStirner closed 3 years ago
I've seen this issue but have been quite busy of late: will look into implementing it when I have spare time.
Hey @101arrowz, no problem. I hope things calm down for you soon. :)
@101arrowz How are you doing? Everything fine? Anything new on this issue/feature?
Do you need the ability to read this extra field from fflate as well? If so, that's unfortunately almost certainly not going to happen: it would require too much added code for a rarely used feature. Instead, I can give you code to read/write extra fields in an existing GZIP file. If not, let me know and I'll implement it within the next 2-3 days.
In my opinion, reading is not that hard; you just have to parse the header a bit differently. But if you don't want to implement it, I can do it on my own if necessary.
I was more concerned about having to rewrite the code to support returning both the decompressed data and the headers, but I think I've come up with a solution. It should be implemented in 2-3 days, if you don't hear from me by then feel free to ping.
I've gotten a basic implementation working but it's added a pretty big chunk to bundle size for GZIP support (about 1kB, which is a lot when the total size for GZIP support was previously 5kB). If this were a compiled language, I could just add a feature flag and avoid adding the bloat unless it's needed, but GZIP is one of the most popular parts of the library and minimizing its bundle size is very important to me, so unfortunately I think it's not going to be released.
However, I just took another look at your initial message to me and it seems you want a CRC32 checksum of the uncompressed contents of the file, which is included by default in the spec. If so, you can just read the four little-endian bytes starting 8 bytes from the end of the file. For example:
import { gzipSync } from 'fflate';
// Data doesn't have to be from fflate, it just needs to be gzipped
const gzipData = gzipSync(someData);
// When you want to calculate the checksum:
const len = gzipData.length;
const checksum = (
  gzipData[len - 8]
  | (gzipData[len - 7] << 8)
  | (gzipData[len - 6] << 16)
  | (gzipData[len - 5] << 24)
) >>> 0;
console.log(checksum); // CRC32 of the uncompressed contents
By the way, most untar utilities that support .tar.gz
files will call the zlib
library to decompress the GZIP data, and thereby automatically verify that the CRC32 of the uncompressed data matches the CRC32 in the GZIP footer. So if you're doing this to avoid corruption, you don't need to worry; it's already handled for you.
So sorry for all the confusion and for taking so long on this. Let me know if you have any questions.
First, thanks for your effort and your time, I really appreciate it.
For me, the most important part was reading/writing the extra header field. For someone not familiar with the code, it's hard to understand why such a "simple" task needs this much extra code. I know it's not trivial, but it's not rocket science either. Again, I'm really happy that you tried and are willing to implement it.
What's so hard about returning the parsed header? What's so hard about creating/writing the header with the extra (or all) fields? If it's possible to return/read the complete header, then everything else shouldn't be hard at all. Some time ago I played with gzip, its header, and how to read it: https://stackoverflow.com/q/60790408/5781499
I'm sorry if this sounds rude or impolite. English is not my native language; I try to make my wishes/questions as clear and simple as possible.
I tried to do it myself, but it's hard for me to get into the source code.
Just to clarify, the implementation was very easy; it's the bundle size cost that isn't really acceptable. A lot of code is involved in adding support for this due to the heavy optimizations in fflate.
I can give you some code to inject your extra field into an existing gzip file if you'd like. It will be a tiny bit slower due to the need to copy the entire buffer but at least it will work.
Would be awesome if you could provide me some code.
In terms of size, don't you think it's a trade-off between size and full compatibility? I know it only matters to a few people, but it's part of the specification, which in my opinion should be fully implemented, no matter who uses it or how often. :)
import { gzipSync } from 'fflate';
// This can only be called once on the gzipped buffer
const insertExtraFields = (data, fields) => {
  let fullLength = 2;
  for (const field in fields) {
    const si2 = field >> 8;
    if (!si2 || si2 > 255) {
      throw new Error('The extra IDs must be greater than 255 and less than 65536');
    }
    fullLength += fields[field].length + 4;
  }
  if (fullLength > 65535) {
    throw new Error(
      'The total size of the extra header is '
      + fullLength
      + ' bytes, but it must be at most 65535 bytes'
    );
  }
  const afterExtra = data.subarray(10);
  const out = new Uint8Array(10 + fullLength + afterExtra.length);
  out.set(data.subarray(0, 10));
  out.set(afterExtra, out.length - afterExtra.length);
  // Enable FEXTRA
  out[3] |= 4;
  out[10] = fullLength & 255;
  out[11] = fullLength >> 8;
  let b = 12;
  for (const field in fields) {
    out[b] = field & 255;
    out[b + 1] = field >> 8;
    const ext = fields[field];
    out[b + 2] = ext.length & 255;
    out[b + 3] = ext.length >> 8;
    out.set(ext, b + 4);
    b += ext.length + 4;
  }
  return out;
}
// The fields option is a mapping of numerical field IDs to buffers
const gzipWithExtraFields = insertExtraFields(gzipSync(yourData), {
  [0xABCD]: new Uint8Array([1, 2, 3, 4]),
  [0xDEF0]: new TextEncoder().encode("Hello world")
});
If you need help with the reader as well let me know.
And by the way, the fact that extra fields can't even be read by the gzip
command line tool, or the Python standard library, should indicate that it's a very obscure feature :)
Sorry for the long time not responding. I tried your code, but I can't get it working.
marc@Workstation:~/projects/playground/test-gzip$ ll
total 40
drwxrwxr-x 4 marc marc 4096 Sep 16 18:37 ./
drwxr-xr-x 41 marc marc 4096 Sep 16 18:32 ../
-rw-rw-r-- 1 marc marc 360 Sep 16 18:37 extra-header.tgz
-rw-rw-r-- 1 marc marc 1615 Sep 16 18:35 index.js
drwxrwxr-x 3 marc marc 4096 Sep 16 18:35 input/
-rw-rw-r-- 1 marc marc 7168 Sep 16 18:37 input.tar
drwxrwxr-x 3 marc marc 4096 Sep 16 18:37 node_modules/
-rw-rw-r-- 1 marc marc 278 Sep 16 18:37 package.json
-rw-rw-r-- 1 marc marc 356 Sep 16 18:37 package-lock.json
marc@Workstation:~/projects/playground/test-gzip$ tar xfvz extra-header.tgz
gzip: stdin: invalid compressed data--format violated
tar: Child returned status 1
tar: Error is not recoverable: exiting now
I can't extract the modified gzip-compressed tar file.
Code I tried (index.js):
const { gzipSync } = require('fflate');
const fs = require("fs");
// This can only be called once on the gzipped buffer
const insertExtraFields = (data, fields) => {
  let fullLength = 2;
  for (const field in fields) {
    const si2 = field >> 8;
    if (!si2 || si2 > 255) {
      throw new Error('The extra IDs must be greater than 255 and less than 65536');
    }
    fullLength += fields[field].length + 4;
  }
  if (fullLength > 65535) {
    throw new Error(
      'The total size of the extra header is '
      + fullLength
      + ' bytes, but it must be at most 65535 bytes'
    );
  }
  const afterExtra = data.subarray(10);
  const out = new Uint8Array(10 + fullLength + afterExtra.length);
  out.set(data.subarray(0, 10));
  out.set(afterExtra, out.length - afterExtra.length);
  // Enable FEXTRA
  out[3] |= 4;
  out[10] = fullLength & 255;
  out[11] = fullLength >> 8;
  let b = 12;
  for (const field in fields) {
    out[b] = field & 255;
    out[b + 1] = field >> 8;
    const ext = fields[field];
    out[b + 2] = ext.length & 255;
    out[b + 3] = ext.length >> 8;
    out.set(ext, b + 4);
    b += ext.length + 4;
  }
  return out;
}
const yourData = fs.readFileSync("input.tar");
// The fields option is a mapping of numerical field IDs to buffers
const gzipWithExtraFields = insertExtraFields(gzipSync(yourData), {
  [0xABCD]: new Uint8Array([1, 2, 3, 4]),
  [0xDEF0]: new TextEncoder().encode("Hello world")
});
fs.writeFileSync("./extra-header.tgz", gzipWithExtraFields);
Rename input.gz to input.tar: I needed to change the file extension to upload it here; in fact it's just a *.tar file: input.gz
Output file created from the code above: extra-header.tar.gz
I made a mistake in the code. This should work:
const insertExtraFields = (data, fields) => {
  let fullLength = 0;
  for (const field in fields) {
    const si2 = field >> 8;
    if (!si2 || si2 > 255) {
      throw new Error('The extra IDs must be greater than 255 and less than 65536');
    }
    fullLength += fields[field].length + 4;
  }
  if (fullLength > 65535) {
    throw new Error(
      'The total size of the extra header is '
      + fullLength
      + ' bytes, but it must be at most 65535 bytes'
    );
  }
  const afterExtra = data.subarray(10);
  const out = new Uint8Array(12 + fullLength + afterExtra.length);
  out.set(data.subarray(0, 10));
  out.set(afterExtra, out.length - afterExtra.length);
  // Enable FEXTRA
  out[3] |= 4;
  out[10] = fullLength & 255;
  out[11] = fullLength >> 8;
  let b = 12;
  for (const field in fields) {
    out[b] = field & 255;
    out[b + 1] = field >> 8;
    const ext = fields[field];
    out[b + 2] = ext.length & 255;
    out[b + 3] = ext.length >> 8;
    out.set(ext, b + 4);
    b += ext.length + 4;
  }
  return out;
}
Seems to work. I can open the file just like any other gzip-compressed file. But I think there is something wrong with the IDs:
As output I get garbage:
{ 'Í«': '\x01\x02\x03\x04', 'ðÞ': 'Hello world' }
Code to read the extra header/fields:
const fs = require("fs");
fs.readFile("./extra-header.tgz", (err, bytes) => {
  if (err) {
    console.log(err);
    process.exit(100);
  }
  console.log("bytes: %d", bytes.length);
  let header = bytes.slice(0, 10);
  let flags = header[3];
  let eFlags = header[8];
  let OS = header[9];
  console.log("Is gzip file:", header[0] === 31 && header[1] === 139);
  console.log("compress method:", header[2] === 8 ? "deflate" : "other");
  // MTIME is a 4-byte little-endian unix timestamp
  console.log("M-Date: %d", bytes[4] | (bytes[5] << 8) | (bytes[6] << 16) | (bytes[7] << 24));
  console.log("OS", OS);
  console.log("flags", flags);
  console.log();
  // bitwise operations on the header flags
  const FLAG_RESERVED_3 = (bytes[3] >> 7) & 1;
  const FLAG_RESERVED_2 = (bytes[3] >> 6) & 1;
  const FLAG_RESERVED_1 = (bytes[3] >> 5) & 1;
  const FLAG_COMMENT = (bytes[3] >> 4) & 1;
  const FLAG_NAME = (bytes[3] >> 3) & 1;
  const FLAG_EXTRA = (bytes[3] >> 2) & 1;
  const FLAG_CRC = (bytes[3] >> 1) & 1;
  const FLAG_TEXT = (bytes[3] >> 0) & 1;
  console.log("FLAG_RESERVED_3", FLAG_RESERVED_3);
  console.log("FLAG_RESERVED_2", FLAG_RESERVED_2);
  console.log("FLAG_RESERVED_1", FLAG_RESERVED_1);
  console.log("FLAG_COMMENT", FLAG_COMMENT);
  console.log("FLAG_NAME", FLAG_NAME);
  console.log("FLAG_EXTRA", FLAG_EXTRA);
  console.log("FLAG_CRC", FLAG_CRC);
  console.log("FLAG_TEXT", FLAG_TEXT);
  console.log();
  if (FLAG_EXTRA) {
    let len1 = bytes[10];
    let len2 = bytes[11];
    let length = len1 + (len2 * 256);
    console.log("Extra header length", length);
    // ------------------------
    let obj = {};
    let offset = 0;
    let extra = bytes.slice(12, 12 + length);
    while (length > offset) {
      let s1 = extra[offset + 0];
      let s2 = extra[offset + 1];
      let l1 = extra[offset + 2];
      let l2 = extra[offset + 3];
      let fieldLength = l1 + (l2 * 256);
      let end = offset + 4 + fieldLength;
      // read & store sub-header in object
      obj[`${String.fromCharCode(s1)}${String.fromCharCode(s2)}`] = extra.slice(offset + 4, end).toString();
      // advance the offset
      offset = end;
    }
    console.log(obj);
  } else {
    console.log("No extra header found");
  }
});
The code works; I tested it against this Perl script I found: https://stackoverflow.com/questions/47894136/how-to-create-a-gzip-file-w-fextra-fcomment-fields
use IO::Compress::Gzip qw(gzip $GzipError);
gzip \"payload" => "./extra-fields.tgz",
  Comment => "This is a comment",
  ExtraField => [
    "cf" => "1",
    "cd" => "mySuperString",
    "ol" => "HelloWorld"
  ]
  or die "Cannot create gzip file: $GzipError";
If I try to use a string as an ID, I get this error from your code:
Error: The extra IDs must be greater than 255 and less than 65536
Why must these be integer IDs, and not "typically two ASCII letters with some mnemonic value" as described in the specification?
The reason it looks like garbage is that the IDs are not ASCII values. You actually can use ASCII letters with my code, but the specification does not limit the IDs to ASCII, so I made it more generic by accepting a number. Also, the second "character" (SI2, the most significant byte of the ID) cannot be zero, because that value is reserved. So it's just easier from a coding point of view to take a number, but you can of course use ASCII as well:
const asciiToID = str => {
  if (str.length != 2) throw new TypeError('extra ID must be two characters long');
  return str.charCodeAt(0) | (str.charCodeAt(1) << 8);
}
const gzipWithExtraFields = insertExtraFields(gzipSync(yourData), {
  [asciiToID('ab')]: new Uint8Array([1, 2, 3, 4]),
  [asciiToID('cd')]: new TextEncoder().encode("Hello world")
});
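For the reverse direction when reading, the numeric ID can be unpacked back into its two-character mnemonic. A small sketch (`idToAscii` is a name introduced here, not part of fflate):

```javascript
// asciiToID packs two characters into a little-endian 16-bit ID;
// idToAscii unpacks it again (low byte is the first character).
const asciiToID = str => {
  if (str.length != 2) throw new TypeError('extra ID must be two characters long');
  return str.charCodeAt(0) | (str.charCodeAt(1) << 8);
};
const idToAscii = id => String.fromCharCode(id & 255, id >> 8);

console.log(idToAscii(asciiToID('cd'))); // 'cd'
```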
Thank you very much :) The issue can be closed if you want to.
It would be awesome to add support for reading & writing the "extra fields" from the specification: https://datatracker.ietf.org/doc/html/rfc1952#page-8
Thanks in advance!