circulosmeos / gztool

extract random-positioned data from gzip files with no penalty, including gzip tailing like with 'tail -f' !
https://circulosmeos.wordpress.com/2019/08/11/continuous-tailing-of-a-gzip-file-efficiently/

request: byte-aligned index blocks for low-tech zlib inflater #18

Closed jnorthrup closed 1 year ago

jnorthrup commented 1 year ago

I am able to read the created index files using a JDK client, prime the window as needed, and position the streams where they need to be.

The wall I run into is that zlib's inflatePrime function is absent from non-C libraries; this is true of at least 3 ports, including the official Oracle one.

Index points with a non-zero bit offset occur in roughly 7 of 8 cases, as shown below:

[image]

In gzindex the indexes are not stored to disk; it is just a minimal unit test of what gztool does. Its point struct stores the first 2 offsets in bits, not bytes.
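To make the bit-offset problem concrete, here is a minimal sketch of how a reader could split a bit offset into the pieces an inflatePrime-style call needs. The names `bit_position`, `split_bit_offset` and `prime_value` are hypothetical helpers for illustration, not part of gzindex, gztool or zlib; the only assumption about the format is that deflate fills each byte starting from the least significant bit.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type for illustration; not from gzindex or gztool. */
typedef struct {
    uint64_t byte_offset; /* file offset of the byte that holds the first bit */
    unsigned skip_bits;   /* low bits of that byte used by the previous block, 0..7 */
    unsigned prime_bits;  /* high bits of that byte that belong to the new block, 0..7 */
} bit_position;

/* Split an offset expressed in bits into a byte offset plus bit counts. */
bit_position split_bit_offset(uint64_t head_bits)
{
    bit_position p;
    p.byte_offset = head_bits >> 3;
    p.skip_bits = (unsigned)(head_bits & 7);
    p.prime_bits = (8 - p.skip_bits) & 7;
    return p;
}

/* Deflate packs bits LSB first, so the bits that belong to the new block
 * are the high bits of the shared byte: shift the consumed low bits away. */
unsigned prime_value(unsigned char first_byte, unsigned skip_bits)
{
    assert(skip_bits < 8);
    return (unsigned)first_byte >> skip_bits;
}
```

When `skip_bits` is 0 the point is byte-aligned and a reader can start feeding input at `byte_offset` with no priming at all. In every other case the inflater has to accept `prime_bits` bits of value `prime_value(...)` before the first whole byte at `byte_offset + 1`, which is exactly the step that needs inflatePrime.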

I modified the loop conditionals of gzindex as shown below to reduce the input read size to 1 byte and keep iterating the loop until it arrives at a byte-aligned block boundary. I'm guessing this makes the block boundary slightly stochastic, up to an average of 4 bytes of variance. With gztool this isn't a simple modification.

diff --git a/gzindex.c b/gzindex.c
--- a/gzindex.c (revision f1b7696c1e4757a7201009a2f3e02ed9e3536a56)
+++ b/gzindex.c (revision 662eb8434ed5c3d18e4673621aafd9e0feb415bf)
@@ -207,6 +207,7 @@
     unsigned char *out, *out2;
     z_stream strm;
     unsigned char in[16384];
+    size_t input_stride = sizeof(in);

     /* position input file */
     ret = fseeko(gz, offset, SEEK_SET);
@@ -273,7 +274,7 @@
         do {
             /* if needed, get more input data */
             if (strm.avail_in == 0) {
-                strm.avail_in = fread(in, 1, sizeof(in), gz);
+                strm.avail_in = fread(in, 1, input_stride, gz);
                 if (ferror(gz)) {
                     (void)inflateEnd(&strm);
                     free(list);
@@ -304,6 +305,12 @@

             /* if at a block boundary, note the location of the header */
             if (strm.data_type & 128) {
+                out_alignment = (pos - strm.avail_in) & 7;
+                if (out_alignment) {
+                    input_stride = 1;
+                } else {
+                    input_stride = sizeof(in);
+                }
                 head = ((pos - strm.avail_in) << 3) - (strm.data_type & 63);
                 last = strm.data_type & 64; /* true at end of last block */
             }
...
        } while (strm.avail_out != 0 && !last && out_alignment); // keeps reading 1 byte at end of block-read until alignment
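
As a point of comparison, here is a small sketch of an alignment test based only on `strm.data_type`, going by the zlib manual's description of that field after a Z_BLOCK call: the low bits hold the number of unused bits in the last input byte, 64 is added while decoding the last block, and 128 is added right after an end-of-block code. The names `block_state`, `decode_data_type` and `is_aligned_block_boundary` are hypothetical. Note that this tests the bit count rather than the byte position used in the diff above, so it is an alternative reading, not a restatement of that change.

```c
#include <stdbool.h>

/* Hypothetical decoded view of zlib's strm.data_type after inflate(Z_BLOCK). */
typedef struct {
    unsigned unused_bits; /* unused bits in the last input byte taken, 0..7 */
    bool in_last_block;   /* the 64 flag: decoding the final deflate block */
    bool at_block_end;    /* the 128 flag: returned right after end-of-block */
} block_state;

block_state decode_data_type(int data_type)
{
    block_state s;
    s.unused_bits = (unsigned)data_type & 7u;
    s.in_last_block = (data_type & 64) != 0;
    s.at_block_end = (data_type & 128) != 0;
    return s;
}

/* True when the next block starts on a clean byte boundary, so an index
 * point taken here would need no bit priming on the reader side. */
bool is_aligned_block_boundary(int data_type)
{
    block_state s = decode_data_type(data_type);
    return s.at_block_end && !s.in_last_block && s.unused_bits == 0;
}
```

An indexer could call `is_aligned_block_boundary(strm.data_type)` where the diff sets `out_alignment` and only record a point when it returns true.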
circulosmeos commented 1 year ago

I think I can study the addition of a new argument to adjust index points to a zero bit boundary... It should not be as difficult as it seems

jnorthrup commented 1 year ago

I posted a PR to zran.c in the zlib repo. It required something like 5 comments and a few lines of code. I removed the bits, which didn't go over too well with Mark, but he rewrote it and published a new version of zran.c.

https://github.com/madler/zlib/pull/801/files

jnorthrup commented 1 year ago

gztool is feeding multiple use cases through the same code flows, so I understand the added complexity.

I can get my needs met adequately by making a sort of jzran (actually kzran, in Kotlin) that is the C code called through FFI: https://github.com/jnorthrup/TrikeShed/blob/f7f31058c99f8ab2a7aa07c3bf834d76850b0d73/src/nativeMain/kotlin/borg/trikeshed/tilting/zran/kzran.kt#L113

The urgency of this issue is way down from before.

circulosmeos commented 1 year ago

I've just released v1.6.0, which implements the -Z option to create index points that are always adjusted to a clean byte boundary.

jnorthrup commented 1 year ago

This looks like a good thing. The window can be loaded simply, followed by an input stream seek, and inflating can then resume with the simplest possible inflater.

jnorthrup commented 1 year ago

Closing the issue, assuming that bugs with -Z, if any, can be handled in some other future issue.