circulosmeos / gztool

extract random-positioned data from gzip files with no penalty, including gzip tailing like with 'tail -f' !
https://circulosmeos.wordpress.com/2019/08/11/continuous-tailing-of-a-gzip-file-efficiently/

gztool

GZIP files indexer, compressor and data retriever. Create small indexes for gzipped files and use them for quick and random data extraction. No more waiting when the end of a 10 GiB gzip is needed!

See Installation for Ubuntu, the Release page for executables for your platform, and Compilation in case you want to compile the tool.

Also, a magic file is provided so that the Linux file command correctly identifies gztool's index files: append it to your /etc/magic file (or overwrite that file if it is empty), or append/copy it to your home directory as ~/.magic (note the leading dot in the name).

Considerations

Note that gztool creates the index interleaved with data extraction (-b), so in practice no time is wasted. If data extraction or index creation is stopped at any moment, gztool will reuse the partial index left on disk on the next run over the same data, so time consumption is always minimized.

gztool can also monitor a growing gzip file (for example, a log created by rsyslog directly in gzip format) and generate the index on the fly, thus reducing index creation time to practically zero. See the -S (Supervise) option.

Note that the size of the index depends on the span between index points on the uncompressed stream, which is 10 MiB by default. This means that retrieving randomly situated data requires decompressing only 10/2 = 5 MiB of uncompressed data on average, no matter the size of the gzip file, which is a fairly low value!
The span between index points can be adjusted with the -s (span) option (the minimum is -s 1, or 1 MiB).
For example, a span of -s 20 will create indexes half the size, and -s 5 will create indexes twice as big.
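
The arithmetic above can be sketched in Python. This is only an illustration: the function names are hypothetical, and it assumes the target position is uniformly distributed between index points and that index size scales with the number of index points.

```python
def average_decompression_mib(span_mib: float) -> float:
    """Mean uncompressed MiB to inflate before reaching a random target byte.

    Assumes the target falls uniformly between two consecutive index points.
    """
    return span_mib / 2


def relative_index_size(span_mib: float, default_span_mib: float = 10) -> float:
    """Index size relative to the default span.

    Assumes index size is proportional to the number of index points.
    """
    return default_span_mib / span_mib


if __name__ == "__main__":
    for span in (5, 10, 20):
        print(span, average_decompression_mib(span), relative_index_size(span))
```

With these assumptions, -s 20 gives a relative size of 0.5 and -s 5 gives 2.0, matching the text.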

Background

By default, gzip-compressed files cannot be accessed randomly: reading the byte at position N requires decompressing the gzip file from the beginning up to the Nth byte.
Nonetheless Mark Adler, the author of zlib, provided years ago a cryptic file named zran.c that creates an "index" of "windows" filled with 32 kiB of uncompressed data at different positions along the un/compressed file, which can be used to initialize the zlib library and make it behave as if the compressed data began there.
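
The window idea can be illustrated with Python's zlib module. This is a simplified sketch, not zran.c's method: it creates a byte-aligned entry point with Z_SYNC_FLUSH while compressing, whereas zran.c finds entry points at deflate block boundaries in existing files and also has to handle bit offsets. The helper names are hypothetical.

```python
import zlib

WINDOW = 32 * 1024  # deflate back-references reach at most 32 KiB behind


def build_stream(part_a: bytes, part_b: bytes):
    """Compress part_a + part_b as one raw deflate stream.

    Returns (stream, entry_offset). Z_SYNC_FLUSH ends the current block on a
    byte boundary but keeps the compressor history, so part_b may still
    reference data from part_a.
    """
    comp = zlib.compressobj(level=6, method=zlib.DEFLATED, wbits=-15)
    head = comp.compress(part_a) + comp.flush(zlib.Z_SYNC_FLUSH)
    tail = comp.compress(part_b) + comp.flush(zlib.Z_FINISH)
    return head + tail, len(head)


def resume(stream: bytes, entry_offset: int, window: bytes) -> bytes:
    """Inflate from entry_offset, priming zlib with the preceding window."""
    inflater = zlib.decompressobj(wbits=-15, zdict=window)
    return inflater.decompress(stream[entry_offset:])


if __name__ == "__main__":
    part_a = b"".join(b"log line %d: status ok\n" % i for i in range(5000))
    part_b = b"".join(b"log line %d: status ok\n" % i for i in range(5000, 5100))
    stream, offset = build_stream(part_a, part_b)
    # The "index point": a compressed offset plus the last 32 KiB of output.
    restored = resume(stream, offset, part_a[-WINDOW:])
    print(restored == part_b)
```

An index built this way only needs to store, per point, the compressed offset and the window that precedes it.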

gztool builds upon zran.c to provide a useful command line tool.
Also, some optimizations and brand new behaviours have been added:

Installation

Compilation

The zlib archive library (libz.a) is needed in order to compile gztool. On Debian/Ubuntu the package providing it is zlib1g-dev (this may vary on your system):

$ sudo apt-get install zlib1g-dev

$ gcc -O3 -o gztool gztool.c -lz -lm

If you wish you can use autoconf to check the dependencies, build and test gztool:

$ autoreconf && ./configure && make check

This will produce a binary named gztool.

Compilation in Windows

Compilation in Windows is possible using gcc for Windows and compiling the original zlib code to obtain the needed archive library libz.a.
Please, note that executables for different platforms are provided on the Release page.

Usage

  gztool (v1.6.1)
  GZIP files indexer, compressor and data retriever.
  Create small indexes for gzipped files and use them
  for quick and random-positioned data extraction.
  No more waiting when the end of a 10 GiB gzip is needed!
  //github.com/circulosmeos/gztool (by Roberto S. Galende)

  $ gztool [-[abLnsv] #] [-[1..9]AcCdDeEfFhilpPrRStTwWxXzZ|u[cCdD]] [-I <INDEX>] <FILE>...

  Note that actions `-bcStT` proceed to an index file creation (if
  none exists) INTERLEAVED with data flow. As data flow and
  index creation occur at the same time there's no waste of time.
  Also you can interrupt actions at any moment and the remaining
  index file will be reused (and completed if necessary) on the
  next gztool run over the same data.

 -[1..9]: Factor of compression to use with `-[c|u[cC]]`, from
     best speed (`-1`) to best compression (`-9`). Default is `-6`.
 -a #: Await # seconds between reads when `-[ST]|Ec`. Default is 4 s.
 -A: Modifier for `-[rR]` to indicate the range of bytes/lines in
     absolute values, instead of the default incremental values.
 -b #: extract data from indicated uncompressed byte position of
     gzip file (creating or reusing an index file) to STDOUT.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -C: always create a 'Complete' index file, ignoring possible errors.
 -c: compress a file like with gzip, creating an index at the same time.
 -d: decompress a file like with gzip.
 -D: do not delete original file when using `-[cd]`.
 -e: if multiple files are indicated, continue on error (if any).
 -E: end processing on first GZIP end of file marker at EOF.
     Nonetheless with `-c`, `-E` waits for more data even at EOF.
 -f: force file overwriting if index file already exists.
 -F: force index creation/completion first, and then action: if
     `-F` is not used, index is created interleaved with actions.
 -h: print brief help; `-hh` prints this help.
 -i: create index for indicated gzip file (For 'file.gz' the default
     index file name will be 'file.gzi'). This is the default action.
 -I string: index file name will be the indicated string.
 -l: check and list info contained in indicated index file.
     `-ll` and `-lll` increase the level of index checking detail.
 -L #: extract data from indicated uncompressed line position of
     gzip file (creating or reusing an index file) to STDOUT.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -n #: indicates that the first byte on compressed input is #, not 1,
     and so truncated compressed inputs can be used if an index exists.
 -p: indicates that the gzip input stream may be composed of various
     incorrectly terminated GZIP streams, and so then a careful
     Patching of the input may be needed to extract correct data.
 -P: like `-p`, but when used with `-[ST]` implies that checking
     for errors in stream is made as quick as possible as the gzip file
     grows. Warning: this may lead to some errors not being patched.
 -r #: (range): Number of bytes to extract when using `-[bL]`.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -R #: (Range): Number of lines to extract when using `-[bL]`.
     Accepts '0', '0x', and suffixes 'kmgtpe' (^10) 'KMGTPE' (^2).
 -s #: span in uncompressed MiB between index points when
     creating the index. By default is `10`.
 -S: Supervise indicated file: create a growing index,
     for a still-growing gzip file. (`-i` is implicit).
 -t: tail (extract last bytes) to STDOUT on indicated gzip file.
 -T: tail (extract last bytes) to STDOUT on indicated still-growing
     gzip file, and continue Supervising & extracting to STDOUT.
 -u [cCdD]: utility to compress (`-u c`) or decompress (`-u d`)
          zlib-format files to STDOUT. Use `-u C` and `-u D`
          to manage raw compressed files. No index involved.
 -v #: output verbosity: from `0` (none) to `5` (nuts).
     Default is `1` (normal).
 -w: wait for creation if file doesn't exist, when using `-[cdST]`.
 -W: do not Write index to disk. But if one is already available
     read and use it. Useful if the index is still under an `-S` run.
 -x: create index with line number information (win/*nix compatible).
     (Index counts last line even w/o newline char (`wc` does not!)).
     This is implicit unless `-X` or `-z` are indicated.
 -X: like `-x`, but newline character is '\r' (old mac).
 -z: create index without line number information.
 -Z: adjust index points to a byte boundary: no previous byte needed.

  EXAMPLE: Extract data from 1 GiB byte (byte 2^30) on,
  from `myfile.gz` to the file `myfile.txt`. Also gztool will
  create (or reuse, or complete) an index file named `myfile.gzi`:
  $ gztool -b 1G myfile.gz > myfile.txt
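
The numeric arguments of -b, -L, -r and -R accept the prefixes and suffixes listed in the help text. The Python sketch below shows one plausible reading of that notation: a leading 0 means octal, 0x means hexadecimal, lowercase suffixes are powers of 10 and uppercase suffixes are powers of 2 (consistent with 1G = 2^30 in the example). It is not gztool's parser: parse_size is a hypothetical name, and ignoring suffixes on hexadecimal values is an assumption made here because 'e' and 'E' are also hex digits.

```python
# Suffix tables: 'kmgtpe' as powers of 10, 'KMGTPE' as powers of 2.
DECIMAL = {s: 10 ** (3 * (i + 1)) for i, s in enumerate("kmgtpe")}
BINARY = {s: 2 ** (10 * (i + 1)) for i, s in enumerate("KMGTPE")}


def parse_size(text: str) -> int:
    """Convert strings such as '1G', '10m', '0x10' or '010' to an integer."""
    if text.lower().startswith("0x"):
        # Assumption: no suffix handling for hex, since e/E are hex digits.
        return int(text, 16)
    multiplier = 1
    if text[-1:] in DECIMAL:
        multiplier, text = DECIMAL[text[-1]], text[:-1]
    elif text[-1:] in BINARY:
        multiplier, text = BINARY[text[-1]], text[:-1]
    # A leading zero on a multi-digit number is read as octal.
    base = 8 if len(text) > 1 and text.startswith("0") else 10
    return int(text, base) * multiplier


if __name__ == "__main__":
    for arg in ("1G", "10m", "0x10", "010", "0"):
        print(arg, parse_size(arg))
```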

Please, note that STDOUT is used for data extraction with -bLtTu modifiers.

Examples of use

Extraction creates, as usual, the index file compressed_text_file.gzi. To avoid creating it, -W (do not Write index) can be used:

    $ gztool -pWb0 compressed_text_file.gz

Note that -p can require up to twice the usual decompression time, because it runs two decompression passes: the usual one, and another that runs ahead of it to detect errors, mark them, and find new entry points at which to end/begin decompression, thus circumventing the problems.

Note also that these corrupted gzip files should always be decompressed with the -p parameter, even if a gztool index file exists for them, because the index file stores entry points but does not store where errors occur in the gzip file. That said, if the -[bL] point of extraction is beyond the point(s) of error in the gzip file and an index file exists, then decompression can proceed fine without -p, as the index points stored in the index file are always clean.

The same applies to -S though in this case there's no output, as only the index is being constructed:

    $ gztool -PS application_log.gz
    ...
    PATCHING: Gzip stream error found @ 15745693 Byte.
    PATCHING WARNING:
        Data extracted around the patching point may overlap.
    PATCHING: New valid gzip full flush found @ 15700759 Byte.
    ...

If gztool finds the gzip file companion of the index file, some statistics are shown, like the index/gzip size ratio, or the ratio of compression of the gzip file.

Also, if the gzip is complete, the size of the uncompressed data is shown. This number is interesting if the gzip file is bigger than 4 GiB, in which case gunzip -l cannot correctly calculate it as it is limited to a 32 bit counter, or if the gzip file is in bgzip format, in which case gunzip -l would only show data about the first block (< 64 kiB).
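
The 32-bit limit comes from the gzip format itself: per RFC 1952, the last four bytes of a gzip member (ISIZE) hold the uncompressed size modulo 2^32, in little-endian order. The Python sketch below reads that field and shows the wrap-around for a 5 GiB input. The function name is hypothetical.

```python
import gzip
import struct


def gzip_trailer_isize(gz_bytes: bytes) -> int:
    """Return ISIZE, the little-endian 32-bit size field ending a gzip member."""
    return struct.unpack("<I", gz_bytes[-4:])[0]


if __name__ == "__main__":
    data = b"x" * 100_000
    print(gzip_trailer_isize(gzip.compress(data)))  # 100000

    # A 32-bit counter can only hold the size modulo 2^32:
    five_gib = 5 * 2 ** 30
    print(five_gib % 2 ** 32)  # 1073741824, i.e. 1 GiB instead of 5 GiB
```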

In this latter case only a pair of index+gzip filenames can be indicated with each use.

Take into account that the first byte of the truncated gzip_filename.gz file is numbered 100001: bytes retain the position they had in the original file (which is why it is not byte 1).

Please note that index point positions in the index file may also require the previous byte to be available in the truncated gzip file, as a gzip stream is not byte-aligned but a stream of pure bits. Thus, if you're thinking of truncating a gzip file, always do it at least one byte before the indicated index point in the gzip: as said, that byte may not be needed, but in 7 of 8 cases it is. Another option is to use -Z when creating the index, as indicated below.

-Z exists since gztool v1.6.0.
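
The "7 of 8" figure can be illustrated with a short Python sketch. It assumes each index point records a bit count in the way zran.c describes (0 when the point is byte-aligned, otherwise 1 to 7 bits taken from the preceding byte), and that all eight values are equally likely. The function name and the offset 1000 are hypothetical.

```python
def first_byte_needed(point_offset: int, bits: int) -> int:
    """Earliest compressed byte required to start inflating at an index point.

    point_offset: compressed offset of the index point.
    bits: 0 if the point is byte-aligned, else 1..7 bits that must be taken
          from the previous byte.
    """
    if not 0 <= bits <= 7:
        raise ValueError("bits must be in 0..7")
    return point_offset - 1 if bits else point_offset


if __name__ == "__main__":
    needing_previous = sum(
        1 for bits in range(8) if first_byte_needed(1000, bits) == 999
    )
    print(f"{needing_previous} of 8 bit offsets need the previous byte")
```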

Index file format

Index files are created by default with extension '.gzi' appended to the original file name of the gzipped file:

filename.gz     ->     filename.gzi

If the original file doesn't have ".gz" extension, ".gzi" will be appended - for example:

filename.tgz     ->     filename.tgz.gzi
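
The naming rule can be written as a small Python helper. This is a sketch of the rule as stated here: the function name is hypothetical, and treating the ".gz" match as case-sensitive is an assumption.

```python
def default_index_name(gzip_name: str) -> str:
    """Return the default index file name for a gzip file name."""
    if gzip_name.endswith(".gz"):
        # 'filename.gz' -> 'filename.gzi'
        return gzip_name[: -len(".gz")] + ".gzi"
    # Any other extension: just append '.gzi'
    return gzip_name + ".gzi"


if __name__ == "__main__":
    for name in ("filename.gz", "filename.tgz"):
        print(name, "->", default_index_name(name))
```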

There's a special header to mark ".gzi" files as index files usable for this app:

+-----------------+-----------------+
|   0x0 64 bits   |    "gzipindx"   |     ~     16 bytes = 128 bits
+-----------------+-----------------+

This is the "version 0" header, that is, it does not contain line information. The header indicating that the index contains line information is the "version 1" header, which differs only in the capital "X" (in this case each index point record contains an additional 64-bit counter to take lines into account). Future versions (if any) would use the "gzipindx" string with lower-case and capital letters following a binary count, as if the letters were binary digits.

+-----------------+-----------------+
|   0x0 64 bits   |    "gzipindX"   |     version 1 header (index was created with `-[xX]` parameter)
+-----------------+-----------------+
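
A reader for this 16-byte header could look like the Python sketch below. It checks only the two fields shown in the diagrams; the function and constant names are hypothetical.

```python
HEADER_SIZE = 16
MAGIC_VERSIONS = {b"gzipindx": 0, b"gzipindX": 1}


def index_header_version(header: bytes) -> int:
    """Return 0 or 1 for a gztool index header, or raise ValueError."""
    if len(header) < HEADER_SIZE:
        raise ValueError("header shorter than 16 bytes")
    if header[:8] != bytes(8):
        raise ValueError("first 64 bits are not zero")
    try:
        return MAGIC_VERSIONS[header[8:16]]
    except KeyError:
        raise ValueError("unknown magic string") from None


if __name__ == "__main__":
    print(index_header_version(bytes(8) + b"gzipindx"))  # 0
    print(index_header_version(bytes(8) + b"gzipindX"))  # 1
```

In practice the 16 bytes would come from the start of a .gzi file opened in binary mode.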

Note that this header has been built so that this format is "compatible" with index files generated for bgzip-compressed files. bgzip files are totally compatible with gzip: they are simply made so that the zlib library is restarted every 64 kiB of uncompressed data, so they are composed of independent gzipped blocks one after another. The bgzip command can create index files for bgzipped files in less time and with much less space than this tool, as they're already almost random-access-capable. The first 64-bit value of a bgzip index file is the count of index pairs that follow, so with this 0x0 header gztool index files can be ignored by the bgzip command, and bgzipped and gzipped files and their indexes can live in the same folder without collision.

All numbers are stored in big-endian byte order (platform independently). Big-endian numbers are easier to interpret than little-endian ones when inspecting them with a hex dump tool (like od, for example).

Next, an almost one-to-one copy of struct access is serialized to the file. access->have and access->size are both written even though they'd always be equal. If the index file is generated with -S or -T on a still-growing gzip file (or the index somehow hasn't been completed because the gzip data was still incomplete), the values on disk for access->have and access->size will be respectively 0x0..0 and "number of actual index points written" (both uint64_t) to mark this fact.

access->size MAY be UINT64_MAX, to avoid the need to rewrite this value as index points are added to the file: as the index is incremental, the number of points can be determined by reading the index until EOF. access->have MAY also be greater than zero but lower than access->size: this can occur when an already finished index is extended with new points (the source gzip may have grown). This is also considered an incomplete index: when the index is correctly closed both numbers will have the same value (a Ctrl+C before that would leave the index "incomplete", but usable for later runs in which it can be finished).
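
The cases described above can be summarized in a Python sketch. It assumes that access->have and access->size immediately follow the 16-byte header as two big-endian 64-bit values. The text suggests this layout, but the source code should be checked for the exact format. All names are hypothetical, and treating a 0/0 pair as incomplete is a conservative choice made here.

```python
import struct

UINT64_MAX = 2 ** 64 - 1
HEADER_SIZE = 16  # 64 zero bits + 8-byte magic string


def read_counters(index_bytes: bytes):
    """Return (have, size), assumed to follow the header as big-endian uint64."""
    return struct.unpack_from(">QQ", index_bytes, HEADER_SIZE)


def index_state(have: int, size: int) -> str:
    """Classify an index from its two counters, following the rules above."""
    if size == UINT64_MAX:
        return "incomplete (count points by reading until EOF)"
    if have == size and have > 0:
        return "complete"
    if have == 0:
        return "incomplete (growing index)"
    if have < size:
        return "incomplete (finished index being extended)"
    return "unexpected"


if __name__ == "__main__":
    header = bytes(8) + b"gzipindX"
    for have, size in ((12, 12), (0, 7), (0, UINT64_MAX), (12, 15)):
        blob = header + struct.pack(">QQ", have, size)
        print(read_counters(blob), index_state(*read_counters(blob)))
```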

After that comes all the struct point data. As previously said, windows are stored compressed, so each one is preceded by a 32-bit field with its length. Note that an index point with a window of size zero is possible.

After all the struct point structures data, the original uncompressed data size of the gzipped file is stored (64 bits).

Please note that not all stored numbers are 64-bit long. This is because some counters will always fit in less length. Refer to code.

With 64 bit long numbers, the index could potentially manage files up to 2^64 = 16 EiB (16 777 216 TiB).

Line number counting

Regarding line number counting (-[xX]), note that gztool's index counts the last line in the uncompressed data even if the last char isn't a newline char, whilst the wc command will not count it in this case! Nonetheless, line counting when extracting data with -[bLtT] does follow the wc convention, so that gztool's output info and wc counts do not differ by +/-1.

Also note that line counting when a gzip file / index file isn't yet complete always starts at 1. This is coherent with the previous statement, and it's also reasonable because if line counting is activated (-[xX]) there'll presumably be lines beautifully ending with a newline char (or chars in the case of Windows: CR+LF) somewhere in the string.
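
The difference between the two counting conventions can be shown with a short Python sketch. It models only the '\n' case (-x), not the '\r' case of -X, and the function names are hypothetical.

```python
def wc_lines(data: bytes) -> int:
    """Count lines as wc -l does: the number of newline characters."""
    return data.count(b"\n")


def index_lines(data: bytes) -> int:
    """Count lines as described for the index: a final line without a
    trailing newline is counted too."""
    lines = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        lines += 1
    return lines


if __name__ == "__main__":
    for sample in (b"a\nb\n", b"a\nb"):
        print(sample, wc_lines(sample), index_lines(sample))
```

The two counts agree when the data ends with a newline and differ by one when it does not.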

magic file

A magic file is provided so that the Linux file command correctly identifies gztool's index files: append it to your /etc/magic file (or overwrite that file if it is empty), or append/copy it to your home directory as ~/.magic (note the leading dot in the name).

Other tools which try to provide random access to gzipped files

Other interesting links

Version

This version is v1.6.1.

Please, read the Disclaimer. In case of any errors, please open an issue.

License

A work by Roberto S. Galende
distributed under the same License terms covering
zlib from Mark Adler (aka Zlib license):
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.

/* zlib.h -- interface of the 'zlib' general purpose compression library
  version 1.2.11, January 15th, 2017
  Copyright (C) 1995-2017 Jean-loup Gailly and Mark Adler
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution.
  Jean-loup Gailly        Mark Adler
  jloup@gzip.org          madler@alumni.caltech.edu
  The data format used by the zlib library is described by RFCs (Request for
  Comments) 1950 to 1952 in the files http://tools.ietf.org/html/rfc1950
  (zlib format), rfc1951 (deflate format) and rfc1952 (gzip format).
*/

Disclaimer

This software is provided "as is", without warranty of any kind, express or implied. In no event will the authors be held liable for any damages arising from the use of this software.

Author

by Roberto S. Galende

on code by Mark Adler's zlib.