circulosmeos / gztool

extract random-positioned data from gzip files with no penalty, including gzip tailing like with 'tail -f' !
https://circulosmeos.wordpress.com/2019/08/11/continuous-tailing-of-a-gzip-file-efficiently/

Only extract a subset of lines using gztool #9

Closed: morispi closed this issue 3 years ago

morispi commented 3 years ago

Hi,

I've been searching for a little while for a way to perform seek and tell operations on gzip files, which I need for one of my projects, and finally found your work. Thank you so very much for this!

However, it seems like the extraction functionalities always output the whole file, starting from the given line / offset. I would need a way to use gztool to, say, for instance, "Extract lines between 12 and 18", or "Extract content between offset 125 and offset 1500", but I don't seem to find any option allowing this?

I'm currently browsing through the source code trying to see if I can slightly alter it to fit my needs, but maybe I'm just overlooking an option that already exists?

Thanks, Pierre

circulosmeos commented 3 years ago

Hi @morispi, good to hear gztool is useful!

> I would need a way to use gztool to, say, for instance, "Extract lines between 12 and 18", or "Extract content between offset 125 and offset 1500", but I don't seem to find any option allowing this?

No, there's currently no option for that. Since gztool is a command-line tool, I've found it sufficient to pipe its output to head to get the first lines, or to dd to get the first bytes. If that isn't enough for you, I think it is feasible to add the option in a future gztool version, though I need to think about it carefully. I'd probably use -r # (r for range), combined with -[bL], to indicate the end line/byte (and maybe -R # to indicate the number of lines/bytes to extract, or vice versa).
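The pipe approach can also be driven from a script. What follows is a minimal Python sketch, assuming gztool is on the PATH and writes the extracted data to standard output. The helper name `head_lines` is hypothetical and not part of gztool. It starts a command and keeps only the first lines of its output, much like piping to head.

```python
import subprocess
from itertools import islice


def head_lines(command, max_lines):
    """Run `command` and return at most `max_lines` lines of its stdout (as bytes)."""
    with subprocess.Popen(command, stdout=subprocess.PIPE) as proc:
        # islice stops reading once max_lines lines have been consumed.
        lines = list(islice(proc.stdout, max_lines))
        # The rest of the output is not needed, so stop the producer early.
        proc.terminate()
    return lines
```

Under those assumptions, `head_lines(["gztool", "-L", "12", "file.gz"], 7)` would return lines 12 to 18, since a range with both ends included spans 18 - 12 + 1 = 7 lines.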

circulosmeos commented 3 years ago

Hi @morispi I've published v1.3:

v1.3 lets you specify the maximum number of bytes to extract (-r #) or the maximum number of lines to extract (-R #) when using -[bL] (that is, when extracting from a given byte (-b #) or from a given line (-L #)). All four parameters -[bLrR] accept SI suffixes such as m for millions, k for 1000, etc. (uppercase for powers of 2: ...GMK), as well as prefixes for octal (0) and hexadecimal (0x) numbers.
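To make the suffix and prefix rules concrete, here is a rough Python sketch of how such values could be interpreted. It is not gztool's actual parser: the function name `parse_size`, the restriction to the k/m/g and K/M/G suffixes, and the truncation of fractional results are all assumptions made for illustration.

```python
from decimal import Decimal

# Assumed suffix table: lowercase for powers of 10, uppercase for powers of 2.
SUFFIXES = {
    "k": 10**3, "m": 10**6, "g": 10**9,
    "K": 2**10, "M": 2**20, "G": 2**30,
}


def parse_size(text):
    """Convert strings such as '2k', '4.9M', '0x10' or '010' to an integer."""
    if text[:2].lower() == "0x":
        return int(text, 16)  # hexadecimal prefix
    if len(text) > 1 and text[0] == "0" and text.isdigit():
        return int(text, 8)  # octal prefix
    multiplier = 1
    if text and text[-1] in SUFFIXES:
        multiplier = SUFFIXES[text[-1]]
        text = text[:-1]
    # Decimal keeps values like 3.7 exact; the fraction is truncated (assumption).
    return int(Decimal(text) * multiplier)
```

With these assumptions, `parse_size("2k")` gives 2000 and `parse_size("4.9M")` gives 5138022 (4.9 × 2^20, truncated).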

For example, to extract lines between 12 and 18 (both included, so 7 lines): gztool -L 12 -R 7 file.gz

and to extract content between offset 125 and offset 1500: gztool -b 125 -r 1375 file.gz
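Because -r takes a byte count rather than an end offset, the count is the difference between the two offsets (1500 - 125 = 1375). A small Python sketch of that arithmetic follows. The helper name `byte_range_args` is hypothetical, and it only builds the argument list from the -b and -r options described above without running anything.

```python
def byte_range_args(start, end, path):
    """Build a gztool argument list covering `end - start` bytes from `start`."""
    if end < start:
        raise ValueError("end must not be smaller than start")
    count = end - start  # -r expects a number of bytes, not an end offset
    return ["gztool", "-b", str(start), "-r", str(count), path]
```

For instance, `byte_range_args(125, 1500, "file.gz")` yields the same arguments as the command above and could be passed to `subprocess.run` on a system where gztool is installed.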

Also, one can start on a line and indicate a range of bytes, or vice versa:
gztool -L 3.7m -r 2k file.gz
gztool -b 4.9M -R 7m file.gz

Any feedback is welcome :-)

circulosmeos commented 3 years ago

Since v1.4, just published, -A can be used with -[rR] to indicate absolute values, so for your examples:

To extract lines between 12 and 18 (both included!): gztool -L 12 -AR 18 file.gz

and to extract content between offset 125 and offset 1500: gztool -b 125 -Ar 1500 file.gz