Genivia / ugrep

NEW ugrep 6.5: a more powerful, ultra fast, user-friendly, compatible grep. Includes a TUI, Google-like Boolean search with AND/OR/NOT, fuzzy search, hexdumps, searches (nested) archives (zip, 7z, tar, pax, cpio), compressed files (gz, Z, bz2, lzma, xz, lz4, zstd, brotli), pdfs, docs, and more
https://ugrep.com
BSD 3-Clause "New" or "Revised" License

[FR] support bzip3 #311

Closed kspalaiologos closed 10 months ago

kspalaiologos commented 10 months ago

Hi!

As a user of this tool (and also the author of the codec), I would benefit a lot from bzip3 support being added to the list of compressed tarball/text formats that can be searched.

https://github.com/kspalaiologos/bzip3

genivia-inc commented 10 months ago

Agreed! Added to the TODO list for an upcoming release.

I see that the bzip3 licensing has no conflict when used as a library linked with ugrep.

I will need to write up a new m4 file to check for libbzip3 with autotools (I prefer the good old autotools over cmake because I have more power to do custom stuff that cmake can't do or does clumsily). I suppose the m4 file can be based on ugrep/m4/ax_check_bzlib.m4 that I wrote.

Thank you for your suggestion.

genivia-inc commented 10 months ago

Can you point me to the official bzip3 API documentation? I am having trouble locating the API for use in C/C++.

It would be nice if the API supports streaming decompression so I can use API calls similar to gzip and bzip2.

Edit: the comments in libbz3.h are useful. These are Doxygen formatted, so I suspect there is a doc posted somewhere?

genivia-inc commented 10 months ago

Making some progress looking through your doc/ and the low-level API in libbz3.h, but have one question:

What is meant by "maximum block size" in the bzip3 file format? Is this the maximum size of the compressed blocks or the uncompressed blocks in a compressed file?

kspalaiologos commented 10 months ago

Hi, the pseudocode for decoding a .bz3 file made with the bzip3 command line frontend is as follows:

  1. Read a 5-byte signature; if it is not BZ3v1, bail out.
  2. Read a 4-byte number as the block size, using the procedure below. If it is not between 65 KiB and 511 MiB, bail out.
  3. Make a new state using bz3_new(block_size) and a buffer with malloc that is as big as bz3_bound(block_size).
  4. While not at EOF: a) read a 4-byte new size followed by a 4-byte old size, using the procedure below; b) make sure they are consistent, i.e. old_size > bz3_bound(block_size), new_size > bz3_bound(block_size), or a negative value triggers an error; c) read as many bytes as new_size says; d) call bz3_decode_block(state, buffer, new_size, old_size); e) consume the buffer.
// read a 32-bit little-endian value (the "procedure below" referenced above);
// u8, u32 and s32 are the fixed-width integer typedefs from the bzip3 sources
static s32 read_neutral_s32(const u8 * data) {
    return ((u32)data[0]) | (((u32)data[1]) << 8) | (((u32)data[2]) << 16) | (((u32)data[3]) << 24);
}
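
Put together, a minimal decoding loop along these lines might look as follows. This is a sketch only, assuming the 4-argument bz3_decode_block(state, buffer, new_size, old_size) call described above (later libbz3 versions may differ); error handling and output consumption are simplified for illustration.

// sketch of the decoding steps above; not the authoritative decoder
#include <cstdio>
#include <cstdint>
#include <cstdlib>
#include <cstring>
extern "C" {
#include <libbz3.h>
}

static uint32_t read_u32le(const uint8_t *p) { // little-endian 32-bit read
  return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

int decode_bz3(FILE *in) {
  uint8_t hdr[9];
  if (fread(hdr, 1, 9, in) != 9 || memcmp(hdr, "BZ3v1", 5) != 0)
    return -1;                                      // 1. check the 5-byte signature
  uint32_t block_size = read_u32le(hdr + 5);        // 2. read the block size
  if (block_size < 65 * 1024 || block_size > 511 * 1024 * 1024)
    return -1;
  struct bz3_state *state = bz3_new(block_size);    // 3. allocate state and buffer
  uint8_t *buf = (uint8_t*)malloc(bz3_bound(block_size));
  uint8_t sizes[8];
  int err = 0;
  while (fread(sizes, 1, 8, in) == 8) {             // 4. per-block loop until EOF
    uint32_t new_size = read_u32le(sizes);          //    a) compressed size and
    uint32_t old_size = read_u32le(sizes + 4);      //       original size
    if (new_size > bz3_bound(block_size) ||         //    b) consistency check
        old_size > bz3_bound(block_size) ||
        fread(buf, 1, new_size, in) != new_size ||  //    c) read the compressed block
        bz3_decode_block(state, buf, new_size, old_size) < 0) { // d) decode in place
      err = -1;
      break;
    }
    fwrite(buf, 1, old_size, stdout);               //    e) consume old_size output bytes
  }
  free(buf);
  bz3_free(state);
  return err;
}
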
genivia-inc commented 10 months ago

Implemented and tests look good so far when searching a tar.bz3 file.

Two comments when using the API:

  1. bz3_decode_block() returns old_size, which is not documented in libbz3.h. I'm using the return value to verify that the call succeeded by checking that it equals old_size. However, the return type is signed, so perhaps a negative value indicates an error? It's a bit confusing.
  2. read_neutral_s32() returns a signed integer. I'm using my unsigned little-endian u32() decoders instead. Checking the sizes this way needs no separate test for negativity: a negative value in the input wraps to a high unsigned integer that exceeds the threshold. I assume the sizes in the file are meant to be unsigned (negative sizes make no sense), so I think you should just update your docs to use an unsigned read_neutral_u32() instead, for example:
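
For illustration, such an unsigned reader might look like this (a sketch; the u8/u32 typedefs from the bzip3 sources are replaced with <cstdint> types):

// unsigned variant of read_neutral_s32(): a negative "size" in the input
// wraps to a large unsigned value and simply fails the bound check
static uint32_t read_neutral_u32(const uint8_t *data) {
  return ((uint32_t)data[0]) | (((uint32_t)data[1]) << 8) |
         (((uint32_t)data[2]) << 16) | (((uint32_t)data[3]) << 24);
}
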
kspalaiologos commented 10 months ago

If bz3_decode_block returns -1 then an error occurred and the error message can be queried with bz3_strerror.
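
In code, a caller might check for that as follows (a sketch; it assumes the state, buffer, and sizes from the decoding loop above and the same 4-argument bz3_decode_block() call):

// sketch: treat -1 from bz3_decode_block() as an error and report the
// message from bz3_strerror(); assumes <cstdio>, <cstdint> and <libbz3.h>
static bool decode_one_block(struct bz3_state *state, uint8_t *buf,
                             uint32_t new_size, uint32_t old_size) {
  if (bz3_decode_block(state, buf, new_size, old_size) == -1) {
    fprintf(stderr, "bzip3 decode error: %s\n", bz3_strerror(state));
    return false; // caller should stop reading further blocks
  }
  return true;
}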

genivia-inc commented 10 months ago

I've committed an update with bzip3 support included (brotli was added yesterday).

The C++ source code that implements the decompressor is in ugrep/src/zstream.hpp. I've also written an m4/ax_check_bzip3lib.m4 script.

One more note: to compile with C++ and link libbzip3 I had to do this:

// if we have libbzip3
#ifdef HAVE_LIBBZIP3
extern "C" {
#include <libbz3.h>
}
#endif

which is a bit ugly. Perhaps you would be willing to add an extern "C" guard to libbz3.h to make it cleaner for C++ projects to use this library.
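
For reference, the standard guard that makes a C header directly usable from C++ looks like this (shown here only to illustrate the request):

/* standard C++ linkage guard for a C header */
#ifdef __cplusplus
extern "C" {
#endif

/* ... the C declarations: bz3_new(), bz3_decode_block(), ... */

#ifdef __cplusplus
}
#endif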

It would be great if you could also check this committed version by downloading the repo and running ./build.sh to build it. Let's see if you can find any issues. That would be much appreciated!

kspalaiologos commented 10 months ago

Hi, that should not be the case:

https://github.com/kspalaiologos/bzip3/blob/master/include/libbz3.h#L44

... but a new release has not been published since then.

genivia-inc commented 10 months ago

I'm using Homebrew to install bzip3, and that copy definitely has no extern "C" in it; I've looked into /opt/homebrew/include/libbz3.h, which was freshly installed on my Mac M1 Pro.

genivia-inc commented 10 months ago

The bzip3 tool reports version 1.3.2, so I have the latest release.

genivia-inc commented 10 months ago

Bzip3 has a very good compression ratio, congrats!

100000000 enwik8 (original uncompressed)
 22677651 enwik8.bz3
 24865244 enwik8.xz
 25742899 enwik8.br
 29008758 enwik8.bz2
 31129606 enwik8.zst
 36475811 enwik8.gz

These (updated) results are produced with best compression (option -9).

But decompression is not particularly fast:

time ugrep -zc '' enwik8.bz3
1128024
8.261u 0.060s 0:08.31 100.1%

time ugrep -zc '' enwik8.br
1128024
0.256u 0.017s 0:00.26 100.0%

time ugrep -zc '' enwik8.xz
1128024
1.150u 0.025s 0:01.12 104.4%

time ugrep -zc '' enwik8.bz2
1128024
2.031u 0.021s 0:02.00 102.5%

time ugrep -zc '' enwik8.zst
1128024
0.181u 0.021s 0:00.18 111.1%

time ugrep -zc '' enwik8.gz
1128024
0.352u 0.015s 0:00.31 116.1%

The decompression is done with bz3_decode_block() in ugrep/src/zstream.hpp, which does not decompress multiple bzip3 blocks in parallel, as the documentation hints is possible. I'm already using threads to search recursively in parallel. Perhaps parallel decompression would help, but I worry that recursive searches already use a lot of CPU resources, and spinning up and down more threads for bz3_decode_blocks() adds overhead.
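
For what it's worth, a rough sketch of per-block parallel decompression could look like the following. This is only an illustration, not how zstream.hpp works: one bz3_state and one scratch buffer per thread, since a state cannot be shared between threads, with blocks read sequentially and their outputs consumed in file order afterwards.

// sketch: decode a batch of already-read bzip3 blocks concurrently,
// one bz3_state per thread because a state cannot be shared across threads
#include <cstdint>
#include <thread>
#include <vector>
extern "C" {
#include <libbz3.h>
}

struct Bz3Block {                 // one compressed block read from the file
  std::vector<uint8_t> data;      // sized bz3_bound(block_size), holds new_size bytes
  uint32_t new_size, old_size;    // sizes from the 8-byte per-block header
};

void decode_batch(std::vector<Bz3Block> &blocks, uint32_t block_size) {
  std::vector<std::thread> workers;
  for (Bz3Block &b : blocks)
    workers.emplace_back([&b, block_size]() {
      struct bz3_state *state = bz3_new(block_size);           // per-thread state
      bz3_decode_block(state, b.data.data(), b.new_size, b.old_size);
      bz3_free(state);
    });
  for (std::thread &t : workers)
    t.join();                     // outputs are then searched in file order
}

A real implementation would reuse a pool of states and worker threads rather than allocating one per block, which is exactly the thread start/stop overhead mentioned above.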

kspalaiologos commented 10 months ago

The slow decompression is not always a problem, but sadly it's still a shame. For some of my applications however, this does not matter, as I use bzip3 for long term storage, NCD, etc...

All of my backups are stored on WD Blue drives, which are notoriously slow. The index of all the filenames (and, recursively, of tar file contents) is multiple gigabytes, so that I can search for files quickly, i.e. find . | bzip3 > index.bz3 and then search it for the filename. bzip3 manages to compress that index to a few hundred megabytes, but still: most of the overhead is reading the drive, and while the 1st bzip3 thread is decompressing, the 2nd is awaiting disk I/O. I also use a custom system as an alternative to solid and non-solid archives, where deduplication is performed on a multi-gigabyte boundary with LSH file ordering; to decompress a single file with this custom archiver you only need to decompress the header (to find its location in the archive) and then, in the worst case, decompress a few blocks around it to patch in its deduplicated parts :-). Of course, when grepping, these kinds of tricks don't exactly work, but if you ask ugrep to search for a substring somewhere in the middle, the deduplication part probably hasn't taken care of it.

genivia-inc commented 10 months ago

The slow decompression is not always a problem, but sadly it's still a shame. For some of my applications however, this does not matter, as I use bzip3 for long term storage, NCD, etc...

I agree that storage on a slow filesystem is a valid use case for bzip3-compressed files, especially since the compression ratio is excellent.

Perhaps you might be interested in the ugrep-indexer sub-project. I plan to include compression support, so that the contents of compressed files can be indexed too; I will get back to work on that project soon. Indexing helps to speed up recursive searches. On the other hand, when we already know which file(s) have matches and specify them as search targets, indexing isn't helpful. Indexing requires extra space, but this can be tuned up or down. Indexing compressed files will speed up searching, but it is a bit counter-intuitive: we gain space with compression and then give some of it back to the index. I'm also considering an option to place a single combined index file in one spot rather than one index per directory (there are pros and cons for each).

kspalaiologos commented 10 months ago

Tried it out just now and it seems to be working well. Thanks!

genivia-inc commented 9 months ago

A bit of bad news, I'm afraid.

Some problems have been reported with bzip3; see #316, which reports that the bzip3 tests did not pass. For the next release, v4.3.4, bzip3 will not be used by ugrep unless ./build.sh --with-bzip3 is specified to explicitly include bzip3 decompression.