codyps / hash-roll

A variety of content chunking algorithms with a common API in rust
GNU Affero General Public License v3.0
18 stars · 2 forks

BuzHash doesn't actually hash #3

Open BartMassey opened 6 years ago

BartMassey commented 6 years ago

The BuzHash and BuzHashBuf structs need to be clearly marked in the documentation as not yet working. There is no hash function yet, as far as I can tell.

I almost tried to use this thing in a real implementation. Fortunately, I went and looked at the source to find out what cyclic polynomial was being used.

It would probably be best to move buzhash to a development branch until it is finished.

leeola commented 4 years ago

Thanks for the warning, this almost caught me too.

codyps commented 4 years ago

Hi folks. I've just completed a major re-write of all the chunking algorithms included in hash-roll (which is now somewhat misnamed).

As a result, it now passes input characters through a selectable 32-bit mapping.

It probably still needs some help, though: there is not currently a validated buzhash test vector included (we only test against ourselves, to ensure we don't unexpectedly change behavior).

The current release (0.3.0, which is the rewrite) includes a program to generate test data in the form the current test vectors use (cargo run --example generate-test-data 0 32768 >test_data_0_32K.bin). It would be great if we could validate that things match up with the buzhash impl you're looking at (and if not, let me figure out how to extend the current impl to support it).

BartMassey commented 4 years ago

After skimming the code, I'm not quite following what's been done.

The "normal" way to test a rolling hash is to apply the non-rolling version to some interior window and compare it with the rolling version on that same window. For a BuzHash implementation this would presumably be a CRC code. Again, what polynomial and initial value are assumed for this BuzHash implementation? It's not obvious to me how the code works.

This is my old implementation https://github.com/BartMassey/rolling-crc-rs . Perhaps it would be useful to you somehow. It's not fast, but it's been checked pretty well.

codyps commented 4 years ago

Interesting. I had looked at the implementations of buzhash in both attic and borg. Both of these use a simple hashing transform (a table lookup) and perform separate hashing over the content of the chunk. That's basically the design used currently in hash-roll: it assumes the internal hash is a "throw away", and not used outside of deciding chunk boundaries. That decision is contained entirely within the hash.
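For anyone skimming along, that design (an internal, throw-away rolling hash consulted only to pick chunk boundaries) can be sketched roughly like this. The names, byte transform, and mask rule below are illustrative, not hash-roll's actual API:

```rust
// Sketch of the "internal hash is throw-away" design: the rolling hash
// value is only consulted to decide chunk boundaries and never exposed
// to callers. The transform and mask rule are illustrative stand-ins.
fn find_boundaries(data: &[u8], mask: u32) -> Vec<usize> {
    let mut h: u32 = 0;
    let mut boundaries = Vec::new();
    for (i, &b) in data.iter().enumerate() {
        // toy transform: any byte -> u32 mapping would do here
        h = h.rotate_left(1) ^ (b as u32).wrapping_mul(0x9E37_79B9);
        // boundary rule used by many content-defined chunkers:
        // split when the low bits of the hash are all zero
        if h & mask == 0 {
            boundaries.push(i + 1);
            h = 0; // restart hashing for the next chunk
        }
    }
    boundaries
}

fn main() {
    let data: Vec<u8> = (0u32..10_000).map(|i| (i.wrapping_mul(31) >> 3) as u8).collect();
    // mask with 6 low bits set -> a boundary roughly every 64 bytes on average
    let cuts = find_boundaries(&data, 0x3F);
    println!("found {} boundaries", cuts.len());
}
```

The point of the sketch is only that the hash value never escapes `find_boundaries`: callers see byte offsets, so the internal hash can change without affecting the external API.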

Is it the case that you'd like to be able to obtain the value of the rolling hash as well as the chunk boundaries?

Also, the other distinct concept here appears to be a hash that has "history" (correct me if my examination here is off base; I'm skimming your code): the hash function takes as its input both its previous value and an updated value. This doesn't appear to be quite described by the forms used in Jonathan D. Cohen, "Recursive Hashing Functions for n-Grams", ACM Trans. Inf. Syst. 15 (3), 1997. "Hashing by Cyclic Polynomials" there is what I believe is generally taken as the definition of a buzhash.

It seems like rolling-crc is a somewhat different type of construction from buzhash.

To answer directly: while buzhash is called "Hashing by Cyclic Polynomials", its construction isn't formed in the same way as a CRC. Take a look at the paper I linked for a math-y definition, but it's probably easier to examine some other buzhash implementations (I've linked them in the hash-roll docs for the buzhash module) or the equations on Wikipedia.

BartMassey commented 4 years ago

The first sentence of the Wikipedia article: "A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input." BuzHash is such a rolling hash; the paper you referenced talks about this as well (see the first equation on p. 295).

The idea is that you want to find the hash of each n-Gram (window of size n) in a text of length m in time O(m) instead of O(mn). So you adjust the hash of the previous n-Gram to be the hash of the current n-Gram in constant time for each byte. For BuzHash, which is a CRC polynomial hash, this involves updating using a table of CRC polynomial values. (See section 2.3 of the referenced paper.)
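The check described here (hash each window both from scratch and by rolling, and compare) can be sketched like this. The transform function and window size are made up for illustration, and the recurrence is the textbook buzhash-style rotate-and-XOR form rather than hash-roll's actual code:

```rust
// Verify a rolling hash the "normal" way: for every window, the O(1)
// rolling update must agree with an O(k) from-scratch recomputation.
const K: usize = 16; // window size (illustrative)

// deterministic stand-in for a table of random u32s
fn table(b: u8) -> u32 {
    let x = (b as u32 ^ 0xA5A5_A5A5).wrapping_mul(0x9E37_79B9);
    x ^ (x >> 15)
}

// O(k) from-scratch hash of one window
fn hash_window(win: &[u8]) -> u32 {
    win.iter().fold(0u32, |h, &b| h.rotate_left(1) ^ table(b))
}

fn main() {
    let data: Vec<u8> = (0..1000u32).map(|i| (i * 7 % 251) as u8).collect();

    // rolling: O(1) per byte once the first window is hashed
    let mut h = hash_window(&data[..K]);
    for i in K..data.len() {
        let out = data[i - K]; // byte leaving the window
        let inn = data[i];     // byte entering the window
        h = h.rotate_left(1) ^ table(out).rotate_left((K % 32) as u32) ^ table(inn);
        // compare against the O(k) recomputation of the same window
        assert_eq!(h, hash_window(&data[i + 1 - K..=i]));
    }
    println!("rolling hash matches from-scratch hash for all windows");
}
```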

I haven't looked at the Attic code in a long time, and don't recall looking at Borg at all.

I do not need the rolling hash value (I originally came here just to compare my rolling-crc-rs with your BuzHash implementation). However, it seems like you'd have to compute chunk boundaries using a rolling CRC (perhaps implemented using an ordinary CRC and running in O(nm) time) if you want the chunk boundaries to match those produced by other implementations, or if you want the distribution/quality of chunks to be well-randomized.

Like I say, I don't quite understand your code: maybe you're already doing all this in a way I don't understand. I'm easily confused.

Anyway, don't mean to be difficult — apologies if I came across that way. Let me know if there's some way I can be of assistance.

leeola commented 4 years ago

If more implementation references are desired: https://github.com/silvasur/buzhash/blob/master/hash.go is what Noms DB and Dolt DB use. Though I can't comment on anything :D, just adding a link.

codyps commented 4 years ago

> If more implementation references are desired: https://github.com/silvasur/buzhash/blob/master/hash.go is what Noms DB and Dolt DB use. Though I can't comment on anything :D, just adding a link.

Cool, always appreciate additional references to buzhash (and other algorithm) users. silvasur/buzhash ~~is also used by attic~~. I'll need to reorganize the docs in the buzhash module a bit.

Edit: Apparently I've gotten my references mixed up. I really need to straighten them out (attic appears to use its own C impl).

codyps commented 4 years ago

> Anyway, don't mean to be difficult — apologies if I came across that way. Let me know if there's some way I can be of assistance.

You're not coming across as difficult. I appreciate the feedback & insight into what you're looking for.

> The first sentence of the Wikipedia article: "A rolling hash (also known as recursive hashing or rolling checksum) is a hash function where the input is hashed in a window that moves through the input." BuzHash is such a rolling hash; the paper you referenced talks about this as well (see the first equation on p. 295).
>
> The idea is that you want to find the hash of each n-Gram (window of size n) in a text of length m in time O(m) instead of O(mn). So you adjust the hash of the previous n-Gram to be the hash of the current n-Gram in constant time for each byte. For BuzHash, which is a CRC polynomial hash, this involves updating using a table of CRC polynomial values. (See section 2.3 of the referenced paper.)

Section 2.3 is the section I'm referencing.

I don't think the paper is calling for the use of a CRC. Here's an excerpt that is informing my interpretation:

> The symbol transformation is a lookup table containing “random” elements of GF(2)[x]/(x**w + 1), indexed by the ordinal value of its argument. In the author’s practice, the table consists of words whose contents are filled from a computer’s random-number generator. In this manner, each symbol’s contribution is scattered across the word, changing about half of the bits, on average.

My reading here is that the transform from symbols (bytes) into the field is done with random values that aren't related to any CRC function.

It seems that the author of the paper is using the GF(2) polynomial construction to describe the operations taking place, but there doesn't appear to be a similar construction to CRC (no real generator polynomial, etc). It's possible I'm misreading though: my math background is limited.
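Under that reading, building the transform is just filling 256 words from a random-number generator, with no generator polynomial involved. A sketch (the xorshift generator and the seed are arbitrary stand-ins, not anything from the paper or from hash-roll):

```rust
// Build a symbol-transform table the way the paper describes: 256 words
// filled from a PRNG, indexed by byte value. The xorshift32 generator
// here is just a stand-in for "a computer's random-number generator".
fn build_table(seed: u32) -> [u32; 256] {
    let mut state = seed.max(1); // xorshift state must be non-zero
    let mut table = [0u32; 256];
    for entry in table.iter_mut() {
        // xorshift32 step (Marsaglia)
        state ^= state << 13;
        state ^= state >> 17;
        state ^= state << 5;
        *entry = state;
    }
    table
}

fn main() {
    let t = build_table(0xDEAD_BEEF);
    // each symbol's contribution should flip about half the bits on average
    let avg_bits: u32 = t.iter().map(|v| v.count_ones()).sum::<u32>() / 256;
    println!("average set bits per entry: {avg_bits}");
}
```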

Directly to the windowing question: the code should already have the O(m) behavior you've described. It's a bit tricky to follow because I've implemented two slightly different types of APIs.

https://github.com/jmesmon/hash-roll/blob/6a11f54bfddcfc144a23217a2cdb4a59c8e7523a/src/buzhash.rs#L192-L215

    fn add_buf<H: BuzHashHash>(&mut self, data: &[u8], params: &BuzHash<H>, i: usize) {
        if i >= params.k {
            // need to find and "remove" an entry
            let drop_i = i - params.k;
            let drop = data[drop_i];
            self.add_overflow(params, data[i], drop);
        } else {
            // no removal
            self.add(params, data[i]);
        }
    }

    // insert, assuming no overflow
    fn add<H: BuzHashHash>(&mut self, params: &BuzHash<H>, v: u8) {
        self.h = self.h.rotate_left(1) ^ params.h.hash(v);
    }

    // insert with overflow
    fn add_overflow<H: BuzHashHash>(&mut self, params: &BuzHash<H>, add_v: u8, remove_v: u8) {
        let h = self.h.rotate_left(1);
        // need to find and "remove" an entry
        let drop = params.h.hash(remove_v).rotate_left((params.k % 8) as u32);
        self.h = h ^ drop ^ params.h.hash(add_v);
    }

This is an internal API, but I think it covers the item at discussion.

For simplicity, consider add_buf() to be the "entrypoint" here. It decides whether it needs to "remove" previous data from the hash: add() is addition without a removal, and add_overflow() is addition with a removal.

(For those following along who aren't familiar.) The ^ here corresponds to addition (+) in the paper's equations. Our params.h is the paper's T function (the transform). The paper's H is the entire add_buf() operation (with self.h being the output). The rotate_left is called a "barrel shift" in the paper and corresponds to multiplication by x.
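Written out in the paper's notation, my reading of that correspondence is the recurrence below (this is a transcription of my understanding, not a quote from the paper; w is the word width, k the window size, T the transform):

```latex
% addition (+) is XOR; multiplication by x is rotate_left(1);
% working in GF(2)[x]/(x^w + 1), so x^w = 1 and rotations wrap.
H_i = x \cdot H_{i-1} + x^{k} \cdot T(s_{i-k}) + T(s_i)
```

Over GF(2) addition and subtraction coincide, which is why XORing in the rotated transform of the outgoing byte removes it from the hash.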

(Note that for splitting purposes, add_buf() isn't always the entry point: to decide where to split, self.h is examined after each byte addition that is permitted to generate a split.)