aidanhs / rsroll

https://crates.io/crates/rollsum

Different implementation? #3

Closed dpc closed 7 years ago

dpc commented 8 years ago

For rdedup I'll need bigger average chunks (smaller chunks leak more information about the data). Before I start to dig into it, do you have any recommendations, or plans to add a different algorithm yourself?

aidanhs commented 8 years ago

I'm probably going to implement a Rabin fingerprint at some point, but you don't need a different algorithm to make larger chunks (I apologise if any of the below is obvious or if I've missed your point):

Currently the bup rollsum uses the following by default:

/// Default chunk size used by `bup` (log2)
pub const CHUNK_BITS: u32 = 13;

pub const CHUNK_SIZE: u32 = 1 << CHUNK_BITS;
[...]
    pub fn find_chunk_edge(&mut self, buf: &[u8]) -> Option<usize> {
        self.find_chunk_edge_cond(buf, |e: &Bup|
            (e.digest() & (CHUNK_SIZE - 1)) == (CHUNK_SIZE - 1)
        )
    }

And here is a paragraph I wrote many moons ago about split sizes:

Of course, the example rule of 4 consecutive bytes of 0 is a very weak condition and exhibits pathological behaviour with a file consisting of long streams of 0's. An alternative is to slide an n-byte (say 128-byte) window over the file, taking a cryptographically secure hash of the window at each byte offset, and define a split point to be wherever this hash (assuming output values range from 0 to T) falls between 0 and T/2^13. Given that the hash is cryptographically secure and its output values are uniformly distributed, the probability of a split at any given offset is 2^-13, implying that on average there will be a split once every 2^13 bytes (8KB). This split rule can be scaled appropriately to give chunk sizes as requested.

So you just need to use find_chunk_edge_cond with an appropriate CHUNK_SIZE. For example, a value of 1 << 17 (the value I used when implementing something like rdedup) gives you a split ~every 131KB. Assuming my maths is correct!
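To make that concrete, here is a rough sketch of picking ~128KiB average chunks with a custom condition. It assumes the `Bup` type and the `find_chunk_edge_cond` signature quoted above; the constant and function names are just illustrative, and a fresh `Bup` is created per chunk so the sketch doesn't depend on whether the engine resets its own state after an edge.

use rollsum::Bup;

// ~128KiB average chunks: split when the low 17 bits of the digest are all
// ones, so the per-byte split probability is 2^-17.
const MY_CHUNK_BITS: u32 = 17;
const MY_CHUNK_MASK: u32 = (1 << MY_CHUNK_BITS) - 1;

// Return the byte offsets of every chunk edge found in `data`
// (hypothetical helper, not part of the crate).
fn split_offsets(data: &[u8]) -> Vec<usize> {
    let mut offsets = Vec::new();
    let mut pos = 0;
    while pos < data.len() {
        // Fresh rolling-sum state for each chunk.
        let mut bup = Bup::new();
        match bup.find_chunk_edge_cond(&data[pos..], |e: &Bup| {
            (e.digest() & MY_CHUNK_MASK) == MY_CHUNK_MASK
        }) {
            Some(edge) => {
                pos += edge;
                offsets.push(pos);
            }
            None => break, // no edge in the remainder; it forms the final chunk
        }
    }
    offsets
}

With a per-byte split probability of 2^-17 the expected chunk size is 2^17 = 131072 bytes, which matches the ~131KB figure above.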

dpc commented 8 years ago

Ah, right. I guess I can use that for now. Thanks.

aidanhs commented 8 years ago

Is there some longer-term solution you'd prefer? When I implement Rabin I'll be using the traits you kindly contributed, so usage will look basically the same.

One of the things I liked about your rusty redesign was the ability to give your own split conditions - different people will have different needs.
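Just to illustrate what that might look like, here is a hypothetical sketch of a caller-supplied split condition going through a generic trait. The `Engine` name and its methods below are assumptions modelled on the `find_chunk_edge_cond` shape quoted earlier, not a statement of the crate's actual API.

// Hypothetical sketch only: the `Engine` trait here is an assumption about
// what a shared chunking trait might expose, not the crate's real interface.
pub trait Engine {
    // Feed one byte into the rolling hash.
    fn roll_byte(&mut self, byte: u8);
    // Current rolling hash value.
    fn digest(&self) -> u32;

    // Scan `buf` and return the length up to and including the first byte
    // at which `cond` holds, if any.
    fn find_chunk_edge_cond<F>(&mut self, buf: &[u8], cond: F) -> Option<usize>
    where
        F: Fn(&Self) -> bool,
        Self: Sized,
    {
        for (i, &b) in buf.iter().enumerate() {
            self.roll_byte(b);
            if cond(self) {
                return Some(i + 1);
            }
        }
        None
    }
}

// The caller picks its own split rule independently of the engine, e.g.
// ~8KiB average chunks, whether the engine is bup-style or a future Rabin
// implementation.
fn first_edge<E: Engine>(engine: &mut E, buf: &[u8]) -> Option<usize> {
    engine.find_chunk_edge_cond(buf, |e| (e.digest() & 0x1fff) == 0x1fff)
}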

dpc commented 8 years ago

I don't think I have any sophisticated requirements. For whatever reason I thought that using Bup with bigger chunks might not be enough, but now I see it's totally feasible.

dpc commented 7 years ago

Hello again! Work has slowly gone into optimizing rdedup, and recently we've reached a point where it seems that even with all the other overhead (encryption, compression, IO), the rolling sum is our bottleneck (https://github.com/dpc/rdedup/pull/91).

I am seeking advice on how to proceed. Is Rabin faster? Are you aware of any other algorithms with better performance? rdedup now has support for different chunking, compression, encryption etc., so the plan is to add more algorithms over time.

aidanhs commented 7 years ago

I'll comment on the issue over on your repo :)

dpc commented 7 years ago

Thanks!

dpc commented 7 years ago

This is WIP, and I think we can just close.