aidanhs / rsroll

https://crates.io/crates/rollsum

Combining efforts around data storage / deduplication? #9

Open · dpc opened this issue 7 years ago

dpc commented 7 years ago

Hi everyone,

I've created this issue under rsroll because my initial idea was to introduce the authors of three different software packages focused on rolling sums / CDC to each other; rsroll is the one that rdedup currently uses, and "the oldest" as far as I know. However, while writing this post it occurred to me that expanding the whole theme toward general data deduplication, or even the data industry (backup, data storage, etc.), might make sense.

Let me start by stating my opinion that Rust is very well positioned to excel in the data industry. The general attitude toward reliability, the ease of writing parallel code while avoiding copies, zero-cost abstractions, the possibility of exposing C APIs, etc. make it a great tool for the job.

@aidanhs is the author of rollsum/rsroll, and I am aware that he is currently busy with more Rust-core related work.

@jmesmon is the author of https://github.com/jmesmon/hash-roll, a quite comparable crate that already has more rolling sum algorithms implemented, and I really like some of the abstractions there: https://docs.rs/hash-roll/0.2.0/hash_roll/trait.Splitter.html

On Reddit I got in touch with @RAOF, who is working on https://github.com/RAOF/chunky, which is focused on Content Defined Chunking, a subject closely related to rolling sums.

I am the author of https://github.com/dpc/rdedup, a data deduplication engine / backup tool that has reached a level of maturity and functionality comparable to more established alternatives written in other programming languages. @phillipCouto is the biggest contributor to rdedup.

So my question would be: would you be interested in cooperating? There is no need to commit to anything (like the time that we are all probably always short on).

In practical terms: for starters, we could create a GitHub community around data / data deduplication for Rust. I would move rdedup under that umbrella. On top of that, we could work on combining all three libraries under a broader "data deduplication algorithms" scope (rollsum, CDC, etc.).

We could ping other authors of backup and data tools in Rust, have an IRC channel, etc., just like other topic-centered communities (gamedev/graphics, embedded, webdev, networking, etc.) do.

One way or another, thanks for your attention and all the good work!

dpc commented 7 years ago

@remram44 is the author of https://github.com/remram44/adler32-rs , which seems to be used by https://gitlab.wellbehavedsoftware.com/well-behaved-software/rzbackup .

aidanhs commented 7 years ago

@dswd is the author of https://github.com/dswd/zvault which has a few chunking implementations at https://github.com/dswd/zvault/tree/master/chunking/src

aidanhs commented 7 years ago

I'd be OK with a shared org; my interest is more in the algorithms than in the tools.

I can imagine different people wanting different things from their tools though, so maybe it makes sense to keep them separate? Though then there would only be one crate under the org, the one containing the algorithms (and maybe my rollsum testing/benching repo at https://github.com/aidanhs/rollsum-tests?).

I don't particularly have strong opinions, but it would be nice to deduplicate some of the work here ;)

remram44 commented 7 years ago

Indeed a few of those crates are implementing the same thing. I agree that it would be best to have at least a common interface or trait for chunking algorithms, so we all speak the same language.

Looking over the APIs for the crates you linked, none of them seem general enough. My use case is iterating on chunks streamed out of a Read object (without reading the whole thing to memory first).

Interestingly enough, I have another implementation here which is used like this (you can see it reuses the buffer and never holds more than BUF_SIZE in memory at once).
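
To make that concrete, here is a minimal sketch of that kind of streaming interface (CutPointFinder and find_cut are invented placeholders for the rolling-hash state, not code from any of the crates mentioned): the caller reads into a reused fixed-size buffer, and only chunk lengths ever accumulate.

use std::io::Read;

const BUF_SIZE: usize = 4096;

// Stand-in for whatever rolling-hash state a real chunker keeps.
struct CutPointFinder;

impl CutPointFinder {
    fn new() -> Self {
        CutPointFinder
    }

    // Some(offset) if a chunk boundary falls inside `data`, where the offset
    // is one past the last byte of the finished chunk (so it is >= 1);
    // None means the current chunk continues past this buffer.
    fn find_cut(&mut self, data: &[u8]) -> Option<usize> {
        // Placeholder policy: never cut. A real implementation would update
        // a rolling hash over `data` and cut where the hash matches a mask.
        let _ = data;
        None
    }
}

fn chunk_lengths<R: Read>(mut input: R) -> std::io::Result<Vec<u64>> {
    let mut buf = [0u8; BUF_SIZE]; // reused; at most BUF_SIZE bytes live at once
    let mut finder = CutPointFinder::new();
    let mut lengths = Vec::new();
    let mut current: u64 = 0;

    loop {
        let n = input.read(&mut buf)?;
        if n == 0 {
            break;
        }
        let mut rest = &buf[..n];
        // Walk the freshly read bytes, emitting a length at every boundary.
        while let Some(cut) = finder.find_cut(rest) {
            lengths.push(current + cut as u64);
            current = 0;
            rest = &rest[cut..];
        }
        current += rest.len() as u64;
    }
    if current > 0 {
        lengths.push(current); // trailing partial chunk
    }
    Ok(lengths)
}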

It's good to see an organized effort happening here!

RAOF commented 7 years ago

I think there are at least two top-level traits here (not actual Rust):

trait ZeroCopyChunker {
    type OutputIterator : Iterator<Item = &[u8]>;
    fn chunk(data : &[u8]) -> Self::OutputIterator
}

and

trait Chunker {
    type InnerOutputIterator : Iterator<Item = u8>;
    type OutputIterator : Iterator<Item = Self::InnerOutputIterator>;
    fn chunk<T : Iterator<Item = u8>>(data : T) -> Self::OutputIterator
}

The former allows zero-copy slicing of a buffer into its component chunks, but requires a contiguous buffer to start with; the latter is more general and allows a finer-grained input buffer.

It's trivial to adapt any R : Read into an iterator for the second one, for example.
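
Spelling those sketches out as compiling Rust might look roughly like the following (the lifetime handling and bounds are one guess among several, not a settled design; the helper at the end shows the Read-to-iterator adaptation):

use std::io::Read;

// One way to make the zero-copy variant work without generic associated
// types: the trait carries the lifetime of the input buffer, so the yielded
// slices can borrow from it.
trait ZeroCopyChunker<'a> {
    type OutputIterator: Iterator<Item = &'a [u8]>;
    fn chunk(&mut self, data: &'a [u8]) -> Self::OutputIterator;
}

// General variant: consumes any byte iterator and yields owned chunks.
// Vec<u8> stands in for the "inner iterator of bytes" to keep this concrete.
trait Chunker {
    type OutputIterator: Iterator<Item = Vec<u8>>;
    fn chunk<I: Iterator<Item = u8>>(&mut self, data: I) -> Self::OutputIterator;
}

// The "trivial" Read-to-iterator adaptation (error handling waved away).
fn byte_iter<R: Read>(reader: R) -> impl Iterator<Item = u8> {
    reader.bytes().map(|b| b.expect("read error"))
}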

dpc commented 7 years ago

@remram44 You might be interested in an article I wrote about how this is done in rdedup (using the current rsroll crate): https://dpc.pw/blog/2017/04/rusts-fearless-concurrency-in-rdedup/. It is all based on a scatter-gather abstraction (https://github.com/dpc/rdedup/blob/master/sgdata/src/lib.rs) that allows massive zero-copy parallelism. I looked at dhstore, and while it addresses a different problem, I'm guessing that 90% of the deduplication functionality would be shared. I even wonder if you could reuse rdedup-lib as a whole and make the DHT part a backend. Just a thought.

@RAOF Yes, that looks like a nice, idiomatic API. Also, we will have to distinguish between rolling sum engine traits and CDC traits.

dswd commented 7 years ago

@RAOF Are you sure the compiler will be able to optimize the Chunker interface into a memcpy? To me it looks too complicated for the compiler to make this optimization, and if it can't, anything using Chunker will be awfully slow.

If I were to propose a generic chunker interface, it would be something along these lines:

enum ChunkResult {
    Continue(usize),
    Stop
}

trait Chunker {
    fn chunk(in: &mut Read, out: &mut Write) -> io::Result<ChunkResult>;
    fn chunk_size(in: &mut Read) -> io::Result<ChunkResult>;
}

RAOF commented 7 years ago

@dswd Depending on the internal implementation it should be quite simple for the compiler to optimise. For example, with InnerOutputIterator = Vec::IntoIter<u8> the chunk() function would just return something morally equivalent to data.take(next_cut_point).collect::<Vec<u8>>().into_iter();

Edit: Particularly, I'd expect trait Chunker to primarily be used via compile-time monomorphisation, rather than through a trait object. In that case the compiler knows all the types involved, and can optimise better than going through a &mut Read trait object (which requires dynamic dispatch, and doesn't give the compiler the same wealth of information).

dswd commented 7 years ago

So how about this?

trait Chunker {
    fn chunk<R: Read, W: Write>(in: &mut R, out: &mut W) -> io::Result<ChunkResult>;
    fn chunk_size<R: Read>(in: &mut R) -> io::Result<ChunkResult>;
}

I actually tested the performance difference between trait objects and monomorphization in my code, and it came out to be almost zero. Maybe this is because the calls to Read::read and Write::write_all are not hot, so one layer of indirection does not hurt here?
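
That would match the intuition that the indirection costs one virtual call per read/write while the bulk of the copying happens inside those calls. As a neutral illustration (not code from any of the crates discussed), these two functions do the same work and only the second pays dynamic dispatch:

use std::io::{self, Read, Write};

// Monomorphized: the compiler knows the concrete R and W and can inline the
// read/write calls.
fn drain_generic<R: Read, W: Write>(input: &mut R, output: &mut W) -> io::Result<u64> {
    io::copy(input, output)
}

// Trait objects: the same work, plus one indirect call per read/write.
fn drain_dyn(input: &mut dyn Read, output: &mut dyn Write) -> io::Result<u64> {
    io::copy(input, output)
}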

RAOF commented 7 years ago

The out parameter is pretty weird; what is it writing?

For the general interface I think an Iterator is more appropriate - you're not always going to be chunking files, and it's easy to go from Read to Iterator but a bit harder to go the other way.

dswd commented 7 years ago

The out parameter is receiving the same data that is being read from the in parameter, so when used in this mode, the chunker is just streaming the data from in to out in chunks of its liking.

RAOF commented 7 years ago

Huh. Why is it doing that? What's the point of copying from in to out when in could just have an internal buffer?

dswd commented 7 years ago

It still needs an internal buffer, as reading and writing bytewise is too slow, but this way that internal buffer can be of a small fixed size: it does not have to fit a whole chunk, only the window of the rolling hash.
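
For concreteness, a toy version of that mode could look like this (a fixed-length cut stands in for the rolling hash, &mut self is added so a real chunker could carry its hash state, and the exact Continue/Stop semantics are a guess, since they are not pinned down above):

use std::io::{self, Read, Write};

enum ChunkResult {
    Continue(usize),
    Stop,
}

// Streams one chunk's worth of bytes from `input` to `output` through a
// small reused buffer, never holding a whole chunk in memory.
struct FixedSizeChunker {
    chunk_len: usize,
}

impl FixedSizeChunker {
    fn chunk<R: Read, W: Write>(&mut self, input: &mut R, output: &mut W) -> io::Result<ChunkResult> {
        let mut buf = [0u8; 4096]; // small and fixed, like a rolling-hash window
        let mut copied = 0usize;
        while copied < self.chunk_len {
            let want = (self.chunk_len - copied).min(buf.len());
            let n = input.read(&mut buf[..want])?;
            if n == 0 {
                // Input exhausted; whatever was copied so far is the final chunk.
                return Ok(ChunkResult::Stop);
            }
            output.write_all(&buf[..n])?;
            copied += n;
        }
        Ok(ChunkResult::Continue(copied))
    }
}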

RAOF commented 7 years ago

I guess I should probably read the code in context, because I still don't see the point of copying from in to out rather than having a large-enough buffer on in - 20MB would fit any conceivable chunk size and would be perfectly reasonable.

remram44 commented 7 years ago

I'm thinking about an API like this: https://github.com/remram44/cdchunking-rs/blob/fb72120869ab0ba070b30079e4d9ab735f0fb011/src/lib.rs#L19-L65

remram44 commented 7 years ago

If you're reading from a file and writing chunks out, either to separate files on disk or to the network, requiring a buffer that can hold whole chunks at a time is completely unnecessary. Having a way of getting chunk data a bit at a time and knowing when a new chunk starts is perfectly doable, and I don't believe it complicates the API unnecessarily (and of course both ways can be exposed).
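
As a sketch of what "a bit at a time, with boundary markers" could look like (names invented for illustration, not taken from any of the crates above):

// The consumer receives pieces of chunk data as they are read, interleaved
// with boundary markers, so nothing ever has to buffer a whole chunk.
enum Piece<'a> {
    Data(&'a [u8]), // part of the current chunk
    ChunkEnd,       // the current chunk just ended; the next Data starts a new one
}

// Example consumer: hash every chunk as it streams past, holding no chunk data.
fn hash_chunks<'a, I: Iterator<Item = Piece<'a>>>(pieces: I) -> Vec<u64> {
    use std::collections::hash_map::DefaultHasher;
    use std::hash::Hasher;

    let mut hashes = Vec::new();
    let mut hasher = DefaultHasher::new();
    let mut in_chunk = false;
    for piece in pieces {
        match piece {
            Piece::Data(bytes) => {
                hasher.write(bytes);
                in_chunk = true;
            }
            Piece::ChunkEnd => {
                hashes.push(hasher.finish());
                hasher = DefaultHasher::new();
                in_chunk = false;
            }
        }
    }
    if in_chunk {
        hashes.push(hasher.finish()); // trailing chunk with no explicit marker
    }
    hashes
}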

dpc commented 7 years ago

I think for starters we should move the discussion to a separate issue and discuss the API there. :)

aidanhs commented 7 years ago

Another one: @green-coder is the owner of https://github.com/green-coder/cdc

aidanhs commented 7 years ago

One thought on the general structure - as pointed out in https://github.com/aidanhs/rsroll/pull/11/files#r131031976, rolling checksums (the namesake of this repo) like bupsplit, rabin, and gear are a subset of the algorithms used for chunking (which also include fastcdc etc.).

So it probably makes sense to separate out the different expected APIs, possibly into different crates - one for rolling checksums (possibly with an API similar to the one in this crate), one for the higher-level concept of chunking (with APIs dealing with readers/writers and so on).

What do people think of this structure? If there's agreement from a few people then the org and two repos can be created and API discussion can move to the appropriate place. Or, if people have other thoughts, I'd love to hear them!


There is also the question of licenses, which I just realised is an important one - if we can't agree then it's difficult to collaborate :) For libraries like this repo I tend to prefer 'permissive' licenses (BSD, Apache, MIT). I don't particularly have a license preference for tools/applications, if we decide the org is the correct place to host them.

dpc commented 7 years ago

I am personally a GPL-ish person, but since this whole effort is meant to be a backbone of the Rust ecosystem, it only makes sense to make it as permissive as possible. On top of that, we can build our own toys with the licenses of our choosing.

remram44 commented 7 years ago

rolling checksums [...] like bupsplit, rabin, gear are a subset of algorithms used for chunking (which include fastcdc etc).

Note that chunking is also a subset of what rolling checksums are good at (for instance, in rsync, they are used to find the original chunks in modified files in linear time).

A good model might be to keep the rolling checksums in their own crates, and have a chunking crate that depends on those? With each particular algorithm gated behind a cargo feature?

I'm working towards that in my tentative cdchunking crate, where a ChunkerImpl trait will be implemented for each rolling hash.
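
A rough sketch of that layering (the ChunkerImpl name comes from the previous sentence; the method signatures, module names, and feature names are illustrative only, not the actual crate):

// The chunking crate owns the trait; each rolling hash lives behind a cargo
// feature and would pull in its own crate as an optional dependency.

/// One implementation per rolling hash; finds cut points in successive
/// pieces of input.
pub trait ChunkerImpl {
    /// Some(offset) if a chunk boundary falls inside `data`, None otherwise.
    fn find_boundary(&mut self, data: &[u8]) -> Option<usize>;
    /// Reset internal state before starting a new chunk.
    fn reset(&mut self);
}

#[cfg(feature = "bup")]
pub mod bup {
    pub struct BupChunker { /* rolling sum state, split mask, ... */ }
}

#[cfg(feature = "gear")]
pub mod gear {
    pub struct GearChunker { /* gear table, current hash, mask, ... */ }
}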

remram44 commented 7 years ago

On the other hand, rolling hashes are rather simple software-wise, merely algorithms with very few dependencies, so putting that code inside the chunking crate (possibly behind cargo features) should be easy as well.

dswd commented 7 years ago

As stated in #11, I am OK with licensing my chunking code under BSD/MIT, although I would prefer LGPL.

Also, I would contribute my code to a planned chunking crate if the crate provides an interface that is efficient for my use case.

dpc commented 7 years ago

Please move the API discussion to #13. Thank you! :)