Open dpc opened 7 years ago
@remram44 is the author of https://github.com/remram44/adler32-rs , which seems to be used by https://gitlab.wellbehavedsoftware.com/well-behaved-software/rzbackup .
@dswd is the author of https://github.com/dswd/zvault which has a few chunking implementations at https://github.com/dswd/zvault/tree/master/chunking/src
I'd be ok with a shared org, my interest is more around the algorithms rather than tools.
I can imagine different people wanting different things from their tools though, so maybe it makes sense to keep them separate? Though then there would only be one crate under the org, the one containing the algorithms (and maybe my rollsum testing/benching repo at https://github.com/aidanhs/rollsum-tests?).
I don't particularly have strong opinions, but it would be nice to deduplicate some of the work here ;)
Indeed a few of those crates are implementing the same thing. I agree that it would be best to have at least a common interface or trait for chunking algorithms, so we all speak the same language.
Looking over the APIs for the crates you linked, none of them seem general enough. My use case is iterating over chunks streamed out of a `Read` object (without reading the whole thing into memory first).
Interestingly enough, I have another implementation here which is used like this (you can see it reuses the buffer, and never holds more than `BUF_SIZE` bytes in memory at once).
It's good to see an organized effort happening here!
I think there are at least two top-level traits here (not actual Rust):

```rust
trait ZeroCopyChunker {
    type OutputIterator: Iterator<Item = &[u8]>;
    fn chunk(data: &[u8]) -> Self::OutputIterator;
}
```

and

```rust
trait Chunker {
    type InnerOutputIterator: Iterator<Item = u8>;
    type OutputIterator: Iterator<Item = Self::InnerOutputIterator>;
    fn chunk<T: Iterator<Item = u8>>(data: T) -> Self::OutputIterator;
}
```

The former allows zero-copy slicing of a buffer into its component chunks, but requires a contiguous buffer to start with; the latter is more general and allows a finer-grained input buffer.
It's trivial to adapt any `R: Read` into an iterator for the second one, for example.
@remram44 You might be interested in an article I wrote about how this is done in rdedup (using the current rsroll crate): https://dpc.pw/blog/2017/04/rusts-fearless-concurrency-in-rdedup/. It is all based on a scatter-gather abstraction, https://github.com/dpc/rdedup/blob/master/sgdata/src/lib.rs, which allows massive zero-copy parallelism. I looked at dhstore, and while it tries to address a different problem, I'm guessing that 90% of the deduplication functionality could be shared. I even wonder if you could reuse rdedup-lib as a whole, and make the DHT part a backend. Just a thought.
@RAOF, yes, that looks like a nice, idiomatic API. Also, we will have to distinguish between rolling-sum engine traits and CDC traits.
@RAOF Are you sure the compiler will be able to optimize the `Chunker` interface into a memcpy? To me it looks too complicated for the compiler to make this optimization. If it can't, anything using `Chunker` will be awfully slow.
If I were to propose a generic chunker interface, it would be something along these lines:

```rust
enum ChunkResult {
    Continue(usize),
    Stop,
}

trait Chunker {
    fn chunk(&mut self, input: &mut dyn Read, output: &mut dyn Write) -> io::Result<ChunkResult>;
    fn chunk_size(&mut self, input: &mut dyn Read) -> io::Result<ChunkResult>;
}
```
@dswd Depending on the internal implementation, it should be quite simple for the compiler to optimise. For example, with `InnerOutputIterator = std::vec::IntoIter<u8>`, the `chunk()` function would just return something morally equivalent to `data.take(next_cut_point).collect::<Vec<u8>>().into_iter()`.

Edit: In particular, I'd expect `trait Chunker` to primarily be used via compile-time monomorphisation rather than through a trait object. In that case the compiler knows all the types involved, and can optimise better than it can through a `&mut Read` trait object (which requires dynamic dispatch, and doesn't give the compiler the same wealth of information).
So how about this?

```rust
trait Chunker {
    fn chunk<R: Read, W: Write>(&mut self, input: &mut R, output: &mut W) -> io::Result<ChunkResult>;
    fn chunk_size<R: Read>(&mut self, input: &mut R) -> io::Result<ChunkResult>;
}
```
I actually tested the performance difference between trait objects and monomorphization in my code, and it came out to be almost zero. Maybe this is because the calls to `Read::read` and `Write::write_all` are not hot, so one layer of indirection does not hurt here?
The `output` parameter is pretty weird; what is it writing?
For the general interface I think an `Iterator` is more appropriate: you're not always going to be chunking files, and it's easy to go from `Read` to `Iterator`, but a bit harder to go the other way.
The `output` parameter receives the same data that is being read from the `input` parameter, so when used in this mode, the chunker is just streaming the data from `input` to `output` in chunks of its liking.
Huh. Why is it doing that? What's the point of copying from `input` to `output` when `input` could just have an internal buffer?
It still needs an internal buffer, as reading and writing bytewise is too slow, but this way that internal buffer can be of a small fixed size: it only has to fit the window of the rolling hash, not a whole chunk.
I guess I should probably read the code in context, because I still don't see the point of copying from `input` to `output` rather than having a large enough buffer on `input`: 20 MB would fit any conceivable chunk size, and would be perfectly reasonable.
I'm thinking about an API like this: https://github.com/remram44/cdchunking-rs/blob/fb72120869ab0ba070b30079e4d9ab735f0fb011/src/lib.rs#L19-L65
If you're reading from a file and writing chunks out, either to separate files on disk or to the network, requiring a buffer that can hold whole chunks at a time is completely unnecessary. Having a way to get chunks a bit at a time, and to know when a new chunk starts, is perfectly doable, and I don't believe it complicates the API unnecessarily (and of course both ways can be exposed).
I think for starters, we should move the discussion to a separate issue and discuss the API there. :)
Another one: @green-coder is the owner of https://github.com/green-coder/cdc
One thought on the general structure: as pointed out in https://github.com/aidanhs/rsroll/pull/11/files#r131031976, rolling checksums (the namesake of this repo) like bupsplit, rabin, and gear are a subset of the algorithms used for chunking (which also include fastcdc etc.).
So it probably makes sense to separate out the different expected APIs, possibly into different crates: one for rolling checksums (possibly with an API similar to the one in this crate), and one for the higher-level concept of chunking (with APIs dealing with readers/writers and so on).
What do people think of this structure? If there's agreement from a few people then the org and two repos can be created and API discussion can move to the appropriate place. Or, if people have other thoughts, I'd love to hear them!
There is also the question of licenses, which I just realised is an important one - if we can't agree then it's difficult to collaborate :) For libraries like this repo I tend to prefer 'permissive' licenses (BSD, Apache, MIT). I don't particularly have a license preference for tools/applications, if we decide the org is the correct place to host them.
I am personally a GPL-ish person, but since this whole effort is meant to be a backbone of the Rust ecosystem, it only makes sense to make it as permissive as possible. On top of it, we can build our own toys with licenses of our choosing.
> rolling checksums [...] like bupsplit, rabin, gear are a subset of algorithms used for chunking (which include fastcdc etc).
Note that chunking is also a subset of what rolling checksums are good at (for instance, in rsync, they are used to find the original chunks in modified files in linear time).
A good model might be to keep the rolling checksums in their own crates, and have a chunking crate that depends on those? With each particular algorithm gated behind a cargo feature?
I'm working towards that in my tentative cdchunking crate, where a `ChunkerImpl` trait will be implemented for each rolling hash.
On the other hand, rolling hashes are rather simple software-wise, merely algorithms with very few dependencies, so putting that code inside the chunking crate (possibly behind cargo features) should be easy as well.
As stated in #11, I am ok with licensing my chunking code under BSD/MIT although I would prefer LGPL.
Also, I would contribute my code to a planned chunking crate if the crate provides an interface that is efficient for my use case.
Please move the API discussion to #13. Thank you! :)
Hi everyone,
I've created this issue under rsroll because my initial idea was to introduce the authors of three different software packages focused on rolling sums / CDC, and rsroll is the one that rdedup is currently using, and "the oldest" to my knowledge. However, when writing this post, it occurred to me that maybe expanding the whole theme toward general data deduplication, or even the data industry (backup, data storage, etc.), might make sense.

Let me start by stating my opinion that Rust is very well positioned to excel in the data industry. The general attitude toward reliability, the ease of writing parallel code and avoiding copying, zero-cost abstractions, the possibility of exposing C APIs, etc., make it a great tool for the job.
@aidanhs is the author of rollsum / rsroll, and I am aware that he is currently busy with more Rust-core-related work.

@jmesmon is the author of https://github.com/jmesmon/hash-roll, which is quite a comparable crate that has more rolling sum algorithms already implemented, and I really like some of the abstractions there: https://docs.rs/hash-roll/0.2.0/hash_roll/trait.Splitter.html
On reddit I got in touch with @RAOF, who is working on https://github.com/RAOF/chunky, which is focused on Content Defined Chunking, a subject closely related to rolling sums.
I am the author of https://github.com/dpc/rdedup, a data deduplication engine / backup tool that has reached a level of maturity and functionality comparable to more established alternatives in other programming languages. @phillipCouto is the biggest contributor to rdedup.

So my question would be: would you be interested in cooperating? There is no need to commit to anything (like time, which we are all probably always short on).
In practical terms: for starters, we could create a GitHub community around data / data deduplication for Rust. I would move rdedup under that umbrella. On top of that, we could work on combining all three libraries under a broader "data deduplication algorithms" effort (rollsum, CDC, etc.).

We could ping other authors of backup and data tools in Rust, have an IRC channel, etc., just like other topic-centered communities (gamedev/graphics, embedded, webdev, networking, etc.) do.
One way, or another, thanks for your attention and all the good work!