borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

file type based chunking #1005

Open ThomasWaldmann opened 8 years ago

ThomasWaldmann commented 8 years ago

depending on the file type (filename extension), we could use different chunkers.

e.g. for uncompressed tar files, we could split at file boundaries to better deduplicate different tars that share a lot of files.
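
As a purely hypothetical illustration of how cheap that boundary information would be to obtain (borg has no interface to accept such hints today), a sketch in Python:

```python
# Hypothetical sketch: derive chunk-cut hints from an uncompressed tar's
# member boundaries. Borg cannot consume such hints; this only shows how
# easily they could be produced.
import sys
import tarfile

def tar_cut_points(path):
    """Yield byte offsets where a chunker could cut: the start of each
    member's data and the end of its 512-byte-padded data area."""
    with tarfile.open(path, mode="r:") as tf:  # "r:" refuses compressed tars
        for member in tf:
            yield member.offset_data
            padded = (member.size + 511) // 512 * 512
            yield member.offset_data + padded

if __name__ == "__main__":
    for offset in sorted(set(tar_cut_points(sys.argv[1]))):
        print(offset)
```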

Ideas by @dragetd (I removed the parts looking into compressed formats or into filesystem images - going there would involve exact reproduction of compressed archives / of the filesystem image which is not trivial and not in the scope of this ticket):

It could chunk along boundaries of files not only within TARs, [...] header information of media-files that might change more likely compared to the data-stream; [...] EXIF-tags and some raw image formats might contain previews that might fit neatly into a chunk… etc. etc.

I could see some real benefits. And funky heuristics and support for certain filetypes could be added incrementally without harming any compatibility. And yet, it would require a lot of work and knowledge of all the various formats… and it needs to be crafted carefully so the parser will robustly only output suggested chunk-sizes to the chunker, and not be prone to security issues when trying to parse dozens of filetypes.

dhouck commented 7 years ago

I think this could be very useful. I often have the same file in some container and out, and this could make those more likely to be deduplicated at least when the container isn’t compressed.

As for knowing a lot about several different file types, if you make it sufficiently modular that it’s easy to contribute for some file types, especially if separately-built plugins are possible, that makes the problem much easier.
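
A minimal sketch of what such a dispatch could look like (all names are hypothetical, nothing like this exists in borg):

```python
# Hypothetical plugin registry: map filename suffixes to functions that
# yield suggested cut offsets; unknown suffixes get no hints at all.
from typing import BinaryIO, Callable, Dict, Iterator

CHUNK_HINT_PLUGINS: Dict[str, Callable[[BinaryIO], Iterator[int]]] = {}

def register_hint_plugin(*suffixes):
    def decorator(func):
        for suffix in suffixes:
            CHUNK_HINT_PLUGINS[suffix.lower()] = func
        return func
    return decorator

@register_hint_plugin(".tar")
def tar_hints(fileobj):
    ...  # e.g. walk the tar headers as in the sketch in the top post

def hints_for(filename, fileobj):
    # fall back to plain content-defined chunking (no hints) when unknown
    suffix = "." + filename.rsplit(".", 1)[-1].lower() if "." in filename else ""
    plugin = CHUNK_HINT_PLUGINS.get(suffix)
    return plugin(fileobj) if plugin else iter(())
```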

RonnyPfannschmidt commented 7 years ago

If users could opt in to considering even compressed files, I think this could save quite a bit (I am fine with an equivalent restore).

dhouck commented 7 years ago

So am I, but it would need to be off by default and would probably be harder to implement, so getting it working in general before tackling compression is probably a good idea.

srd424 commented 6 years ago

I've just started looking at Borg for storing disk images, and this is an interesting idea. One approach that occurs to me is a method to provide "hints" to the chunker, say in the form of a file/stream of offsets of the ends of extents/records in the file being stored. This way the per-file-type logic could live outside of the Borg source tree.

My first target for something like this would be to hack up partclone or e2image to spit out information on the extents inside an ext4 image. Something similar could presumably be easily done for tar files.
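
A minimal sketch of that hints idea, assuming a sidecar file with one byte offset per line produced by an external tool (no such interface exists in borg):

```python
# Hypothetical "hints" consumer: an external tool (e2image/partclone/tar
# parser, ...) writes record/extent end offsets, one per line, and the
# chunker cuts there, never exceeding a maximum chunk size.
def read_hints(hints_path):
    with open(hints_path) as f:
        return sorted(int(line) for line in f if line.strip())

def chunk_with_hints(data_path, hints_path, max_chunk=8 * 1024 * 1024):
    """Yield chunks of data_path cut at the hinted offsets."""
    with open(data_path, "rb") as f:
        pos = 0
        for end in read_hints(hints_path):
            while end > pos:
                chunk = f.read(min(end - pos, max_chunk))
                if not chunk:
                    return
                pos += len(chunk)
                yield chunk
        while True:  # data past the last hint
            tail = f.read(max_chunk)
            if not tail:
                return
            yield tail
```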

Interestingly, at the moment e2image output stores much more nicely in borg than partclone images do, even with partclone's embedded checksumming disabled. I'm wondering if the runs of zeros that e2image spits out for unused blocks are effectively signalling "end of chunk" to the algorithm?

ThomasWaldmann commented 6 years ago

@srd424 that doesn't sound related to this ticket (see top post), because you provide file/extent boundaries but not file type information when dealing with images.

How many zeros you need to trigger a chunk cut could be determined experimentally, by feeding more and more zeros into the chunker and determining when it starts producing more than one chunk.
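
Something along those lines, using borg 1.x chunker internals (not a stable API; the constructor signature may differ between versions):

```python
# Feed growing runs of zeros into the buzhash chunker until it produces
# more than one chunk. Chunker(seed, min_exp, max_exp, mask_bits, window)
# and .chunkify(fileobj) are internal borg 1.x details; seed 0 is just for
# illustration (real repos use a per-repo seed).
import io
from borg.chunker import Chunker  # internal module, not a public API

def min_zero_run_for_cut(params=(19, 23, 21, 4095), seed=0,
                         start=1 << 20, limit=1 << 26):
    n = start
    while n <= limit:
        chunker = Chunker(seed, *params)
        if sum(1 for _ in chunker.chunkify(io.BytesIO(bytes(n)))) > 1:
            return n  # doubling granularity; bisect between n/2 and n to refine
        n *= 2
    return None

print(min_zero_run_for_cut())
```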

callegar commented 5 years ago

To start with, even without the development of fine heuristics, would it make sense to allow the user to provide chunking hints on a file-by-file basis, either via a .borgattributes file (droppable into any directory, in the spirit of .gitattributes) or via some custom extended attributes?

Hints could include the name of a chunking algorithm (so as to provide for multiple chunking algorithms) and the corresponding parameters.

For instance, when backing up home I'd like to be able to tell borg to use a small chunking size for source files and a huge one for my music collection.
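
A minimal sketch of what such a lookup could look like (hypothetical file format, nothing in borg reads it; the parameter values are only illustrative):

```python
# Hypothetical .borgattributes syntax: shell-style patterns mapped to
# chunker params, last matching rule wins (mirroring gitattributes), e.g.
#
#   *.c       chunker=buzhash,12,18,15,4095
#   Music/**  chunker=buzhash,21,23,23,4095
import fnmatch

def parse_borgattributes(text):
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, params = line.split(None, 1)
        rules.append((pattern, params.removeprefix("chunker=")))
    return rules

def chunker_params_for(path, rules, default="buzhash,19,23,21,4095"):
    chosen = default
    for pattern, params in rules:
        if fnmatch.fnmatch(path, pattern):
            chosen = params
    return chosen
```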

ThomasWaldmann commented 5 years ago

@callegar not sure if it makes a big difference. Source files tend to be smaller than the usual 2 MiB target chunk size, and as 1 file means at least 1 chunk, the resulting chunk size is often just the file's size (if the file size is also below the minimum chunk size). Using a super-fine granularity (like a target chunk size of a few kiB) would vastly increase the number of chunks created and also their management overhead (memory usage, on-disk index file size). Attic had a 64 kiB target chunk size and people ran out of memory.
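
A back-of-the-envelope comparison (ignoring deduplication; the index and memory overhead grow roughly linearly with the chunk count):

```python
# Chunk counts for 1 TiB of data at the borg default target chunk size
# (2 MiB) vs. the old attic target (64 kiB).
data = 1 << 40                      # 1 TiB
for target in (2 << 20, 64 << 10):  # 2 MiB, 64 kiB
    print(f"{target >> 10:>5} kiB target -> ~{data // target:,} chunks")
# ~524,288 vs ~16,777,216 chunks: a 32x larger chunk index to manage.
```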

callegar commented 5 years ago

Well, I find this to be a sort of recurring pattern: a very large number of smaller files that change frequently, and a limited number of very large files that do not change at all but at most disappear and get replaced by other files that do not produce binary deltas wrt them (e.g. because published movies and videos are unique by definition, gz, tar.gz and png files are not suitable for binary deltas, git packages do not produce deltas wrt one another, etc.).

With this pattern, chunking can only pay off for the large sea of frequently changing files, and only as long as the chunk size is small. Unless the chunk size is small enough for the files that are actually delta-able, chunking risks ending up as mere overhead compared to a traditional file-based incremental backup.

The problem is that collecting stats is expensive by definition (it means duplicating an archive and recreating it with different chunking parameters). Still, I would be curious to recreate my archives with progressively increasing chunk sizes, starting at the attic defaults and going up to chunks so huge that borg effectively behaves as a file-based incremental backup, to see where the point of maximum size derivative occurs.
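
A sketch of that experiment, run against a throwaway copy of the repository (borg recreate rewrites archives in place), assuming borg >= 1.1 where `borg recreate --chunker-params` is available:

```python
# Re-chunk all archives with progressively coarser params and compare the
# "deduplicated size" reported by `borg info` after each run. Params are
# CHUNK_MIN_EXP,CHUNK_MAX_EXP,HASH_MASK_BITS,HASH_WINDOW_SIZE; the maximum
# chunk size is capped at 2^23 = 8 MiB. The repo path is hypothetical.
import subprocess

REPO = "/path/to/repo-copy"

for min_exp, mask_bits in [(10, 16), (14, 19), (19, 21)]:  # fine -> coarse
    params = f"{min_exp},23,{mask_bits},4095"
    subprocess.run(["borg", "recreate", "--chunker-params", params, REPO],
                   check=True)
    subprocess.run(["borg", "info", REPO], check=True)
```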

ThomasWaldmann commented 5 years ago

Yes, the relatively expensive content-defined chunking is sometimes more effort than it is worth in actual deduplication.

But, OTOH: we already had some file-extension based compression dispatcher, but later decided to rather use some automatic mechanism instead of having lots of configuration for all sorts of file extensions.

That said, I recently found a use case for a fixed block size chunker to work on lvm-snapshot differences, see #1086.

callegar commented 5 years ago

> we already had some file-extension based compression dispatcher, but later decided to rather use some automatic mechanism instead of having lots of configuration for all sorts of file extensions

To me the best idea would be to be able to configure things similarly to what git does with its gitattributes: i.e. having some borg attributes configured once and for all at the database level, with associations between file names and chunking parameters, and then having a way to override that by dropping a .borgattributes file here and there.

ThomasWaldmann commented 5 years ago

https://borgbackup.readthedocs.io/en/stable/usage/help.html#borg-help-compression

Some parts of borg currently assume that chunking does not vary on a per-file basis, but always (or at least per archive) uses the same params. IIRC borg diff and borg recreate.

phiresky commented 5 years ago

I just thought of this issue and #4193 again because I was trying to decide whether to create one borg repo/archive for all my data or multiple ones for different parts (for some things, like git repos, smaller chunking params are very beneficial). I'm afraid of creating a huge archive with small chunk sizes because it might then grow too large to work at all, or become very slow (how optimized are the chunk caches now?), while for my media collection the chunk size could basically be 100 MB without any problems.

The best solution is probably to implement either a global chunker-params patterns file (like --exclude-from) or, better but more complicated, something like (cascading) gitattributes, as @callegar says. The defaults can be decided later.

dragetd commented 5 years ago

A chunk is never larger than a single file, so things like small git files would not change much. I am not perfectly sure how git stores its blobs, but it might also store the SHA hashes, which would cause a lot of un-deduplicatable chunks. And the maximum chunk size would be 2^23 bytes = 8 MiB.

I think the benefit is actually smaller than we (myself included) initially thought.

It would be still a fun thing to try and benchmark.

If you put many machines into the same repository, you also risk losing all the data if the repository breaks somehow.

phiresky commented 5 years ago

Well, most data in git is stored in bundles (packs), which are concatenated lists of zlib-compressed files and deltas and are easily 100+ MB. Those packs are sometimes rewritten when there are too many loose objects, so for large repos there are a lot of few-KB chunks that get moved around between different files in a different order.

Admittedly I don't have that many huge repos that change all the time though, and larger dev files like node_modules should work fine regardless of chunking settings.

The only thing I actually tried where the difference was large was PostgreSQL backups, where having small chunks made a really large difference (a 200 GB borg repo shrank to 60 GB just by making the chunks smaller).
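
For reference, the chunker params can also be set per archive at create time; a hedged example (the values and paths are only illustrative, using the 4-value syntax of borg 1.1; newer versions also accept an algorithm prefix like `buzhash,...`):

```python
# Create an archive with fine-grained chunking (64 kiB target, attic-like).
import subprocess

subprocess.run(["borg", "create", "--chunker-params", "10,23,16,4095",
                "/path/to/repo::pgdump-{now}", "/var/backups/postgres"],
               check=True)
```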