anarcat / bup

Very efficient backup system based on the git packfile format, providing fast incremental saves and global deduplication (among and within files, including virtual machine images). Current release is 0.25, and the development branch is master. Please post patches to the mailing list for discussion (see below).

bup: It backs things up

bup is a program that backs things up. It's short for "backup." Can you believe that nobody else has named an open source program "bup" after all this time? Me neither.

Despite its unassuming name, bup is pretty cool. To give you an idea of just how cool it is, I wrote you this poem:

                         Bup is teh awesome
                      What rhymes with awesome?
                        I guess maybe possum
                       But that's irrelevant.

Hmm. Did that help? Maybe prose is more useful after all.

Reasons bup is awesome

bup has a few advantages over other backup software:

Reasons you might want to avoid bup

Getting started

From source

From binary packages

Binary packages of bup are known to be built for the following OSes:

Using bup

That's all there is to it!

Notes on FreeBSD

Notes on NetBSD/pkgsrc

Notes on Cygwin

Notes on OS X

How it works

Basic storage:

bup stores its data in a git-formatted repository. Unfortunately, git itself doesn't actually behave very well for bup's use case (huge numbers of files, files with huge sizes, and a need to retain file permissions/ownership), so we mostly don't use git's code except for a few helper programs. For example, bup has its own git packfile writer written in Python.

Basically, 'bup split' reads the data on stdin (or from files specified on the command line), breaks it into chunks using a rolling checksum (similar to rsync), and saves those chunks into a new git packfile. There is one git packfile per backup.
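
To make the chunking idea concrete, here is a minimal Python sketch of splitting data with an rsync-style rolling checksum. It is an illustration only, not bup's actual splitter: the function name `split_chunks`, the 64-byte window, and the 13-bit split mask are all assumptions chosen for the example.

```python
WINDOW = 64                         # bytes covered by the checksum (assumed)
SPLIT_MASK = (1 << 13) - 1          # split when the low 13 bits are ones (assumed)


def split_chunks(data, window=WINDOW, mask=SPLIT_MASK):
    """Yield content-defined chunks of `data` (a bytes object)."""
    s1 = s2 = 0                     # rsync-style sums over the last `window` bytes
    start = 0
    for i, byte in enumerate(data):
        # Byte that slides out of the window (zero until the window is full).
        old = data[i - window] if i >= window else 0
        s1 = (s1 + byte - old) & 0xFFFF
        s2 = (s2 + s1 - window * old) & 0xFFFF
        digest = (s1 << 16) | s2
        if (digest & mask) == mask:
            # The checksum hit the magic pattern: end the chunk here.
            yield data[start:i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]          # whatever is left after the last split
```

Because a split point depends only on the last few bytes seen, inserting data near the start of a file changes the first chunk or two and leaves the later chunks byte-for-byte identical. That is what lets the chunks be deduplicated.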

When deciding whether to write a particular chunk into the new packfile, bup first checks all the other packfiles that exist to see if they already have that chunk. If they do, the chunk is skipped.
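
The "already have that chunk" check can be sketched in a few lines of Python. A git blob id is the SHA-1 of the header `blob <size>\0` followed by the content, so identical chunks always get identical ids. The function `store_new_chunks` and the in-memory `known_ids` set are hypothetical stand-ins for this example; bup consults the existing packfile indexes rather than a Python set.

```python
import hashlib


def git_blob_id(data):
    """Return the hex id git would assign to a blob with this content."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def store_new_chunks(chunks, known_ids):
    """Return (id, chunk) pairs that still need writing.

    `known_ids` is a set of hex ids already stored somewhere; it is
    updated in place so repeats within `chunks` are skipped too.
    """
    to_write = []
    for chunk in chunks:
        oid = git_blob_id(chunk)
        if oid in known_ids:
            continue                # already stored: skip it
        known_ids.add(oid)
        to_write.append((oid, chunk))
    return to_write
```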

git packs come in two parts: the pack itself (.pack) and the index (.idx). The index is pretty small, and contains a list of all the objects in the pack. Thus, when generating a remote backup, we don't have to have a copy of the packfiles from the remote server: the local end just downloads a copy of the server's index files, and compares objects against those when generating the new pack, which it sends directly to the server.
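
As a rough illustration of why the index alone is enough, the Python sketch below pulls the object ids out of a version-2 git .idx file and works out which local objects the server is missing. It relies on git's version-2 index layout (a 4-byte magic, a version number, a 256-entry fanout table, then the sorted 20-byte ids) and ignores the CRC and offset tables that follow. The function names are made up for this example, and nothing here is bup's own code.

```python
import struct

IDX_V2_MAGIC = b"\xfftOc"
HEADER_SIZE = 8                     # magic + version
FANOUT_SIZE = 256 * 4               # 256 big-endian 32-bit counters


def read_idx_v2_ids(raw):
    """Return the set of hex object ids listed in a version-2 .idx file."""
    if raw[:4] != IDX_V2_MAGIC:
        raise ValueError("not a version-2 pack index")
    (version,) = struct.unpack(">I", raw[4:8])
    if version != 2:
        raise ValueError("unsupported index version %d" % version)
    fanout = struct.unpack(">256I", raw[HEADER_SIZE:HEADER_SIZE + FANOUT_SIZE])
    count = fanout[255]             # the last fanout entry is the object count
    base = HEADER_SIZE + FANOUT_SIZE
    if len(raw) < base + 20 * count:
        raise ValueError("index is truncated")
    return {raw[base + 20 * i:base + 20 * (i + 1)].hex() for i in range(count)}


def missing_on_server(local_ids, server_ids):
    """Return the local ids (in order) that the server does not have yet."""
    return [oid for oid in local_ids if oid not in server_ids]
```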

The "-n" option to 'bup split' and 'bup save' is the name of the backup you want to create, but it's actually implemented as a git branch. So you can do cute things like check out a particular branch using git, and receive a bunch of chunk files corresponding to the file you split.

If you use '-b' or '-t' or '-c' instead of '-n', bup split will output a list of blobs, a tree containing that list of blobs, or a commit containing that tree, respectively, to stdout. You can use this to construct your own scripts that do something with those values.

The bup index:

'bup index' walks through your filesystem and updates a file (whose name is, by default, ~/.bup/bupindex) to contain the name, attributes, and an optional git SHA1 (blob id) of each file and directory.
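
The general shape of such an index can be sketched in Python as follows. This is a toy model under stated assumptions: the real bupindex is a file on disk with its own format, while here the index is just a dict, and the name `build_index` and the choice of recorded attributes are invented for the example. It also skips special cases such as symlinks that point at directories.

```python
import os


def build_index(root):
    """Map each path under `root` to its attributes and an optional id."""
    index = {}
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()             # make the walk order deterministic
        paths = [dirpath] + [os.path.join(dirpath, f) for f in sorted(filenames)]
        for path in paths:
            st = os.lstat(path)     # lstat: describe a symlink, don't follow it
            index[path] = {
                "mode": st.st_mode,
                "size": st.st_size,
                "mtime_ns": st.st_mtime_ns,
                "sha": None,        # unknown until the file has been saved
            }
    return index
```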

'bup save' basically just runs the equivalent of 'bup split' a whole bunch of times, once per file in the index, and assembles a git tree that contains all the resulting objects. Among other things, that makes 'git diff' much more useful (compared to splitting a tarball, which is essentially a big binary blob). However, since bup splits large files into smaller chunks, the resulting tree structure doesn't exactly correspond to what git itself would have stored. Also, the tree format used by 'bup save' will probably change in the future to support storing file ownership, more complex file permissions, and so on.
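
For readers who haven't looked inside a git tree, here is a small Python sketch of how a plain git tree object is encoded and named. It shows git's ordinary layout only (one "mode name\0" entry plus a raw 20-byte id per child, sorted by name); it does not reproduce the extra chunk structure bup creates for large files. The helper names `git_object_id` and `encode_tree` are hypothetical.

```python
import hashlib


def git_object_id(kind, body):
    """Return the hex id of a git object, e.g. kind=b"blob" or b"tree"."""
    header = b"%s %d\0" % (kind, len(body))
    return hashlib.sha1(header + body).hexdigest()


def _sort_key(entry):
    mode, name, _ = entry
    # git orders tree entries as if directory names ended with "/".
    return name + b"/" if mode == b"40000" else name


def encode_tree(entries):
    """Encode (mode, name, hex_id) tuples as the body of a git tree object.

    `mode` is bytes such as b"100644" (file) or b"40000" (directory).
    """
    body = b""
    for mode, name, hex_id in sorted(entries, key=_sort_key):
        body += mode + b" " + name + b"\0" + bytes.fromhex(hex_id)
    return body
```

A tree's id is then `git_object_id(b"tree", encode_tree(entries))`, and a parent directory refers to it with mode `40000`.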

If a file has previously been written by 'bup save', then its git blob/tree id is stored in the index. This lets 'bup save' avoid reading that file to produce future incremental backups, which means it can go very fast unless a lot of files have changed.
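
The shortcut can be sketched in Python like this. The README doesn't say exactly how bup decides that a stored id is still valid, so the comparison of mode, size, and mtime below is an assumption, and `file_signature` and `plan_save` are hypothetical names. The sketch also assumes every indexed path still exists.

```python
import os


def file_signature(path):
    """Cheap fingerprint of a file taken from lstat, without reading it."""
    st = os.lstat(path)
    return (st.st_mode, st.st_size, st.st_mtime_ns)


def plan_save(index):
    """Split indexed paths into ids to reuse and files to read again.

    `index` maps path -> {"sig": <signature at last save>, "sha": <id or None>}.
    Returns (reuse, reread): a dict of path -> stored id, and a list of paths.
    """
    reuse, reread = {}, []
    for path, entry in index.items():
        unchanged = (entry.get("sha") is not None
                     and entry.get("sig") == file_signature(path))
        if unchanged:
            reuse[path] = entry["sha"]   # no need to open the file at all
        else:
            reread.append(path)          # new or modified: split it again
    return reuse, reread
```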

Things that are stupid for now but which we'll fix later

Help with any of these problems, or others, is very welcome. Join the mailing list (see below) if you'd like to help.

More Documentation

bup has an extensive set of man pages. Try using 'bup help' to get started, or use 'bup help SUBCOMMAND' for any bup subcommand (like split, join, index, save, etc.) to get details on that command.

For further technical details, please see ./DESIGN.

How you can help

bup is a work in progress and there are many ways it can still be improved. If you'd like to contribute patches, ideas, or bug reports, please join the bup mailing list.

You can find the mailing list archives here:

http://groups.google.com/group/bup-list

and you can subscribe by sending a message to:

bup-list+subscribe@googlegroups.com

Please see ./HACKING for additional information, e.g. how to submit patches (hint: no pull requests), how we handle branches, etc.

Have fun,

Avery