borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

read relative exclude paths from files within backup source #641

Open yrro opened 8 years ago

yrro commented 8 years ago

I currently back up my home machine with rsync and make use of its -F option.

       -F     The  -F  option is a shorthand for adding two --filter rules to your command.  The first time it is used is a short‐
              hand for this rule:

                 --filter='dir-merge /.rsync-filter'

              This tells rsync to look for per-directory .rsync-filter files that have been sprinkled through  the  hierarchy  and
              use their rules to filter the files in the transfer.  If -F is repeated, it is a shorthand for this rule:

                 --filter='exclude .rsync-filter'

              This filters out the .rsync-filter files themselves from the transfer.

Combined with the file ~/.rsync-filter, with the following contents:

- .adobe/Flash_Player/AssetCache/*
- .cache/*
- .devscripts_cache/*
- .googleearth/Cache/*
- .gradle/caches/*
- .gradle/daemon/*
- .gradle/native/*
- .gradle/wrapper/*
- .icedtea/cache/*
- .icedteaplugin/cache/*
- .java/deployment/cache/*
- .javafxcache/*
- .m2/repository/*
- .pan2/*-cache/*
- .thumbnails/*

This has several advantages over Borg's existing features for specifying exclusion paths:

  1. As a user on the system to be backed up, I have a single place to list directories for exclusion. This is more convenient than manually dropping CACHEDIR.TAG files throughout my home directory, because when I clear out a cache directory by removing it, I don't have to remember to recreate it and its CACHEDIR.TAG file.
  2. As an administrator on the system to be backed up, I can let users manage their own lists of exclusion patterns without administrative intervention.
  3. As an administrator, I don't have to repeat the mount point where the source files are mounted in each exclusion pattern. This is useful when I'm backing up a consistent snapshot of a filesystem, mounted at /mnt/bsnap/home, rather than /home directly.

ThomasWaldmann commented 8 years ago

#937 suggested supporting .gitignore, which is basically the same idea.

edgewood commented 8 years ago

It might be good to think about creating some plug points in the file system scanner code, such that users can roll their own "exclude this, too" code if the base exclusions aren't enough.

Either that, or implement reading the list of paths to process from a file, and let users implement whatever inclusion/exclusion strategy they want entirely out of process.

The downsides of the latter are the increased chance of stale data for users who don't use snapshots, and blowing the FS cache (reading some metadata to generate the file list, but then not reading the rest of the metadata and data in one pass).
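A rough sketch of what such an out-of-process list generator could look like. The names list_paths and not_cache are made up for illustration, and how the resulting list would be handed to borg is left open here, since that depends on the file-list feature itself.

```python
import os
from typing import Callable, Iterator


def list_paths(root: str, wanted: Callable[[str], bool]) -> Iterator[str]:
    """Yield every file under root for which wanted(path) is true.

    Directories rejected by wanted() are pruned, so nothing below them
    is visited at all.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk does not descend into rejected directories.
        dirnames[:] = sorted(
            d for d in dirnames if wanted(os.path.join(dirpath, d))
        )
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            if wanted(path):
                yield path


def not_cache(path: str) -> bool:
    # Hypothetical user policy: skip anything named ".cache".
    return os.path.basename(path) != ".cache"
```

Calling list_paths("/home/user", not_cache) walks the tree once and yields only the files the policy accepts. The stale-data downside mentioned above still applies: the list is only as fresh as the moment it was generated.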

edgewood commented 8 years ago

--files-from is #841

guedressel commented 7 years ago

Is there currently a way to specify paths in an --exclude-from file to be relative to that file's path?

ThomasWaldmann commented 7 years ago

@guedressel don't think so.

ostrokach commented 7 years ago

Would this be difficult to add? Any pointers on how to get started?

I am also in a situation where I want to back up the home folders of all the users on our storage cluster, and want the users to be in control of what files get backed up.

ThomasWaldmann commented 7 years ago

@ostrokach maybe have a look at the "--pattern" changes in master branch. check if it already does what you need. if not, it maybe could be added. this stuff is new, so it is easier to change than already released functionality.

TiGR commented 6 years ago

As the .gitignore issue was merged here, I suppose I can comment here. Generally, you can achieve .gitignore parsing with the following:

  1. Scan included directories for .gitignore files
  2. Create a pattern set for each .gitignore, specifying P sh and R dirname, where dirname is the directory in which the .gitignore was found.
  3. Parse .gitignore, for each line:
    • Trim it
    • Skip if empty or starts with #
    • If a line starts with ! replace it with +, otherwise prepend line with -
  4. Done

Now include that pattern set into the runtime config. You should be fine.
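The steps above could be sketched in Python roughly as follows. The function names are made up for illustration, and the sketch only performs the textual rewrite from the list; it does not try to reproduce git's matching semantics (for example, git lets the last matching rule win, while borg uses the first matching pattern), so the generated patterns should be reviewed before use.

```python
import os
from typing import Iterator, List


def find_gitignores(root: str) -> Iterator[str]:
    """Step 1: yield the path of every .gitignore below root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if ".gitignore" in filenames:
            yield os.path.join(dirpath, ".gitignore")


def gitignore_to_patterns(gitignore_path: str) -> List[str]:
    """Steps 2 and 3: turn one .gitignore into borg pattern lines."""
    dirname = os.path.dirname(os.path.abspath(gitignore_path))
    # "P sh" selects shell-style patterns, "R" names the root directory.
    lines = ["P sh", "R " + dirname]
    with open(gitignore_path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()  # trim
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            if line.startswith("!"):
                lines.append("+ " + line[1:])  # negation becomes an include
            else:
                lines.append("- " + line)  # everything else is an exclude
    return lines
```

Writing the lines returned for each .gitignore to a file gives something that can be passed to borg via --patterns-from.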

I've made something similar in bash script that runs before borg and generates exclude list. But I use --exclude-from, not patterns-from, as borgmatic that I use to configure my backups does not support it.

Enerccio commented 6 years ago

There really should be a way to specify the filtering via .gitignore ...

piegamesde commented 5 years ago

How about this: we add two options, --exclusion-marker-file and --exclusion-generator-script. Say I set --exclusion-marker-file=.gitignore and --exclusion-generator-script="bash gitignore.sh". Now every time a folder containing a .gitignore is found, borg will call bash gitignore.sh /path/to/that/.gitignore. The called script will print an exclusion file for that folder to stdout.

witten commented 5 years ago

I've made something similar in bash script that runs before borg and generates exclude list. But I use --exclude-from, not patterns-from, as borgmatic that I use to configure my backups does not support it.

Note that borgmatic does support specifying patterns_from now.

guedressel commented 5 years ago

By the way: I use pathspec for a small script of mine. Works fine! https://pypi.org/project/pathspec/

piegamesde commented 5 years ago

Regarding my suggestion above: It would change how tagged files work. Instead of saying "hey, I'm tagged, exclude me", dir_is_tagged should somehow generate a list of patterns to be included. This should be done by calling a script (or maybe a python function that may call a script), to allow for maximum configuration. Either this list of patterns is added to the PatternMatcher, or we make the pattern matching stack-based. The latter could improve performance, but would require more work to implement it.

This change is backward-compatible because the old marker files become marker files that ignore everything in that directory. (It would even make the option to keep the marker files themselves obsolete). But this feature could be implemented completely independent of the other as well.

ThomasWaldmann commented 5 years ago

I'd rather not call a script. borg often runs as root and calling external scripts can be a security issue.

piegamesde commented 5 years ago

Maybe "calling a script" is a bad way to phrase it. What I mean is the possibility of calling an external command that does the job, like with BORG_PASSCOMMAND. The command takes a path as an argument or via stdin and generates a list of exclusions, either to stdout (preferred) or to an external file.

ThomasWaldmann commented 5 years ago

Maybe I misunderstood your suggestion. Calling one specific, admin-configured script is usually not a problem (as the admin is responsible for having safe permissions on it), but if we discovered such scripts on the fs like we do with the exclude tags, that might easily become a security issue.

piegamesde commented 5 years ago

Say I set --exclusion-marker-file=.gitignore and --exclusion-marker-command=my-gitignore-to-excludes. Now if Borg encounters any file called .gitignore, it will call my-gitignore-to-excludes /path/to/gitignore/that/was/found/.gitignore, which in turn may print something like

/path/to/gitignore/that/was/found/bin/
/path/to/gitignore/that/was/found/build/
/path/to/gitignore/that/was/found/*.class

to standard output, and those paths will then be ignored.

(Note that everything here is just an idea and I'm absolutely open on the details of the implementation)
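To make the idea more concrete, here is a minimal Python sketch of the behaviour, written as a standalone preprocessor rather than as borg code. Both option names above are only proposals and do not exist in borg; collect_exclusions is a made-up name. Only the single command given by the caller is ever executed, never anything discovered in the tree.

```python
import os
import subprocess
from typing import List, Sequence


def collect_exclusions(root: str, marker: str, command: Sequence[str]) -> List[str]:
    """Run `command <marker path>` for each marker file under root.

    Every non-empty line the command prints to stdout is collected as
    one exclusion.
    """
    exclusions: List[str] = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if marker not in filenames:
            continue
        marker_path = os.path.join(dirpath, marker)
        # The command is a fixed argument list chosen by the caller;
        # no shell is involved and the marker file itself is never executed.
        result = subprocess.run(
            [*command, marker_path],
            check=True,
            capture_output=True,
            text=True,
        )
        exclusions.extend(
            line for line in result.stdout.splitlines() if line.strip()
        )
    return exclusions
```

With the example above, collect_exclusions("/home/user", ".gitignore", ["my-gitignore-to-excludes"]) would return the printed paths, which could then be written to a file for --exclude-from.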

piegamesde commented 5 years ago

I am going to abandon this feature for now. Since I do not have the brain power to process borg's core backup code yet to add a new feature, I will hack together a solution using a preprocessor that generates a custom exclusion file by walking the file tree before calling borg.

If someone wants to implement this feature, I will be happy to help as much as I can.

piegamesde commented 5 years ago

If anyone's interested: I've written a small script to exclude gitignored files:

#!/bin/bash

# Arguments: a path to check for
# Output: all ignored files and folders in all git repositories in the input folder as borg ignore pattern

# Use an absolute path: git -C changes directory, so relative paths from tree would not resolve.
root=$(realpath "$1")

# Iterate through all directories that contain a .git folder.
# Warning: This will result in invalid patterns if the folder is not a valid git repository (grep for "fatal" to find them)
find "$root" -name ".git" -print0 | while IFS= read -r -d '' gitdir
do
        p=$(dirname "$gitdir")
        # Keep the last folder in mind to skip redundant subfolder exclusions
        LASTFOLDER="$p/.foldernamethatwonteverexist/"
        # List all files of the current repository and ask git if they are ignored
        tree -f -i -x --noreport "$p" | git -C "$p" check-ignore --stdin | while IFS= read -r q
        do
                # Skip paths below the last skipped folder; print the final result to stdout
                if [[ $q == "$LASTFOLDER"* ]]; then
                        continue
                elif [[ -d $q ]]; then
                        # Trailing slash so that "build" does not also match "build2"
                        LASTFOLDER="$q/"
                        echo "pp:$q/"
                else
                        echo "pf:$q"
                fi
        done
done

Given a path as argument, it will recursively search for git projects in it. It will then list all files in those git projects and keep only the ones that are gitignored. The remaining paths are turned into a borg exclude file written to stdout.

On my Documents folder, it takes only three seconds to run, which is acceptable for me.

biocrypto730 commented 5 years ago

Why would I want to back up my dependencies in my git repos? It's pretty gross that this doesn't work. Does nobody from borg use javascript / have a node_modules folder?

koernchen02 commented 4 years ago

Sorry to bump this issue, but I came here by searching for something like "borg ignore directories by name", and I have to handle lots of directories with the same (unique) name (like the node_modules mentioned above) that I want to exclude from my backup. I tried a little, and it wasn't too complicated to exclude all of them recursively. For testing purposes I created a small directory structure as follows:

tree backup_test/

backup_test/
├── node_modules
│   └── test_in_node_modules.txt
├── subdir
│   ├── node_modules
│   │   └── test_in_node_modules.txt
│   ├── subsubdir
│   │   ├── node_modules
│   │   │   └── test_in_node_modules.txt
│   │   └── subsub_test.txt
│   └── sub_test.txt
└── test.txt

By using the create command with the following exclude option I was able to exclude all node_modules directories: borg create --exclude 'sh:**/node_modules' borgrepo::1 backup_test

Backup lists as follows:

borg list borgrepo::1

drwxr-xr-x user   users         0 Sun, 2019-10-20 22:27:24 backup_test
drwxr-xr-x user   users         0 Sun, 2019-10-20 22:27:32 backup_test/subdir
drwxr-xr-x user   users         0 Sun, 2019-10-20 22:27:46 backup_test/subdir/subsubdir
-rw-r--r-- user   users         0 Sun, 2019-10-20 22:27:46 backup_test/subdir/subsubdir/subsub_test.txt
-rw-r--r-- user   users         0 Sun, 2019-10-20 22:27:32 backup_test/subdir/sub_test.txt
-rw-r--r-- user   users         0 Sun, 2019-10-20 22:27:24 backup_test/test.txt

I'm not sure if I'm missing something, but I think for my primitive use case that seems to be enough. Best regards and thanks to everyone involved in borg's development, it's such an awesome tool and I'm loving it so far! :heart:

borg 1.1.10
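For anyone wondering why sh:**/node_modules catches the directory at every depth, here is a small Python approximation of that kind of matching. It is a simplified illustration, not borg's actual implementation, and sh_glob_to_regex is a made-up name: "**/" is treated as "zero or more directory levels", "*" and "?" never cross a "/", and a match on a directory also covers everything below it.

```python
import re


def sh_glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a small subset of shell-style globbing into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append(r"(?:[^/]+/)*")  # zero or more directory levels
            i += 3
        elif pattern[i] == "*":
            out.append(r"[^/]*")  # anything except a path separator
            i += 1
        elif pattern[i] == "?":
            out.append(r"[^/]")  # one character, not a separator
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    # Match the path itself or anything below it.
    return re.compile("".join(out) + r"(?:/.*)?\Z")


rx = sh_glob_to_regex("**/node_modules")
paths = [
    "backup_test/node_modules/test_in_node_modules.txt",
    "backup_test/subdir/node_modules",
    "backup_test/subdir/sub_test.txt",
    "backup_test/test.txt",
]
# Paths that do not match the exclude pattern are the ones kept.
kept = [p for p in paths if not rx.match(p)]
```

In this approximation only the two paths outside any node_modules directory end up in kept, which is consistent with the listing above.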

ngotchac commented 4 years ago

I came up with a few lines of bash in my backup script that does just that, if anyone is interested :

find /home/user/Workspace -type f -name ".gitignore" -printf "%h\n" | \
    xargs -I '{}' bash -c "egrep -v '^(\s*|#.*)$' \"{}/.gitignore\" | awk '{print \"{}/\" \$0}' " \
    > /tmp/exclude-backup

borg create [...] \
    /home/user \
    --exclude-from /tmp/exclude-backup \

This creates a file at /tmp/exclude-backup containing the concatenated contents of all the .gitignore files, each line prefixed with its directory, and uses it as an exclude list for borg.

argv-minus-one commented 4 years ago

@biocrypto730: It isn't safe to exclude all folders named node_modules, nor to exclude everything matched by .gitignore, because some of them need to be backed up anyway.

  1. Anything you install using npm install --global is placed in a global node_modules folder, by default /usr/local/lib/node_modules (POSIX) or %AppData%\npm\lib\node_modules (Windows).

  2. If you install a Node application from a single archive file, the archive will probably contain a node_modules folder pre-populated with all of the app's dependencies. (Electron apps usually bundle all of that into a .asar file instead, but that only exists in Electron, not vanilla Node.)

  3. If you use Visual Studio Code, Code itself does not have a node_modules folder, but each extension you install does have one. Some extensions also, stupidly, contain a .gitignore file, including Red Hat's XML extension.

  4. If you've “installed” a package by checking out its source tree and running it directly from there (as opposed to installing it with npm install --global, make install, or the like), then it will contain build artifacts (node_modules for Node packages, executables for C/C++ packages, and so on) that need to be preserved. This is not common with Node packages (everyone uses npm or yarn nowadays), but C/C++ packages are sometimes used this way.

If Borg has to be specifically told to honor version control ignore files, and the documentation specifically warns not to use that option if you semi-install things as described in item 4, then it's safe to do that. But it's not safe as a default behavior.

It should always be safe to read exclude paths from files within the backup source, if the file is named something like .backupignore or .borgignore. But just because something is a build artifact and/or excluded from version control doesn't mean it should be excluded from backup.