borgbackup / borg

Deduplicating archiver with compression and authenticated encryption.
https://www.borgbackup.org/

split repo and archive name into separate args? #948

Closed ThomasWaldmann closed 2 years ago

ThomasWaldmann commented 8 years ago

The parser to split up repo and archive name into all needed parts is rather complex.

Also, some commands (prune) have a separate --prefix argument, which is kind of archivename*.

The repo part can also come from BORG_REPO env var.

Native windows support (see "windows branch") might even make it more complex, due to different matching patterns needed for it.

So, if we refactor this (which is a major CLI API change, thus the 2.0 milestone), it could look like:

borg --repo REPO --archive ARCHIVE command
borg --repo REPO --archive-match ARCHIVE_PATTERN command

ARCHIVE_PATTERN would support glob patterns on the archive name.

In addition to --archive-match, we could support an --index [from:to] option that selects just that slice of the match result list.

Support getting REPO and ARCHIVE from the environment.
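
A minimal sketch of how `--archive-match` and `--index [from:to]` could combine. The function name `select_archives` and the choice of Python slice semantics for `from:to` are assumptions made for illustration; the proposal above does not define them, and `fnmatch` merely stands in for whatever glob matching borg would use.

```python
from fnmatch import fnmatchcase


def select_archives(names, pattern="*", index=None):
    """Return archive names matching a glob, optionally sliced by 'from:to'.

    Hypothetical helper: `index` is assumed to be a 'from:to' string where
    either side may be empty or negative, as in a Python slice.
    """
    matched = [name for name in names if fnmatchcase(name, pattern)]
    if index is None:
        return matched
    start_text, _, stop_text = index.partition(":")
    start = int(start_text) if start_text else None
    stop = int(stop_text) if stop_text else None
    return matched[start:stop]


archives = [
    "laptop-2016-01-01",
    "laptop-2016-01-02",
    "laptop-2016-01-03",
    "server-2016-01-01",
]
# The two most recent laptop archives (the list is assumed to be sorted by time).
print(select_archives(archives, "laptop-*", "-2:"))
```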

rumpelsepp commented 8 years ago

Yeah, please implement it the classical way, as you suggested. I hate the rather strange :: thing, and I always confuse the order of them...

ThomasWaldmann commented 8 years ago

Thanks for the feedback. As a help for you until this is implemented: "A::B" is usually used to say "B" is in scope of "A", so A is always the "container / namespace". In Python, one would say "A.B". Of course, one always begins with the toplevel "container / namespace".

RonnyPfannschmidt commented 8 years ago

i would suggest borg --repo REPO --archive ARCHIVE command ...

with support for getting both variables from the env

ThomasWaldmann commented 8 years ago

@RonnyPfannschmidt sure, makes sense to move this to global options so we do not duplicate it in every command description.

RonnyPfannschmidt commented 8 years ago

@ThomasWaldmann it would also lend itself to formulate a click command group

fxkr commented 8 years ago

Shouldn't required parameters be positional arguments instead of --options?

In the context of "borg create", not sure how well it would work for everything else:

That would simplify usage to:

borg create $repo $paths

# only if you need --prefix functionality
borg create --prefix=laptop__ $repo $paths

# only if you *really* want to specify the entire thing yourself:
borg create --archive="laptop__{now:%Y-%m-%d_%H:%M:%S}"
pepa65 commented 8 years ago

I like the added flexibility of pattern matching for --archive-match (but don't need it myself). I do use BORG_REPO, and therefore would not like positional parameters. I thought the usage of :: was a neat way to solve the positional parameter problem and also not requiring --option indicators. Why is the parser to split up repo and archive name so complex? Isn't it just splitting at :: ??

ThomasWaldmann commented 8 years ago

@pepa65 see yourself: https://github.com/borgbackup/borg/blob/master/borg/helpers.py#L712

pepa65 commented 8 years ago

That doesn't look too bad to me! And keeping this is good for backwards compatibility.
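
For comparison, the naive "just split at ::" idea can be sketched as below. This is not borg's parser: the linked code also has to recognise remote locations such as ssh:// URLs and scp-style host:path forms, which this sketch ignores. The function name `split_location` is hypothetical; only the BORG_REPO fallback comes from the discussion above.

```python
import os


def split_location(text, env=os.environ):
    """Split 'REPO::ARCHIVE' naively; fall back to BORG_REPO for '::ARCHIVE'."""
    repo, sep, archive = text.partition("::")
    if not repo:
        # '::archive' form: the repository comes from the environment.
        repo = env.get("BORG_REPO")
    if not repo:
        raise ValueError("no repository given and BORG_REPO is not set")
    return repo, (archive if sep and archive else None)


print(split_location("/my/repo::monday", env={}))
print(split_location("::monday", env={"BORG_REPO": "/my/repo"}))
```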

ThomasWaldmann commented 2 years ago

How about this:

ThomasWaldmann commented 2 years ago

Tried keeping repository as a positional arg and adding --name option for the archive name. #6766

Due to the argparse limitation (see "order matters" in the docs), this leads to strange command lines like:

borg create --name=myarchive /my/repo /home /etc  # parses, but feels strange
borg create /my/repo /home /etc --name=myarchive  # parses, but feels strange

This reads best, but does not work:

borg create /my/repo --name=myarchive /home /etc  # does not parse

To solve, we could really consider --repo as an option:

borg create --repo=/my/repo --name=myarchive /home /etc

In fact, the repository can be optional on the command line if BORG_REPO=/my/repo is given via the environment. Having it as an option would also not require the :: hack to put something into the positional argument's place if the real value should be taken from the env.
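
The "order matters" limitation can be explored with a small stand-alone parser. This is a sketch, not the parser from PR #6766: it assumes the paths are declared with `nargs="*"`, and it uses `parse_known_args` so that leftover arguments are reported instead of aborting. How the interleaved order is handled has varied between Python versions, so the sketch prints what the running interpreter does rather than claiming a fixed result.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(prog="borg-create-sketch")
    parser.add_argument("--name")
    parser.add_argument("repo")
    # Assumption: zero or more paths, like a create command that can also read paths elsewhere.
    parser.add_argument("paths", nargs="*")
    return parser


def try_parse(argv):
    """Return (repo, paths, leftover) without exiting on unparsed arguments."""
    namespace, leftover = build_parser().parse_known_args(argv)
    return namespace.repo, list(namespace.paths or []), leftover


for argv in (
    ["--name=myarchive", "/my/repo", "/home", "/etc"],
    ["/my/repo", "/home", "/etc", "--name=myarchive"],
    ["/my/repo", "--name=myarchive", "/home", "/etc"],  # the order that "reads best"
):
    print(argv, "->", try_parse(argv))
```

When the interleaved form is not supported, the two paths show up in the leftover list instead of in `paths`, which is what a strict `parse_args` call would reject.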

ThomasWaldmann commented 2 years ago

I had a look how restic does this:

ThomasWaldmann commented 2 years ago

Current state of this in PR #6766:

borg --repo=MYREPO init --encryption=none
borg --repo=MYREPO list
borg --repo=MYREPO create  # borg will make up a name from hostname and timestamp
borg --repo=MYREPO create --name=MYARCHIVE
borg --repo=MYREPO create --name=MYARCHIVE2
borg --repo=MYREPO list --name=MYARCHIVE
borg --repo=MYREPO diff --name=MYARCHIVE --name2=MYARCHIVE2
borg --repo=MYREPO delete --name=MYARCHIVE
borg --repo=MYREPO delete

borg -r MYREPO ...  # short alias for --repo

export BORG_REPO=MYREPO
# same commands as above, but one can leave away the --repo=MYREPO
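
A minimal argparse sketch of this layout: `--repo` / `-r` as a global option whose default comes from BORG_REPO, followed by a subcommand. It is an illustration under assumptions, not the code from the PR; only two subcommands are modelled and the environment is passed in as a plain mapping.

```python
import argparse


def build_parser(env):
    """Build a tiny CLI with a global --repo option defaulting to BORG_REPO."""
    parser = argparse.ArgumentParser(prog="borg-sketch")
    parser.add_argument("-r", "--repo", default=env.get("BORG_REPO"))
    commands = parser.add_subparsers(dest="command", required=True)
    for command in ("create", "list"):
        sub = commands.add_parser(command)
        sub.add_argument("--name")
    return parser


# Repository given explicitly; global options go before the subcommand.
print(build_parser({}).parse_args(["--repo=MYREPO", "create", "--name=MYARCHIVE"]))
# Repository taken from the environment mapping.
print(build_parser({"BORG_REPO": "MYREPO"}).parse_args(["list"]))
```
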
ThomasWaldmann commented 2 years ago

Hmm, guess i don't really like these --name and --name2 options.

borg create has a reasonably sane default for the archive name, so it does not really require giving a name. But I think this is a minor thing and only addresses the simplest use cases; we could also just require the archive name as a positional argument there.

OTOH, most other commands working with archives require one or two archive names, so they could be positional args also, like borg --repo=REPO diff archive1 archive2.

But, there are some commands where not giving the archive name switches the command to another mode, e.g. borg list can either list the repo (giving archives) or list an archive (giving files), depending on whether the archive name is given or not.

Shall we just make separate commands for these modes? Like borg check-repo? Or subcommands, like borg check repo? borg check has 3 modes btw, repo only, archives only and everything.

ThomasWaldmann commented 2 years ago

ideas:

all commands below given without -r REPO (assume BORG_REPO=... is in the environment) for brevity.

borg create ARCH [p1 p2 ...]
borg rcreate       # (was: borg init)
# note: renamed command to complement rdelete

borg list ARCH
borg rlist               # (was: borg list REPO)
# note: new command cleans up / simplifies the argparser / help

borg info [-a ARCH_GLOB]
borg rinfo  # (was: borg info REPO)
# note: new command cleans up / simplifies the argparser / help

borg delete [-a ARCH_GLOB] # or rather "destroy" as opposite of create?
borg rdelete  # (was: borg delete REPO)
# note: new command cleans up / simplifies the argparser / help

borg recreate [-a ARCH_GLOB] [p1 p2 ...]

borg mount [-a ARCH_GLOB] mntpoint [p1 p2]  # (always gives mntpoint/ARCH/..., except for versions view)

borg extract ARCH [p1 p2 ...]

borg check [--repository-only] [--archives-only] [-a ARCH_GLOB]

borg diff ARCH1 ARCH2 [p1 p2 ...]

borg rename OLD NEW

borg prune
borg compact
ThomasWaldmann commented 2 years ago

@RonnyPfannschmidt @elho @textshell @enkore @rumpelsepp @pepa65 any comments?

ThomasWaldmann commented 2 years ago

@m3nu @sophie-h ^ that will be cleaner / more systematic/regular than what we have now, but also means some changes needed in vorta / pika.

pepa65 commented 2 years ago

Shall we just make separate commands for these modes? Like borg check-repo? Or subcommands, like borg check repo? borg check has 3 modes btw, repo only, archives only and everything.

Personally I don't like subcommands, and I prefer the simplest user experience out of a CLI. I can see that different modes for the same command could be confusing, but it is intuitive and easy to remember. Otherwise you just get more errors (using borg check-repo ARCHIVE needs to return an error, while both borg check REPO and borg check ARCHIVE just work).

m3nu commented 2 years ago

At Vorta we already keep archive name and repo separate in most places. So not a very large change. But it will need some conditions to support older and newer versions simultaneously.

Also wanted to point out that Borgmatic already uses the syntax suggested here. E.g.

usage: borgmatic extract [--repository REPOSITORY] --archive ARCHIVE ...

enkore commented 2 years ago

What you arrived at in https://github.com/borgbackup/borg/issues/948#issuecomment-1159750725 seems like the best suggestion to me so far, because if we break the CLI in a way that requires every consumer to touch basically all commands, we might as well use that for more than just removing a "::".

I like the destroy / delete, archives / list and stats / info split in particular. Destroy/delete is perfect, archives/list is very clear as well. Info/stats is less clear.

recreate has always been a very bad name, this is a good opportunity to replace it. Maybe filter-archives or something like that. It's also likely one of the worst commands in the CLI because it can and will do very different things depending on options, of which it has many (largely inherited from create, and some of its own), and which also interact in complex ways.

Keeping check as one is probably okay, this is a rarely used command and most of the time both a "check-repository" and "check-archives" (or similar) would be used one after the other anyway.

ThomasWaldmann commented 2 years ago

Yeah, guess we keep check in one piece. Ideally, it checks both repo and archives and only does partial checks on special request (using the options, as now).

stats: did not come up yet with a better name.

Also, I just noticed: if borg info requires the -a ARCH_GLOB option to work on one/some/all archives, what if the -a ... is not given? Is then maybe the global repo stats desired or do we list per-archive stats for all individual archives?

That comes back to defining what a missing -a ... shall mean: "all archives" or "no archives" or "repo"...

pepa65 commented 2 years ago

For borg info without -a I would prefer all archives AND repo (or at least all archives, rather than having to specify each one).

ThomasWaldmann commented 2 years ago

current behaviour (borg 1.2)

The options -a/--glob-archives, --first, --last, --sort-by, --consider-checkpoints are usually handled by Archives.list_considering(args).

First, match -a/--glob-archives, then --consider-checkpoints, then --sort-by (default: sort by timestamp), then apply --first/--last filters.

The default for -a is None and the code makes * from that. Thus, not giving -a means matching ALL archives.

All other mentioned options further reduce the amount of matched/selected archives. Only exception is --consider-checkpoints which by default reduces the selected archives by omitting all checkpoint archives.
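
The order of operations described above can be sketched as follows. This is not the borg 1.2 implementation of `Archives.list_considering`: the `ArchiveInfo` class and the function name are hypothetical, `fnmatch` stands in for borg's own glob handling, and checkpoint archives are assumed to be recognisable by a `.checkpoint` name suffix.

```python
from dataclasses import dataclass
from datetime import datetime
from fnmatch import fnmatchcase


@dataclass
class ArchiveInfo:
    name: str
    ts: datetime


def list_considering_sketch(archives, glob=None, consider_checkpoints=False,
                            sort_by="ts", first=0, last=0):
    # 1. Glob match; a missing -a means "*", i.e. ALL archives.
    pattern = glob if glob is not None else "*"
    selected = [a for a in archives if fnmatchcase(a.name, pattern)]
    # 2. Checkpoint archives are omitted unless explicitly considered.
    if not consider_checkpoints:
        selected = [a for a in selected if not a.name.endswith(".checkpoint")]
    # 3. Sort (default: by timestamp).
    selected.sort(key=lambda a: getattr(a, sort_by))
    # 4. Apply the --first / --last filter.
    if first:
        selected = selected[:first]
    elif last:
        selected = selected[-last:]
    return selected


archives = [
    ArchiveInfo("host-2022-06-01", datetime(2022, 6, 1)),
    ArchiveInfo("host-2022-06-03", datetime(2022, 6, 3)),
    ArchiveInfo("host-2022-06-02.checkpoint", datetime(2022, 6, 2)),
    ArchiveInfo("other-2022-06-04", datetime(2022, 6, 4)),
]
print([a.name for a in list_considering_sketch(archives, glob="host-*", last=1)])
```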

ThomasWaldmann commented 2 years ago

if we extend borg delete with the -a option, the default of "match ALL" (if the option is not given) will result in the interesting behaviour of deleting all archives by default.

but, if we look at the borg 1.2 behaviour, not giving the ::archive meant "delete the whole repo".

borg asks in such a case whether the user really wants to do that.

enkore commented 2 years ago

Various implementations of rm(1) and some shells will ask if you are sure about doing stupid things like rm -rf / or rm *, so borg delete [eol] going "Buddy, if you really wanna hollow this repo out, you'll have to say it the long way with -a*" is totally reasonable.

ThomasWaldmann commented 2 years ago

updated https://github.com/borgbackup/borg/issues/948#issuecomment-1159750725 .

borg archives could be also borg rlist.

I used rdelete (repo delete) and rinfo (repo info) already.

ThomasWaldmann commented 2 years ago

About borg delete (no options given):

        if args.glob_archives is None and args.first == 0 and args.last == 0:
            self.print_error("Aborting: if you really want to delete all archives, please use -a '*' "
                             "or just delete the whole repository (might be much faster).")
            return EXIT_ERROR
ThomasWaldmann commented 2 years ago

updated https://github.com/borgbackup/borg/issues/948#issuecomment-1159750725 .

borg archives --> borg rlist

borg init --> borg rcreate

ThomasWaldmann commented 2 years ago

Idea from Juerd on IRC:

sophie-h commented 2 years ago

My take on archive metadata:

  1. For me, names always have been more an annoyance than a feature because they are usually redundant. I'm happy if I don't have to generate one anymore :laughing:. In a perfect world, I would drop the name altogether, but maybe it's a good way for finding archives if 1.x -> 2.0 migrations will be a thing (via tar or whatever)
  2. What's really important to me is something that you @ThomasWaldmann once called collections (?) iirc. Right now we are often relying on archive prefixes for prune. It's a bit scary. Prefixes are not really a super robust concept. I would really like to see something that is designated as the canonical replacement for prefixes such that GUIs etc. agree on what is the default indicator that something is in the group of archives on which the purge is done.
ThomasWaldmann commented 2 years ago

Like tags?

I've recently looked how restic handles this. their archives (called snapshots there) do not have a name, just a hash.

They automatically save hostname, user, timestamp and source paths into metadata (and they also support tags).

Found that an interesting approach, but with some issues:

sophie-h commented 2 years ago

Like tags?

No. One clear identifier that tells you at which set of archives you usually would apply your purge. If you just use random tags it again opens the opportunity for confusion in configs. In a GUI you don't have one clear identifier that you can generate and expose to the user. Pika has a feature to set up backup configurations based on existing archives in the repo, but you can't guess what should be used for purge.

I think it should be one defined identifier that replaces the current use of prefixes.

ThomasWaldmann commented 2 years ago

OK, so it is a groupid, sequenceid, datasetid, ... (just searching for a good name).

BTW, there is another place where such an id would be useful: to identify a specific (partial) files cache (in that case, datasetid would make sense, because the files cache depends on the specific set of input data).

m3nu commented 2 years ago

For me, names always have been more an annoyance than a feature because they are usually redundant.

Have to agree with @sophie-h here. When looking at a random list of archives in Vorta, it basically just shows the date:

They automatically save hostname, user, timestamp and source paths into metadata (and they also support tags).

This sounds sensible. Duration and change size (or similar) could be regarded as metadata too. Allowing just one tag would keep it simple.

I think it should be one defined identifier that replaces the current use of prefixes.

Need not be one, as people have different workflows. Some use hostname (with prefixes currently), others just the time. So I think this is worth considering:

Playing with possible commands:

borg create --tag=scheduled
borg prune --keep-daily=7 --hostname=srv1
borg prune --keep-last=3 --tag=scheduled
borg prune --keep-last=5 --user=joe
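
A rough sketch of what such metadata-filtered pruning could do internally: restrict the archives to those matching every given criterion, then keep the newest N of that subset. The dict layout, the field names and the function `prune_keep_last` are assumptions for illustration, not an existing borg API.

```python
from datetime import datetime


def prune_keep_last(archives, keep_last, **criteria):
    """Return the names that would be deleted: everything but the newest
    `keep_last` archives among those matching all metadata criteria."""
    subset = [a for a in archives
              if all(a.get(key) == value for key, value in criteria.items())]
    subset.sort(key=lambda a: a["ts"], reverse=True)  # newest first
    return [a["name"] for a in subset[keep_last:]]


archives = [
    {"name": "a1", "ts": datetime(2022, 6, 1), "hostname": "srv1", "user": "joe", "tag": "scheduled"},
    {"name": "a2", "ts": datetime(2022, 6, 2), "hostname": "srv1", "user": "joe", "tag": "manual"},
    {"name": "a3", "ts": datetime(2022, 6, 3), "hostname": "srv2", "user": "ann", "tag": "scheduled"},
    {"name": "a4", "ts": datetime(2022, 6, 4), "hostname": "srv1", "user": "joe", "tag": "scheduled"},
]
# Rough analogue of: borg prune --keep-last=1 --tag=scheduled
print(prune_keep_last(archives, 1, tag="scheduled"))
```
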
sophie-h commented 2 years ago

Need not be one, as people have different workflows. Some use hostname (with prefixes currently), others just the time. So I think this is worth considering:

Just to be clear: I don't want to remove the other filter features from prune. I just want that there is a default way for the most typical use case that's upfront in Vorta, Borg, etc

ThomasWaldmann commented 2 years ago

When pruning with a hostname/username/tag based subset of all archives, there is some risk that it matches more than one sequence of that host/user/tag (similar issue like forgetting to give the correct --prefix), leading to unwanted deletion of the wrong archives.

We could change borg create NAME to borg create DATASETID.

The generated archive name would then be f"{DATASETID}-{now}" - so it is unique and gives a similar user experience to what users are used to from borg 1.x. borg would write the data set id also to archive.metadata['datasetid'] (or even to the manifest entries) so it is directly available for pruning. For partial files cache loading, borg create would just load f"files.{DATASETID}" instead of the global contains-everything files cache.

The user would be required to define distinct datasetids for each different way they invoke borg create.

borg prune --prefix X would then become borg prune DATASETID.
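
A small sketch of the naming scheme and of why an exact dataset-id match is less error-prone than a prefix match. The timestamp format, the dict layout and all function names are assumptions for illustration; only `f"{DATASETID}-{now}"`, `metadata['datasetid']` and `f"files.{DATASETID}"` come from the proposal above.

```python
from datetime import datetime


def archive_name(datasetid, now):
    """Generated archive name: dataset id plus timestamp (format assumed)."""
    return f"{datasetid}-{now:%Y-%m-%dT%H:%M:%S}"


def files_cache_name(datasetid):
    """Name of the per-dataset files cache."""
    return f"files.{datasetid}"


def select_by_prefix(archives, prefix):
    return [a for a in archives if a["name"].startswith(prefix)]


def select_by_datasetid(archives, datasetid):
    return [a for a in archives if a["datasetid"] == datasetid]


now = datetime(2022, 6, 20, 12, 0, 0)
archives = [
    {"name": archive_name("srv1", now), "datasetid": "srv1"},
    {"name": archive_name("srv10", now), "datasetid": "srv10"},
]
# A slightly-too-short prefix matches both series; the dataset id matches one.
print(len(select_by_prefix(archives, "srv1")), len(select_by_datasetid(archives, "srv1")))
```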

Better name than datasetid?

m3nu commented 2 years ago

We could change borg create NAME to borg create DATASETID. Better name than datasetid?

So the benefit would be to always use the same datasetid/name and get the date appended automatically?

$ borg create srv1.example.com

instead of

$ borg create "srv1.example.com-{now}"

Pretty small benefit at the cost of explaining a new term and making it harder to understand. And the same behavior is already possible with placeholders in the archive name. I even imagine people would want to customize the timestamp to be appended or turn it off. So even more options and complexity.

Given all that, I find the current behavior preferable. Or anything I missed?

m3nu commented 2 years ago

Thinking further: Let's say the current archive name becomes a dataset ID or archive group. Then users would need to refer to an individual archive by some hash (which Borg already generates) or look up f"{DATASETID}-{now}", rather than the archive name they gave?

This is similar to how Restic and Kopia do things, except that they use a shorter ID. Also similar to Git commit IDs.

So the real question is: Should the user give the unique identifier when creating an archive or something else? (like the dataset/archive group). Using a generated identifier may be cleaner than something user-provided, like borg create "Blah blah xyzü". If we decide to always generate the identifier, I'd prefer a short hash to f"{DATASETID}-{now}".

Here is an example for illustration and brainstorming:

Create, list, extract, delete

$ borg create /var/www # single path, no archive group set
$ borg create --group var-lib /var/lib  # set archive group
$ borg create --comment "before updating openssl" /var/lib/openssl  # pass comment to archive
$ borg list
| ID       | Date                | Host | User | Group   | Paths            | Comment         |
|----------|---------------------|------|------|---------|------------------|-----------------|
| 40dc1520 | 2015-05-08 21:38:30 | srv1 | root | var-lib | /var/lib         |                 |
| bdbd3439 | 2015-05-08 21:40:19 | srv1 | root |         | /var/www, /root  |                 |
| 9f0bc19e | 2015-05-08 21:42:19 | srv1 | root |         | /var/lib/openssl | before updating |

$ borg extract 40dc1520 var/lib/foo
$ borg delete 40dc1520
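
Referring to archives by a short ID, as in the extract and delete commands above, could work like abbreviated Git commit IDs: accept any prefix that identifies exactly one archive. The sketch below assumes that behaviour; the function name and the made-up 64-character hex IDs are hypothetical.

```python
def resolve_archive_id(short_id, full_ids):
    """Return the single full ID starting with `short_id`, or raise."""
    wanted = short_id.lower()
    matches = [full for full in full_ids if full.startswith(wanted)]
    if not matches:
        raise KeyError(f"no archive id starts with {short_id!r}")
    if len(matches) > 1:
        raise ValueError(f"ambiguous archive id {short_id!r}: {len(matches)} matches")
    return matches[0]


# Made-up IDs for illustration only.
full_ids = [
    "40dc1520" + "a" * 56,
    "bdbd3439" + "b" * 56,
    "9f0bc19e" + "c" * 56,
]
print(resolve_archive_id("40dc1520", full_ids))
```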

Prune

$ borg prune -v --list --dry-run --keep-daily=7  # applies to all archives
$ borg prune --keep-daily=7 --group=var-lib  # prune within one archive group

Summarizing suggested changes, if the dataset-ID suggestion moves forward:

Benefits over current way of doing things:

enkore commented 2 years ago

Many years ago there was the idea of tags where iirc there were two proposals, one just plain tags, and the other being essentially key-value pairs. This sounds like a specialization of the latter, where Borg defines the available keys and values (Host, User, Group, Paths, Comment).

The main advantage of defining this metadata through Borg, instead of creative archive names (which became more powerful over time with archive name globbing and so on), is that frontends should have a much easier time working with this.

I don't think it meaningfully improves or detracts from the backup UX of people using Borg directly, because before Borg was conceptually very simple ("A repository is a bunch of tars in a box"), and with this Borg gains the conceptual complexities of traditional backup tools (rsnapshot, bacula etc.) where there's datasets, groups, schedules and so on. To me it seems to be net-zero in this area.

This would also mean Borg becoming more narrow in purpose and usage, and more specialized to "the typical backup workflow" (as defined here) - which is good for those using it that way, and not so good if not. I've used Borg for archiving purposes (and continue to do so), where it is a decent solution because there still is no portable, checksumming FS. (In fact I still have repositories formatted with the Borg patch I made ages ago that allows hierarchical archives - I can tell you from long-term usage that the concept works very well).

m3nu commented 2 years ago

Agree that changing prefixes to groups/datasets doesn’t improve the experience for those used to building complex prefixes. It may make it easier to get started for new users and those without much need for archive names.

Adding the “free” metadata, like hostname, user and paths (in addition to date) is a smaller change and may enable new features later. This also doesn’t interfere with other uses.

Using some internal ID as primary key needs more consideration. Just suggesting it here.

If we want to keep archive names and prefixes as they are, here is a minimal non-breaking change, which would enable richer UIs:

Let’s see what @ThomasWaldmann and @sophie-h think. This is all building on their suggestions.

ThomasWaldmann commented 2 years ago

hostname: iirc, we already store that into metadata, i just see some formatting issue when trying to output that into a table (short names no problem, but for uniqueness we rather want the fqdn and that tends to be rather long). also there is the problem that uniqueness is not guaranteed here (not at all for the short name and in the worst case not even for the fqdn).

paths: same table formatting issue. works nice with a few paths (as shown above), ugly with many paths and impossible when feeding individual paths (as I pointed out above).

the main reason (and a definite advantage) for a datasetid (archive group id) is to have a value that can be used without pattern matching and also to remove a dangerous usability trap we have in current versions:

For very simple use cases, users could always give datasetid == "all" or "mymachine" and it would behave the same as now.

About hex ids vs archive names:

ThomasWaldmann commented 2 years ago

@enkore do we have an issue here about that idea / patch?

repositories formatted with the Borg patch I made ages ago that allows hierarchical archives

elho commented 2 years ago

Hmm, guess i don't really like these --name and --name2 options.

Yeah, no matter whether it's --name or --archive, it is quite obnoxious having to always type it when manually messing around with archives.

Shall we just make separate commands for these modes? Like borg check-repo? Or subcommands, like borg check repo? borg check has 3 modes btw, repo only, archives only and everything.

Generally, making things consistent is good, but making things more complicated and counter-intuitive just for that reason makes no sense, IMNSHO. 90% of the time one invokes borg info manually, it is to - after a couple seconds that feel long enough - see the repo stats to get to know the total size or drool over how much compression and deduplication save you. :wink: 10% may be (still feels way too high from my personal experience) to look at the stats of a given archive, to e.g. see how much bigger the latest one is compared to some earlier one, or sth. In a script parsing --json output, all the stats of all archives can well be of interest, too, but I strongly doubt that any interactive user who just types borg info - whether an old user used to that or some new user who never used pre-2.0, but only vaguely remembers there was some command along the lines of info - and then finds himself waiting for many minutes, only to have the equivalent of borg info ::archive dumped to his terminal for hundreds of archives, would agree that an implied -a '*' was a sane default for this specific command. I also strongly doubt that many would disagree that doing rinfo instead is cumbersome.

Similar with list: borg list is used a lot to just see the archives that are there. Inspecting the contents happens less often, but when it does, all a single command (with the split-out --repository REPO option that is hardly ever used, because export BORG_REPO once is so much more convenient) involves is pressing cursor up to get the borg list one did to see the archives back from shell history and then copy&pasting one of the archive names after it. Done, easier than ever, not having to type the double-colon.

In case of borg delete I'd just also have that spit out help, allow people to give '*' if that's their rare use-case and have a --delete-repo option.

We could change borg create NAME to borg create DATASETID.

The generated archive name would then be f"{DATASETID}-{now}" - so it is unique and gives a similar user experience to what users are used to from borg 1.x.

This is not at all similar or desirable (or usable, I would personally argue) for anyone who did not name his archives "something-{now}". Even when ending the archive name with a timestamp, a different format could be desired. Also, the timestamp of when a given instance of borg was run could not be of less relevance for the archive name for me; the timestamp of when the underlying filesystem snapshot was made is what provides meaningful information about when the data in that archive is from, across all repos that same set of data is backed up to, or later migrated to. Similarly, I still want to be able to put the intended hostname into the archive name, in times of moving/replacing systems, where {hostname} or anything put into separate meta-data may still be hostname-new.

Many years ago there was the idea of tags where iirc there were two proposals, one just plain tags, and the other being essentially key-value pairs. This sounds like a specialization of the latter, where Borg defines the available keys and values (Host, User, Group, Paths, Comment).

The latter is a specialization of the former, which still allows anyone to use tags like hostname:foo or dataset:homedirs without preventing others from just using homedirs if they desire and find that more practical with their workflow. Calling the third a specialisation is a stretch: the idea connoted by "tags" is that they are user-defined; the whole idea is to address the problem that a fixed scheme one person came up with does not necessarily apply to the needs of another.

The main advantage of defining this metadata through Borg, instead of creative archive names (which became more powerful over time with archive name globbing and so on), is that frontends should have a much easier time working with this.

Having hostname, user, time etc. as borg sees it in the meta-data for someone to use is what we have and what should not be taken away when adding tags, but the request we saw here was to add a special meta-data item for one frontend, that no-one else may use. And that is where tags shine: that frontend could just set a org.fancyborgfrontent.backup-group-id:foo tag during all its creates and only ever prune with the according --tag parameter. (My backup script happily uses zfs's user properties when creating snapshots as an added safety-guard to only delete those after borg is done backing up).

jmce commented 2 years ago

Better name than datasetid?

Maybe "series" / "series name"? (although "data set" does seem a nice alternative name for the "series" concept, below)

I've been using Borg mostly via a Bash script (soon to be rewritten in Python and made public) — one of my main motivations was to conveniently handle what I called "archive series" within repositories, using a configuration file where I specify

I can then run commands like

  anbackup create REPO_NAME SERIES_NAME [SERIES_NAME ...]
    In repository REPO_NAME, create a backup archive for each specified series.
    Borg archive creation options must be set in the configuration file.

and

  anbackup list REPO_NAME [SERIES_NAME] [BORG_OPTIONS] [::ARCHIVE_NAME]
    If no ARCHIVE_NAME is specified, list all archives in repository REPO_NAME,
    or only those from series SERIES_NAME (SERIES_NAME as 2nd argument is the
    same as Borg option -P SERIES_NAME- ).
    If an ARCHIVE_NAME is specified, list the contents of that archive
    (in this case, a SERIES_NAME argument will be ignored).

From anbackup help concepts:

Concepts and conventions:

Data is stored in Borg *repositories* (local or remote).
Each repository holds multiple *archives* (individual backups).
Borg performs deduplication across archives in each repository
(there is no deduplication between repositories).

Generically, each Borg archive may have an arbitrary name and contain an
arbitrary collection of files.  Our choice in this script is to handle backup
archives as structured into *archive series*, with the following conventions:

- archive series names: each archive series has a *series name*
  (allowed character set: 'a'-'z', '0'-'9', '_', with the first character
  restricted to 'a'-'z');
- archive names: each archive is named concatenating its series name
  with a compact ISO 8601 representation of its (approximate)
  creation UTC datetime, separated by '-', e.g. "homedirs-20161002T153554Z";
- archive content: an archive series should be a coherent sequence of
  backup archives – typically, all archives in a set should refer to
  the same group of directory/file trees.
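
The naming conventions above translate fairly directly into code. The sketch below is not part of the anbackup script; the function names are hypothetical, and it only encodes the stated rules: series names match `[a-z][a-z0-9_]*`, and archive names are the series name, a '-', and a compact ISO 8601 UTC timestamp.

```python
import re
from datetime import datetime, timezone

SERIES_NAME = re.compile(r"[a-z][a-z0-9_]*")


def archive_name(series, created):
    """Build '<series>-<compact ISO 8601 UTC datetime>' for an aware datetime."""
    if not SERIES_NAME.fullmatch(series):
        raise ValueError(f"invalid series name: {series!r}")
    stamp = created.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{series}-{stamp}"


def series_of(name):
    """Recover the series name; safe because series names cannot contain '-'."""
    series, _, _stamp = name.rpartition("-")
    return series


name = archive_name("homedirs", datetime(2016, 10, 2, 15, 35, 54, tzinfo=timezone.utc))
print(name, series_of(name))
```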