Closed ThomasWaldmann closed 2 years ago
Yeah, please implement it the classical way, as you suggested. I hate the rather strange ::
thing, and I always confuse the order of them...
Thanks for the feedback. As a help for you until this is implemented: "A::B" is usually used to say "B" is in scope of "A", so A is always the "container / namespace". In Python, one would say "A.B". Of course, one always begins with the toplevel "container / namespace".
i would suggest borg --repo REPO --archive ARCHIVE command ...
with support for getting both variables from the env
@RonnyPfannschmidt sure, makes sense to move this to global options so we do not duplicate it in every command description.
@ThomasWaldmann it would also lend itself to formulate a click command group
Shouldn't required parameters be positional arguments instead of --options?
In the context of "borg create", not sure how well it would work for everything else:
That would simplify usage to:
borg create $repo $paths
# only if you need --prefix functionality
borg create --prefix=laptop__ $repo $paths
# only if you *really* want to specify the entire thing yourself:
borg create --archive="laptop__{now:%Y-%m-%d_%H:%M:%S}"
I like the added flexibility of pattern matching for --archive-match (but don't need it myself). I do use BORG_REPO, and therefore would not like positional parameters. I thought the usage of :: was a neat way to solve the positional parameter problem and also not requiring --option indicators. Why is the parser to split up repo and archive name so complex? Isn't it just splitting at :: ??
@pepa65 see yourself: https://github.com/borgbackup/borg/blob/master/borg/helpers.py#L712
That doesn't look too bad to me! And keeping this is good for backwards compatibility.
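For reference, the simple case being asked about (split once at `::`, fall back to `BORG_REPO` when the repo part is empty) could be sketched roughly like this. It is only an illustration, not the code in borg/helpers.py: `split_location` and its `env` parameter are hypothetical names, and the real parser additionally has to handle ssh URLs, paths and placeholders.

```python
import os


def split_location(text, env=None):
    """Split 'REPO::ARCHIVE' into (repo, archive).

    Hypothetical helper for illustration only. An empty repo part
    (e.g. '::myarchive') falls back to the BORG_REPO environment variable.
    """
    env = os.environ if env is None else env
    repo, sep, archive = text.partition("::")
    if not repo:
        repo = env.get("BORG_REPO", "")
    if not repo:
        raise ValueError("no repository given and BORG_REPO is not set")
    # No '::' at all means the location names only the repository.
    return repo, (archive if sep else None)
```

With this sketch, `split_location("::arch", env={"BORG_REPO": "/env/repo"})` gives the repo from the environment and the archive name from the argument.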
How about this:
name
property could be not present by default and could optionally be set by giving an option on the command line, like borg create --set name:mymachine-home-TIMESTAMP REPO /home
--set tags:tag1,tag2,tag3
which would be split at the commas
--name=...
and --tags=...
.
Tried keeping repository as a positional arg and adding --name option for the archive name. #6766
Due to the argparse limitation (see "order matters" in the docs), this leads to strange command lines like:
borg create --name=myarchive /my/repo /home /etc # parses, but feels strange
borg create /my/repo /home /etc --name=myarchive # parses, but feels strange
This reads best, but does not work:
borg create /my/repo --name=myarchive /home /etc # does not parse
To solve, we could really consider --repo
as an option:
borg create --repo=/my/repo --name=myarchive /home /etc
In fact, the repository can be optional on the command line if BORG_REPO=/my/repo
is given via the environment. Having it as an option would also not require the ::
hack to put something into the positional argument's place if the real value should be taken from the env.
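A minimal argparse sketch of that idea, assuming a toy parser rather than borg's real one: the repository is an option whose default is taken from `BORG_REPO`, so nothing has to stand in for it among the positional arguments. `build_parser` and the `prog` name are made up for this example.

```python
import argparse
import os


def build_parser(env=None):
    """Build a toy parser; this is not borg's real argument parser."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(prog="borg-create-sketch")
    # The repository is an option whose default comes from the environment.
    parser.add_argument("-r", "--repo", default=env.get("BORG_REPO"))
    parser.add_argument("--name", default=None)
    parser.add_argument("paths", nargs="*")
    return parser


# Repository from the environment, only the archive name and paths are given.
args = build_parser(env={"BORG_REPO": "/env/repo"}).parse_args(
    ["--name=myarchive", "/home", "/etc"]
)
```

Giving `--repo=/my/repo` on the command line overrides the value from the environment.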
I had a look how restic
does this:
-r
or --repo
or env var gives the repository
restic -r REPO extract 1234abcd
Current state of this in PR #6766:
borg --repo=MYREPO init --encryption=none
borg --repo=MYREPO list
borg --repo=MYREPO create # borg will make up a name from hostname and timestamp
borg --repo=MYREPO create --name=MYARCHIVE
borg --repo=MYREPO create --name=MYARCHIVE2
borg --repo=MYREPO list --name=MYARCHIVE
borg --repo=MYREPO diff --name=MYARCHIVE --name2=MYARCHIVE2
borg --repo=MYREPO delete --name=MYARCHIVE
borg --repo=MYREPO delete
borg -r MYREPO ... # short alias for --repo
export BORG_REPO=MYREPO
# same commands as above, but one can leave away the --repo=MYREPO
Hmm, guess i don't really like these --name
and --name2
options.
borg create
has a somehow sane default for the archive name, so it does not really require giving a name. But I think this is a minor thing and only addresses the simplest use cases, we also could just require the archive name as a positional argument there.
OTOH, most other commands working with archives require one or two archive names, so they could be positional args also, like borg --repo=REPO diff archive1 archive2
.
But, there are some commands where not giving the archive name switches the command to another mode, e.g. borg list
can either list the repo (giving archives) or list an archive (giving files), depending on whether the archive name is given or not.
Shall we just make separate commands for these modes? Like borg check-repo
? Or subcommands, like borg check repo
? borg check has 3 modes btw, repo only, archives only and everything.
ideas:
(no -a given) -> match all archives (same as borg 1.2 behaviour)
-a specific_arch_name -> match only this archive (use this instead of positional param)
-a 'prefix*' -> match all archives matching the glob
-a '*' -> match all archives (this form is required by `borg delete` to not delete all by default)
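The matching rules in this list could be sketched like so. This uses Python's `fnmatch` as a stand-in for whatever glob implementation borg ends up using, and `match_archives` is a hypothetical helper, not an existing borg function.

```python
from fnmatch import fnmatchcase


def match_archives(names, pattern=None):
    """Return the archive names selected by an -a style glob.

    pattern=None stands for "-a not given" and is treated like '*'.
    """
    if pattern is None:
        pattern = "*"
    return [name for name in names if fnmatchcase(name, pattern)]
```

A specific archive name is just a glob without wildcards, so the same function covers the "match only this archive" case.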
all commands below given without -r REPO (assume BORG_REPO=... is in the environment) for brevity.
borg create ARCH [p1 p2 ...]
borg rcreate # (was: borg init)
# note: renamed command to complement rdelete
borg list ARCH
borg rlist # (was: borg list REPO)
# note: new command cleans up / simplifies the argparser / help
borg info [-a ARCH_GLOB]
borg rinfo # (was: borg info REPO)
# note: new command cleans up / simplifies the argparser / help
borg delete [-a ARCH_GLOB] # or rather "destroy" as opposite of create?
borg rdelete # (was: borg delete REPO)
# note: new command cleans up / simplifies the argparser / help
borg recreate [-a ARCH_GLOB] [p1 p2 ...]
borg mount [-a ARCH_GLOB] mntpoint [p1 p2] # (always gives mntpoint/ARCH/..., except for versions view)
borg extract ARCH [p1 p2 ...]
borg check [--repository-only] [--archives-only] [-a ARCH_GLOB]
borg diff ARCH1 ARCH2 [p1 p2 ...]
borg rename OLD NEW
borg prune
borg compact
@RonnyPfannschmidt @elho @textshell @enkore @rumpelsepp @pepa65 any comments?
@m3nu @sophie-h ^ that will be cleaner / more systematic/regular than what we have now, but also means some changes needed in vorta / pika.
Shall we just make separate commands for these modes? Like
borg check-repo
? Or subcommands, like borg check repo
? borg check has 3 modes btw, repo only, archives only and everything.
Personally I don't like subcommands, and I prefer the simplest user experience out of a CLI. I can see that different modes for the same command could be confusing, but it is intuitive and easy to remember. Otherwise you just get more errors (using borg check-repo ARCHIVE
needs to return an error, while both borg check REPO
and borg check ARCHIVE
just work).
At Vorta we already keep archive name and repo separate in most places. So not a very large change. But it will need some conditions to support older and newer versions simultaneously.
Also wanted to point out that Borgmatic already uses the syntax suggested here. E.g.
usage: borgmatic extract [--repository REPOSITORY] --archive ARCHIVE ...
What you arrived at in https://github.com/borgbackup/borg/issues/948#issuecomment-1159750725 seems like the best suggestion to me so far, because if we break the CLI in a way that requires every consumer to touch basically all commands, we might as well use that for more than just removing a "::".
I like the destroy
/ delete
, archives
/ list
and stats
/ info
split in particular. Destroy/delete is perfect, archives/list is very clear as well. Info/stats is less clear.
recreate
has always been a very bad name, this is a good opportunity to replace it. Maybe filter-archives
or something like that. It's also likely one of the worst commands in the CLI because it can and will do very different things depending on options, of which it has many (largely inherited from create
, and some of its own), and which also interact in complex ways.
Keeping check
as one is probably okay, this is a rarely used command and most of the time both a "check-repository" and "check-archives" (or similar) would be used one after the other anyway.
Yeah, guess we keep check in one piece. Ideally, it checks both repo and archives and only does partial checks on special request (using the options, as now).
stats: did not come up yet with a better name.
Also, I just noticed: if borg info
requires the -a ARCH_GLOB
option to work on one/some/all archives, what if the -a ...
is not given? Is then maybe the global repo stats desired or do we list per-archive stats for all individual archives?
That comes back to defining what a missing -a ...
shall mean: "all archives" or "no archives" or "repo"...
For borg info
without -a
I would prefer all archives AND repo (or at least all archives, rather than having to specify each one).
The options -a/--glob-archives
, --first
, --last
, --sort-by
, --consider-checkpoints
are usually handled by Archives.list_considering(args)
.
First, match -a/--glob-archives
, then --consider-checkpoints
, then --sort-by
(default: sort by timestamp), then apply --first/--last
filters.
The default for -a
is None and the code makes *
from that. Thus, not giving -a
means matching ALL archives.
All other mentioned options further reduce the amount of matched/selected archives. Only exception is --consider-checkpoints
which by default reduces the selected archives by omitting all checkpoint archives.
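To make the order of these steps concrete, here is a rough sketch. It is not the actual `Archives.list_considering` code: `ArchiveInfo` and `select_archives` are hypothetical, and recognising checkpoint archives by a `.checkpoint` name suffix is an assumption made for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class ArchiveInfo:
    name: str
    ts: int  # creation time as a sortable number, for illustration


def select_archives(archives, glob=None, consider_checkpoints=False,
                    sort_by="ts", first=0, last=0):
    """Apply the filters in the order described above (illustrative only)."""
    # 1. glob match; no glob given means '*', i.e. all archives
    selected = [a for a in archives if fnmatchcase(a.name, glob or "*")]
    # 2. drop checkpoint archives unless explicitly requested
    if not consider_checkpoints:
        # Assumption: checkpoint archives are recognisable by this suffix.
        selected = [a for a in selected if not a.name.endswith(".checkpoint")]
    # 3. sort (default: by timestamp)
    selected.sort(key=lambda a: getattr(a, sort_by))
    # 4. first/last filters
    if first:
        selected = selected[:first]
    elif last:
        selected = selected[-last:]
    return selected
```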
if we extend borg delete
with the -a
option, the default of "match ALL" (if the option is not given) will result in the interesting behaviour of deleting all archives by default.
but, if we look at the borg 1.2 behaviour, not giving the ::archive
meant "delete the whole repo".
borg asks in such a case whether the user really wants to do that.
Various implementations of rm(1) and some shells will ask if you are sure about doing stupid things like rm -rf /
or rm *
, so borg delete [eol]
going "Buddy, if you really wanna hollow this repo out, you'll have to say it the long way with -a*
" is totally reasonable.
updated https://github.com/borgbackup/borg/issues/948#issuecomment-1159750725 .
borg archives
could be also borg rlist
.
I used rdelete
(repo delete) and rinfo
(repo info) already.
About borg delete
(no options given):
if args.glob_archives is None and args.first == 0 and args.last == 0:
    self.print_error("Aborting: if you really want to delete all archives, please use -a '*' "
                     "or just delete the whole repository (might be much faster).")
    return EXIT_ERROR
updated https://github.com/borgbackup/borg/issues/948#issuecomment-1159750725 .
borg archives
--> borg rlist
borg init
--> borg rcreate
Idea from Juerd on IRC:
My take on archive metadata:
Like tags?
I've recently looked how restic handles this. their archives (called snapshots there) do not have a name, just a hash.
They automatically save hostname, user, timestamp and source paths into metadata (and they also support tags).
Found that an interesting approach, but with some issues:
--paths-from-(stdin|command)
.
Like tags?
No. One clear identifier that tells you at which set of archives you usually would apply your purge. If you just use random tags it again opens the opportunity for confusion in configs. In a GUI you don't have one clear identifier that you can generate and expose to the user. Pika has a feature to set up backup configurations based on existing archives in the repo, but you can't guess what should be used for purge
.
I think it should be one defined identifier that replaces the current use of prefixes.
OK, so it is a groupid, sequenceid, datasetid, ... (just searching for a good name).
BTW, there is another place where such an id would be useful: to identify a specific (partial) files cache (in that case, datasetid would make sense, because the files cache depends on the specific set of input data).
For me, names always have been more an annoyance than a feature because they are usually redundant.
Have to agree with @sophie-h here. When looking at a random list of archives in Vorta, it basically just shows the date:
They automatically save hostname, user, timestamp and source paths into metadata (and they also support tags).
This sounds sensible. Duration and change size (or similar) could be regarded as metadata too. Allowing just one tag would keep it simple.
I think it should be one defined identifier that replaces the current use of prefixes.
Need not be one, as people have different workflows. Some use hostname (with prefixes currently), others just the time. So I think this is worth considering:
2022-06-18-094046
Playing with possible commands:
borg create --tag=scheduled
borg prune --keep-daily=7 --hostname=srv1
borg prune --keep-last=3 --tag=scheduled
borg prune --keep-last=5 --user=joe
Need not be one, as people have different workflows. Some use hostname (with prefixes currently), others just the time. So I think this is worth considering:
Just to be clear: I don't want to remove the other filter features from prune
. I just want there to be a default way for the most typical use case that's upfront in Vorta, Borg, etc.
When pruning with a hostname/username/tag based subset of all archives, there is some risk that it matches more than one sequence of that host/user/tag (a similar issue to forgetting to give the correct --prefix), leading to unwanted deletion of the wrong archives.
We could change borg create NAME
to borg create DATASETID
.
The generated archive name would then be f"{DATASETID}-{now}"
- so it is unique and gives a similar user experience to what users are used to from borg 1.x. borg would write the data set id also to archive.metadata['datasetid'] (or even to the manifest entries) so it is directly available for pruning. For partial files cache loading, borg create would just load f"files.{DATASETID}"
instead of the global contains-everything files
cache.
The user would be required to define distinct datasetids for each different way they invoke borg create
.
borg prune --prefix X
would then become borg prune DATASETID
.
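As a small sketch of the naming scheme proposed above: the archive name and the per-dataset files cache name are both derived from the data set id. The helper names and the exact timestamp format are assumptions for illustration, not something borg defines.

```python
from datetime import datetime


def archive_name(datasetid, now):
    """Archive name derived from the data set id plus a timestamp.

    The timestamp format used here is an assumption for the example.
    """
    return f"{datasetid}-{now:%Y-%m-%dT%H:%M:%S}"


def files_cache_name(datasetid):
    """Name of the per-dataset (partial) files cache."""
    return f"files.{datasetid}"
```

Two different `borg create` invocations would need two different data set ids, otherwise they would share both the archive sequence and the files cache.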
Better name than datasetid?
We could change borg create NAME to borg create DATASETID. Better name than datasetid?
So the benefit would be to always use the same datasetid/name and get the date appended automatically?
$ borg create srv1.example.com
instead of
$ borg create "srv1.example.com-{now}"
Pretty small benefit at the cost of explaining a new term and making it harder to understand. And the same behavior is already possible with placeholders in the archive name. I even imagine people would want to customize the timestamp to be appended or turn it off. So even more options and complexity.
Given all that, I find the current behavior preferable. Or anything I missed?
Thinking further: Let's say the current archive name becomes a dataset ID or archive group. Then users would need to refer to an individual archive by some hash (which Borg already generates) or look up f"{DATASETID}-{now}"
, rather than the archive name they gave?
This is similar to how Restic and Kopia do things, except that they use a shorter ID. Also similar to Git commit IDs.
So the real question is: Should the user give the unique identifier when creating an archive or something else? (like the dataset/archive group). Using a generated identifier may be cleaner than something user-provided, like borg create "Blah blah xyzü"
. If we decide to always generate the identifier, I'd prefer a short hash to f"{DATASETID}-{now}"
.
Here is an example for illustration and brainstorming:
Create, list, extract, delete
$ borg create /var/www # single path, no archive group set
$ borg create --group var-lib /var/lib # set archive group
$ borg create --comment "before updating openssl" /var/lib/openssl # pass comment to archive
$ borg list
| ID | Date | Host | User | Group | Paths | Comment |
|----------|---------------------|------|------|---------|------------------|-----------------|
| 40dc1520 | 2015-05-08 21:38:30 | srv1 | root | var-lib | /var/lib | |
| bdbd3439 | 2015-05-08 21:40:19 | srv1 | root | | /var/www, /root | |
| 9f0bc19e | 2015-05-08 21:42:19 | srv1 | root | | /var/lib/openssl | before updating |
$ borg extract 40dc1520 var/lib/foo
$ borg delete 40dc1520
Prune
$ borg prune -v --list --dry-run --keep-daily=7 # applies to all archives
$ borg prune --keep-daily=7 --group=var-lib # prune within one archive group
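Pruning within one archive group could work roughly as sketched below. This is a simplified keep-daily rule (keep the newest archive of each of the most recent N days that have archives), not borg's prune implementation; `prune_keep_daily` and the tuple layout are hypothetical.

```python
from datetime import datetime


def prune_keep_daily(archives, keep_daily, group=None):
    """Return (keep, delete) lists of archive ids.

    archives: iterable of (archive_id, datetime, group) tuples.
    Only archives in `group` are considered when a group is given;
    everything outside the group is left untouched.
    """
    candidates = [a for a in archives if group is None or a[2] == group]
    candidates.sort(key=lambda a: a[1], reverse=True)  # newest first
    keep, seen_days = [], set()
    for archive_id, ts, _group in candidates:
        day = ts.date()
        if day not in seen_days and len(seen_days) < keep_daily:
            seen_days.add(day)
            keep.append(archive_id)  # newest archive of that day
    delete = [a[0] for a in candidates if a[0] not in keep]
    return keep, delete
```

Archives outside the given group appear in neither list, which is the point of pruning by an explicit group instead of a name pattern.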
Summarizing suggested changes, if the dataset-ID suggestion moves forward:
Benefits over current way of doing things:
{hostname}-{now}
is used.
Many years ago there was the idea of tags where iirc there were two proposals, one just plain tags, and the other being essentially key-value pairs. This sounds like a specialization of the latter, where Borg defines the available keys and values (Host
, User
, Group
, Paths
, Comment
).
The main advantage of defining this metadata through Borg, instead of creative archive names (which became more powerful over time with archive name globbing and so on), is that frontends should have a much easier time working with this.
I don't think it meaningfully improves or detracts from the backup UX of people using Borg directly, because before Borg was conceptually very simple ("A repository is a bunch of tars in a box"), and with this Borg gains the conceptual complexities of traditional backup tools (rsnapshot, bacula etc.) where there's datasets, groups, schedules and so on. To me it seems to be net-zero in this area.
This would also mean Borg becoming more narrow in purpose and usage, and more specialized to "the typical backup workflow" (as defined here) - which is good for those using it that way, and not so good if not. I've used Borg for archiving purposes (and continue to do so), where it is a decent solution because there still is no portable, checksumming FS. (In fact I still have repositories formatted with the Borg patch I made ages ago that allows hierarchical archives - I can tell you from long-term usage that the concept works very well).
Agree that changing prefixes to groups/datasets doesn’t improve the experience for those used to building complex prefixes. It may make it easier to get started for new users and those without much need for archive names.
Adding the “free” metadata, like hostname, user and paths (in addition to date) is a smaller change and may enable new features later. This also doesn’t interfere with other uses.
Using some internal ID as primary key needs more consideration. Just suggesting it here.
If we want to keep archive names and prefixes as they are, here is a minimal non-breaking change, which would enable richer UIs:
Let’s see what @ThomasWaldmann and @sophie-h think. This is all building on their suggestions.
hostname: iirc, we already store that into metadata, i just see some formatting issue when trying to output that into a table (short names no problem, but for uniqueness we rather want the fqdn and that tends to be rather long). also there is the problem that uniqueness is not guaranteed here (not at all for the short name and in the worst case not even for the fqdn).
paths: same table formatting issue. works nice with a few paths (as shown above), ugly with many paths and impossible when feeding individual paths (as I pointed out above).
the main reason (and a definite advantage) for a datasetid (archive group id) is to have a value that can be used without pattern matching and also to remove a dangerous usability trap we have in current versions:
For very simple use cases, users could always give datasetid == "all" or "mymachine" and it would behave the same as now.
About hex ids vs archive names:
@enkore do we have an issue here about that idea / patch?
repositories formatted with the Borg patch I made ages ago that allows hierarchical archives
Hmm, guess i don't really like these
--name
and --name2
options.
Yeah, no matter whether it's --name
or --archive
, it is quite obnoxious having to always type it, when manually messing around with archives.
Shall we just make separate commands for these modes? Like
borg check-repo
? Or subcommands, like borg check repo
? borg check has 3 modes btw, repo only, archives only and everything.
Generally, making things consistent is good, but making things more complicated and counter-intuitive just for that reason makes no sense, IMNSHO.
90% of the time one invokes borg info
manually, it is to - after a couple seconds that feel long enough - see the repo stats to get to know the total size or drool over how much compression and deduplication save you. :wink: 10% may be (still feels way too high from my personal experience) to look at the stats of a given archive, to e.g. see how much bigger the latest one is compared to some earlier, or sth.
In a script parsing --json
output all the stats of all archives can well be of interest, too, but I strongly doubt that any interactive user who just types borg info
- whether an old user used to that or some new user who never used pre-2.0, but only vaguely remembers there was some command along the lines of info - and then finds himself waiting for many minutes, only to have the equivalent of borg info ::archive
dumped to his terminal for hundreds of archives, would agree that an implied -a '*'
was a sane default for this specific command.
I also strongly doubt that many would disagree that doing rinfo
instead is cumbersome.
Similar with list, borg list
is used a lot to just see the archives that are there, inspecting the contents happens less often, but when it does, all a single command (with split out --repository REPO
option that is hardly ever used, because export BORG_REPO
once is so much more convenient) involves is pressing cursor up to get the borg list
one did to see the archives back from shell history and then copy&paste one of the archive names after it, done, easier than ever not having to type the double-colon.
In case of borg delete
I'd just also have that spit out help, allow people to give '*' if that's their rare use-case and have a --delete-repo
option.
We could change
borg create NAME
to borg create DATASETID
.
The generated archive name would then be
f"{DATASETID}-{now}"
- so it is unique and gives a similar user experience to what users are used to from borg 1.x.
This is not at all similar or desirable (or usable, I would personally argue) for anyone who did not name his archives "something-{now}". Even when ending the archive name with a timestamp, a different format could be desired.
Also, the timestamp when a given instance of borg was run could not be of lesser relevance for the archive name for me; the timestamp of when the underlying filesystem snapshot was made is what provides meaningful information about when the data in that archive is from, across all repos that same set of data is backed up to, or later migrated to.
Similarly, I still want to be able put the intended hostname into the archive name, in times of moving/replacing systems, where {hostname}
or anything put into separate meta-data may still be hostname-new.
Many years ago there was the idea of tags where iirc there were two proposals, one just plain tags, and the other being essentially key-value pairs. This sounds like a specialization of the latter, where Borg defines the available keys and values (
Host
, User
, Group
, Paths
, Comment
).
The latter is a specialization of the former, which still allows anyone to use tags like hostname:foo
or dataset:homedirs
without preventing others to just use homedirs
if they desire and find that more practical with their workflow.
Calling the third a specialisation is a stretch; the idea connoted by "tags" is that they are user-defined, and the whole idea is to address the problem that a fixed scheme one person came up with does not necessarily apply to the needs of another.
The main advantage of defining this metadata through Borg, instead of creative archive names (which became more powerful over time with archive name globbing and so on), is that frontends should have a much easier time working with this.
Having hostname, user, time etc. as borg sees it in the meta-data for someone to use is what we have and what should not be taken away when adding tags, but the request we saw here was to add a special meta-data item for one frontend, that no-one else may use. And that is where tags shine: that frontend could just set an org.fancyborgfrontent.backup-group-id:foo
tag during all its creates and only ever prune with the according --tag
parameter.
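A quick sketch of how free-form tags can cover both styles: a tag is just a string, and a `key:value` reading is only a convention on top of it. `parse_tag` and `has_tag` are hypothetical helpers, not part of borg.

```python
def parse_tag(tag):
    """Split 'key:value' at the first colon; a plain tag has value None."""
    key, sep, value = tag.partition(":")
    return key, (value if sep else None)


def has_tag(archive_tags, wanted):
    """True if the archive carries exactly the wanted tag string."""
    return wanted in archive_tags
```

So `hostname:foo`, `dataset:homedirs` and a bare `homedirs` can all live side by side, and a frontend can filter on its own namespaced tag without borg knowing about it.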
(My backup script happily uses zfs's user properties when creating snapshots as an added safety-guard to only delete those after borg is done backing up).
Better name than datasetid?
Maybe "series" / "series name"? (although "data set" does seem a nice alternative name for the "series" concept, below)
I've been using Borg mostly via a Bash script (soon to be rewritten in Python and made public). One of my main motivations was to conveniently handle what I called "archive series" within repositories, using a configuration file where I specify
I can then run commands like
anbackup create REPO_NAME SERIES_NAME [SERIES_NAME ...]
In repository REPO_NAME, create a backup archive for each specified series.
Borg archive creation options must be set in the configuration file.
and
anbackup list REPO_NAME [SERIES_NAME] [BORG_OPTIONS] [::ARCHIVE_NAME]
If no ARCHIVE_NAME is specified, list all archives in repository REPO_NAME,
or only those from series SERIES_NAME (SERIES_NAME as 2nd argument is the
same as Borg option -P SERIES_NAME- ).
If an ARCHIVE_NAME is specified, list the contents of that archive
(in this case, a SERIES_NAME argument will be ignored).
From anbackup help concepts
:
Concepts and conventions:
Data is stored in Borg *repositories* (local or remote).
Each repository holds multiple *archives* (individual backups).
Borg performs deduplication across archives in each repository
(there is no deduplication between repositories).
Generically, each Borg archive may have an arbitrary name and contain an
arbitrary collection of files. Our choice in this script is to handle backup
archives as structured into *archive series*, with the following conventions:
- archive series names: each archive series has a *series name*
(allowed character set: 'a'-'z', '0'-'9', '_', with the first character
restricted to 'a'-'z');
- archive names: each archive is named concatenating its series name
with a compact ISO 8601 representation of its (approximate)
creation UTC datetime, separated by '-', e.g. "homedirs-20161002T153554Z";
- archive content: an archive series should be a coherent sequence of
backup archives – typically, all archives in a set should refer to
the same group of directory/file trees.
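The naming conventions from that help text are easy to express in a few lines. This is a sketch written from the description above, not code taken from the anbackup script; `series_archive_name` and `SERIES_NAME_RE` are made-up names.

```python
import re
from datetime import datetime, timezone

# Allowed series names: 'a'-'z', '0'-'9', '_', first character 'a'-'z'.
SERIES_NAME_RE = re.compile(r"[a-z][a-z0-9_]*")


def series_archive_name(series, when):
    """Build '<series>-<compact ISO 8601 UTC datetime>'.

    `when` should be a timezone-aware datetime; it is converted to UTC.
    """
    if not SERIES_NAME_RE.fullmatch(series):
        raise ValueError(f"invalid series name: {series!r}")
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"{series}-{stamp}"
```

Because the series name cannot contain `-`, everything before the first `-` in an archive name identifies the series unambiguously.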
The parser to split up repo and archive name into all needed parts is rather complex.
Also, some commands (prune) have a separate
--prefix
argument, which is kind of archivename*
.
The repo part can also come from BORG_REPO env var.
Native windows support (see "windows branch") might even make it more complex, due to different matching patterns needed for it.
So, if we refactor this (which is a major cli api change, thus the 2.0 milestone), it could look like:
ARCHIVE_PATTERN would support glob patterns on the archive name.
Additionally to
--archive-match
, we could support a --index [from:to]
option that just results into that part of the match result list.
Support getting REPO and ARCHIVE from the environment.
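One way the `--index [from:to]` idea could behave, assuming Python slice semantics (zero-based, end exclusive, negative values counting from the end). Those semantics are an assumption, the issue text does not define them, and `parse_index` is a hypothetical helper.

```python
def parse_index(spec):
    """Turn 'from:to' into a slice; either side may be empty."""
    start, sep, stop = spec.partition(":")
    if not sep:
        raise ValueError("expected FROM:TO")
    return slice(int(start) if start else None, int(stop) if stop else None)


# Applied to the (already sorted) list of archives that matched the pattern.
matches = ["arch-1", "arch-2", "arch-3", "arch-4"]
newest = matches[parse_index("-1:")]
```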