Open andrewchambers opened 3 years ago
Is it possible for you to provide an example script first? That way people can tweak it to their requirements easily. Then consider supporting it in the base system at a later date.
A custom retention policy is the only thing stopping me from using bupstash, as I take frequent (at most hourly) backups and want to keep them at a decaying frequency.
I know that my requirements may differ from the next person's, so it could be complex to cover everyone with a built-in. Example scripts would get us all going and let us refine requirements easily. Thanks!
I totally agree, definitely the direction I will try to go. I think I can provide Python, JavaScript, and shell script examples in the future.
@andrewchambers @joshbmarshall I've written a python script for bucket-based retention.
Input: the environment variable `RETENTION_BUCKETS`, with space-separated `timedelta:count` pairs. Example: `1H:24 1D:14 1W:8 4W:24 52W:100`. Due to a limitation of Python's `timedelta`, the longest period that can be specified is weeks. The format is `<n>W<n>D<n>H<n>M`, in that order, e.g. `1W1D` or `1H30M`.
The script reads `bupstash list` output on stdin and prints the ids of snapshots that should be removed on stdout (a log of retention decisions goes to stderr).
Full example:
```
bupstash list | RETENTION_BUCKETS="1H:24 1D:14 1W:8 4W:24 52W:100" bupstash-retention | bupstash rm --ids-from-stdin
```
Disclaimer: Be careful, might delete your backups ;-)
```python
#!/usr/bin/env python3
# vim: ft=python
from collections import defaultdict
from datetime import datetime, timedelta
from fileinput import input
from os import environ
from re import finditer, match
from sys import stderr

# key="value" pairs as printed by `bupstash list`
re_s = r' ?(?P<key>[^=]*)="(?P<value>[^"]*)"'
# duration format <n>W<n>D<n>H<n>M, each part optional
re_d = r'((?P<weeks>\d*)W)?((?P<days>\d*)D)?((?P<hours>\d*)H)?((?P<minutes>\d*)M)?'


def parse_snapshots(lines):
    return [defaultdict(list, {m.group("key"): m.group("value")
                               for m in finditer(re_s, line)}) for line in lines]


def parse_delta(s):
    m = match(re_d, s)
    return timedelta(weeks=int(m.group("weeks") or 0),
                     days=int(m.group("days") or 0),
                     hours=int(m.group("hours") or 0),
                     minutes=int(m.group("minutes") or 0))


def parse_timestamp(s):
    return datetime.strptime(s, "%Y/%m/%d %H:%M:%S")


# newest snapshot first
snapshots = parse_snapshots(input())[::-1]
buckets = [(name, parse_delta(name), int(count))
           for name, count in (b.split(":")
                               for b in environ["RETENTION_BUCKETS"].split(" "))]

for bucket_name, min_delta, bucket_count in buckets:
    retained_count = 0
    last_retained = None
    for snapshot in snapshots:
        if retained_count >= bucket_count:
            break
        timestamp = parse_timestamp(snapshot["timestamp"])
        # keep this snapshot if it is at least min_delta older than the last kept one
        if last_retained is None or (last_retained - timestamp) > min_delta:
            snapshot["retained"].append(bucket_name)
            last_retained = timestamp
            retained_count += 1

for sn in snapshots[::-1]:
    print(f'id="{sn["id"]}" timestamp="{sn["timestamp"]}" retained={sn["retained"]}',
          file=stderr)
    if not sn["retained"]:
        print(sn["id"])
```
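For reference, a minimal sketch of the input the script expects on stdin: `bupstash list` prints space-separated `key="value"` pairs per snapshot, of which only `id` and `timestamp` are used here (the values below are made up):

```python
from collections import defaultdict
from re import finditer

# same pattern as in the script above
re_s = r' ?(?P<key>[^=]*)="(?P<value>[^"]*)"'

line = 'id="a1b2c3" name="backup.tar" timestamp="2021/06/12 04:25:23"'
# defaultdict(list) so that snapshot["retained"] starts as an empty list
snapshot = defaultdict(list, {m.group("key"): m.group("value")
                              for m in finditer(re_s, line)})
print(snapshot["id"], snapshot["timestamp"])
```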
There's also a flag, `--ids-from-stdin` (as in `| bupstash rm --ids-from-stdin`), which might be slightly faster for many ids, but your way works fine too. Thanks for posting this.
@andrewchambers thanks for the hint, I updated the script above. Feel free to use it in a FAQ or similar. Thanks for your work!
A word of warning: the algorithm here has an issue. It keeps dropping snapshots that are good "dailies" because they aren't >=24h apart. It should instead normalize the timestamps according to buckets: e.g. for a daily bucket, out of `["2021/06/12 04:25:23", "2021/06/12 02:32:07"]` it should select the latest, `2021/06/12 04:25:23`, for day `2021/06/12`, regardless of the time delta to the snapshot selected for `2021/06/13`, etc.
I will rewrite it at some point and update my script, but if you use this, expect it to drop more snapshots than necessary for now.
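The normalization idea can be sketched like this (a minimal illustration, not the fixed script, using the timestamps from the example above):

```python
from datetime import datetime, timezone

DAY = 24 * 60 * 60  # bucket period in seconds


def bucket_key(ts_str, period=DAY):
    # Round the UTC timestamp down to the start of its bucket period.
    dt = datetime.strptime(ts_str, "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() // period)


# Both snapshots fall on 2021/06/12, so they share a daily bucket;
# the later one (04:25:23) would be the one retained for that bucket.
a = bucket_key("2021/06/12 04:25:23")
b = bucket_key("2021/06/12 02:32:07")
c = bucket_key("2021/06/13 01:00:00")
print(a == b, a == c)  # same day -> same bucket, next day -> different bucket
```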
Thanks for pointing it out, any updates are much appreciated. I definitely will use this or something inspired by this once I clear out my backlog.
This version rounds the timestamps to the given buckets, and takes the most recent per bucket.
N.B.: This is based on the bucket periods since UNIX epoch. In particular, the "1W" bucket starts at 1.1.1970, which was a Thursday. Therefore, the most recent snapshot from a Wednesday is the one that is used for the bucket.
Likewise, as there is no way to specify a year other than e.g. "52W", the bucket breakpoints will obviously not fall anywhere close to calendar years.
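The epoch alignment mentioned above can be checked directly with the standard library:

```python
from datetime import datetime, timezone

WEEK = 7 * 24 * 60 * 60

# The UNIX epoch (1970-01-01 00:00 UTC) fell on a Thursday,
# so epoch-aligned "1W" bucket boundaries are always Thursdays.
epoch_day = datetime.fromtimestamp(0, timezone.utc).strftime("%A")

# Round an arbitrary timestamp down to its week bucket boundary.
ts = datetime(2021, 6, 12, tzinfo=timezone.utc).timestamp()
boundary = datetime.fromtimestamp(int(ts // WEEK) * WEEK, timezone.utc)
print(epoch_day, boundary.strftime("%A"))
```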
I didn't test the new version much, but it's my retention script now; if you don't hear back from me in a few days, it works on my machine :-)
Here is a simulated test run. The bupstash-testdata script generates snapshots randomly every 2h on average, because that happens to be my schedule. The retention rules can be seen in the first line. Looks fine to me.
Cheers
```python
#!/usr/bin/env python3
# vim: ft=python
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from fileinput import input
from os import environ
from re import finditer, match
from sys import stderr

# key="value" pairs as printed by `bupstash list`
re_s = r' ?(?P<key>[^=]*)="(?P<value>[^"]*)"'
# duration format <n>W<n>D<n>H<n>M, each part optional
re_d = r'((?P<weeks>\d*)W)?((?P<days>\d*)D)?((?P<hours>\d*)H)?((?P<minutes>\d*)M)?'


def parse_snapshots(lines):
    return [defaultdict(list, {m.group("key"): m.group("value")
                               for m in finditer(re_s, line)}) for line in lines]


def parse_delta(s):
    m = match(re_d, s)
    return timedelta(weeks=int(m.group("weeks") or 0),
                     days=int(m.group("days") or 0),
                     hours=int(m.group("hours") or 0),
                     minutes=int(m.group("minutes") or 0)).total_seconds()


def parse_timestamp(s):
    dt = datetime.strptime(s, "%Y/%m/%d %H:%M:%S")
    return dt.replace(tzinfo=timezone.utc).timestamp()


# newest snapshot first
snapshots = parse_snapshots(input())[::-1]
buckets = [(name, parse_delta(name), int(count))
           for name, count in (b.split(":")
                               for b in environ["RETENTION_BUCKETS"].split(" "))]

for bucket_name, min_delta, bucket_count in buckets:
    retained_count = 0
    last_retained = None
    for snapshot in snapshots:
        if retained_count >= bucket_count:
            break
        # round the timestamp down to its epoch-aligned bucket period
        rounded = int(parse_timestamp(snapshot["timestamp"]) / min_delta)
        last_rounded = None if last_retained is None else \
            int(parse_timestamp(last_retained["timestamp"]) / min_delta)
        # keep the most recent snapshot of each bucket period
        if last_rounded is None or rounded < last_rounded:
            snapshot["retained"].append(bucket_name)
            last_retained = snapshot
            retained_count += 1

for sn in snapshots[::-1]:
    print(f'id="{sn["id"]}" timestamp="{sn["timestamp"]}" retained={sn["retained"]}',
          file=stderr)
    if not sn["retained"]:
        print(sn["id"])
```
Many people like to cycle backups in configurable and complex ways; we should either support this directly, or provide example scripts and a guide showing how to do it easily.