andrewchambers / bupstash

Easy and efficient encrypted backups.
https://bupstash.io
MIT License

Custom backup retention policies #144

Open andrewchambers opened 3 years ago

andrewchambers commented 3 years ago

Many people like to cycle backups in configurable and complex ways. We should either support this directly or provide example scripts and a guide showing how to do it easily.

joshbmarshall commented 3 years ago

Is it possible for you to provide an example script first? That way people can tweak it to their requirements easily, and you could consider supporting it in the base system at a later date.

A custom retention policy is the only thing stopping me from using bupstash, as I take frequent (at most hourly) backups and want to keep them at a decaying frequency.

I know that my requirements may differ from the next person's, so it could be complex to cover everyone with a built-in. Example scripts would get us all going, and we could each adapt them to our own requirements easily. Thanks!

andrewchambers commented 3 years ago

I totally agree; that's definitely the direction I will try to go. I think I can provide Python, JavaScript, and shell script examples in the future.

pschyska commented 3 years ago

@andrewchambers @joshbmarshall I've written a python script for bucket-based retention.

input: bupstash list lines on stdin, i.e. key="value" pairs containing at least id and timestamp (formatted %Y/%m/%d %H:%M:%S)

output: the ids of all snapshots not claimed by any retention bucket, one per line, plus a per-snapshot report on stderr

Full example:

bupstash list | RETENTION_BUCKETS="1H:24 1D:14 1W:8 4W:24 52W:100" bupstash-retention | bupstash rm --ids-from-stdin

Disclaimer: Be careful, might delete your backups ;-)
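
For reference, each space-separated RETENTION_BUCKETS entry is <period>:<count>, where the period combines W/D/H/M suffixes; e.g. 1D:14 keeps up to 14 snapshots spaced at least a day apart. A standalone sketch of that parsing (using the same duration regex as the script below):

#!/usr/bin/env python3
# Standalone sketch: parse one RETENTION_BUCKETS entry such as "4W:24".
from datetime import timedelta
from re import match

re_d = r'((?P<weeks>\d*)W)?((?P<days>\d*)D)?((?P<hours>\d*)H)?((?P<minutes>\d*)M)?'

def parse_bucket(spec):
    period, count = spec.split(":")
    m = match(re_d, period)
    delta = timedelta(weeks=int(m.group("weeks") or 0),
                      days=int(m.group("days") or 0),
                      hours=int(m.group("hours") or 0),
                      minutes=int(m.group("minutes") or 0))
    return period, delta, int(count)

print(parse_bucket("4W:24"))  # ('4W', datetime.timedelta(days=28), 24)
print(parse_bucket("1H:24"))  # ('1H', datetime.timedelta(seconds=3600), 24)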

#!/usr/bin/env python3
# vim: ft=python
# Reads bupstash list output on stdin and prints the ids of all snapshots
# not claimed by any retention bucket, for bupstash rm --ids-from-stdin.
from collections import defaultdict
from datetime import datetime, timedelta
from fileinput import input
from os import environ
from re import finditer, match
from sys import stderr

# key="value" pairs as printed by bupstash list.
re_s = r' ?(?P<key>[^=]*)="(?P<value>[^"]*)"'
# Durations such as "1H", "1D", "4W", or combinations like "1D12H".
re_d = r'((?P<weeks>\d*)W)?((?P<days>\d*)D)?((?P<hours>\d*)H)?((?P<minutes>\d*)M)?'

def parse_snapshots(lines):
    return [defaultdict(list, {m.group("key"): m.group("value")
                               for m in finditer(re_s, line)}) for line in lines]

def parse_delta(s):
    m = match(re_d, s)

    return timedelta(weeks=int(m.group("weeks") or 0),
                     days=int(m.group("days") or 0),
                     hours=int(m.group("hours") or 0),
                     minutes=int(m.group("minutes") or 0))

def parse_timestamp(s):
    return datetime.strptime(s, "%Y/%m/%d %H:%M:%S")

# Reverse the listing so the newest snapshot comes first.
snapshots = parse_snapshots(input())[::-1]
# Each RETENTION_BUCKETS entry "<period>:<count>" becomes [name, delta, count].
buckets = [[name, parse_delta(name), int(count)] for name, count in
           (bucket.split(":") for bucket in environ["RETENTION_BUCKETS"].split(" "))]

for bucket_name, min_delta, bucket_count in buckets:
    retained_count = 0
    last_retained = None
    # Walk from newest to oldest, retaining a snapshot for this bucket only
    # if it is more than min_delta older than the last one retained.
    for snapshot in snapshots:
        if retained_count >= bucket_count:
            break

        timestamp = parse_timestamp(snapshot["timestamp"])
        if last_retained is None or (last_retained - timestamp) > min_delta:
            snapshot["retained"].append(bucket_name)
            last_retained = timestamp
            retained_count += 1

# Report every snapshot on stderr; print the ids to delete on stdout.
for sn in snapshots[::-1]:
    print(
        f'id="{sn["id"]}" timestamp="{sn["timestamp"]}" retained={sn["retained"]}',
        file=stderr)
    if not sn["retained"]:
        print(sn["id"])

andrewchambers commented 3 years ago

There's also a flag, --ids-from-stdin (i.e. | bupstash rm --ids-from-stdin), which might be slightly faster for many ids, but your way works fine too. Thanks for posting this.

pschyska commented 3 years ago

@andrewchambers thanks for the hint, I updated the script above. Feel free to use it in a FAQ or similar. Thanks for your work!

pschyska commented 3 years ago

A word of warning: the algorithm here has issues - it keeps dropping snapshots that would be good "dailies" because they aren't >= 24h apart. It should instead normalize the timestamps according to the buckets - e.g. for a daily bucket, ["2021/06/12 04:25:23", "2021/06/12 02:32:07"] should select the latest, 2021/06/12 04:25:23, for day 2021/06/12, regardless of the time delta to the snapshot selected for 2021/06/13, etc. I will rewrite it at some point and update my script, but if you use this, expect it to drop more snapshots than necessary for now.
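
To make that concrete, here is a minimal sketch of the rounding idea (not the full fix): timestamps that round down to the same period index compete for one slot, however close together they are:

#!/usr/bin/env python3
# Minimal sketch of bucket normalization: snapshots with the same period
# index share one daily bucket, regardless of the time delta between them.
from datetime import datetime, timezone

DAY = 24 * 60 * 60

def bucket_index(ts, period=DAY):
    dt = datetime.strptime(ts, "%Y/%m/%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return int(dt.timestamp() // period)

print(bucket_index("2021/06/12 04:25:23"))  # 18790
print(bucket_index("2021/06/12 02:32:07"))  # 18790 -> same daily bucket
print(bucket_index("2021/06/13 03:00:00"))  # 18791 -> next day's bucket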

andrewchambers commented 3 years ago

Thanks for pointing it out, any updates are much appreciated. I definitely will use this or something inspired by this once I clear out my backlog.

pschyska commented 3 years ago

This version rounds the timestamps down to the given bucket periods and keeps the most recent snapshot per period.

N.B.: this is based on bucket periods counted from the UNIX epoch. In particular, the "1W" bucket starts at 1970-01-01, which was a Thursday, so the most recent snapshot from a Wednesday is the one that is used for a weekly bucket.

Likewise, as there is no way to specify a year other than approximating it with e.g. "52W", the bucket breakpoints will obviously not fall anywhere near calendar year boundaries.
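
A quick illustration of that epoch alignment (a sketch assuming UTC timestamps, as in the script below):

#!/usr/bin/env python3
# Week buckets count whole weeks since the epoch, so they roll over at
# Wednesday/Thursday midnight UTC rather than at calendar week boundaries.
from datetime import datetime, timezone

WEEK = 7 * 24 * 60 * 60

def week_bucket(dt):
    return int(dt.replace(tzinfo=timezone.utc).timestamp() // WEEK)

print(week_bucket(datetime(2021, 6, 16)))  # 2684 (a Wednesday)
print(week_bucket(datetime(2021, 6, 17)))  # 2685 (Thursday starts a new bucket)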

I didn't test the new version much, but it's my retention script now, and if you don't hear back from me in a few days, it works on my machine :-)

Here is a simulated test run. The bupstash-testdata script generates snapshots randomly every 2h on average, because that happens to be my schedule. The retention rules can be seen in the first line. Looks fine to me.

Cheers

#!/usr/bin/env python3
# vim: ft=python
# Like the first version, but rounds timestamps down to whole bucket periods
# (counted from the UNIX epoch) and keeps the most recent snapshot per period.
from collections import defaultdict
from datetime import datetime, timedelta, timezone
from fileinput import input
from os import environ
from re import finditer, match
from sys import stderr

# key="value" pairs as printed by bupstash list.
re_s = r' ?(?P<key>[^=]*)="(?P<value>[^"]*)"'
# Durations such as "1H", "1D", "4W", or combinations like "1D12H".
re_d = r'((?P<weeks>\d*)W)?((?P<days>\d*)D)?((?P<hours>\d*)H)?((?P<minutes>\d*)M)?'

def parse_snapshots(lines):
    return [defaultdict(list, {m.group("key"): m.group("value")
                               for m in finditer(re_s, line)}) for line in lines]

def parse_delta(s):
    m = match(re_d, s)

    # Bucket period in seconds.
    return timedelta(weeks=int(m.group("weeks") or 0),
                     days=int(m.group("days") or 0),
                     hours=int(m.group("hours") or 0),
                     minutes=int(m.group("minutes") or 0)).total_seconds()

def parse_timestamp(s):
    # Seconds since the UNIX epoch; timestamps are interpreted as UTC.
    dt = datetime.strptime(s, "%Y/%m/%d %H:%M:%S")
    return dt.replace(tzinfo=timezone.utc).timestamp()

# Reverse the listing so the newest snapshot comes first.
snapshots = parse_snapshots(input())[::-1]
# Each RETENTION_BUCKETS entry "<period>:<count>" becomes [name, seconds, count].
buckets = [[name, parse_delta(name), int(count)] for name, count in
           (bucket.split(":") for bucket in environ["RETENTION_BUCKETS"].split(" "))]

for bucket_name, min_delta, bucket_count in buckets:
    retained_count = 0
    last_retained = None
    for snapshot in snapshots:
        if retained_count >= bucket_count:
            break

        # Index of the period this snapshot falls into; walking from newest to
        # oldest, the first snapshot seen in each period is the most recent one.
        rounded = int(parse_timestamp(snapshot["timestamp"]) / min_delta)
        last_rounded = None if last_retained is None else \
            int(parse_timestamp(last_retained["timestamp"]) / min_delta)

        if last_rounded is None or rounded < last_rounded:
            snapshot["retained"].append(bucket_name)
            last_retained = snapshot
            retained_count += 1

# Report every snapshot on stderr; print the ids to delete on stdout.
for sn in snapshots[::-1]:
    print(
        f'id="{sn["id"]}" timestamp="{sn["timestamp"]}" retained={sn["retained"]}',
        file=stderr)
    if not sn["retained"]:
        print(sn["id"])