SensorsIot / IOTstack

Docker stack for getting started on IOT on the Raspberry PI
GNU General Public License v3.0

Add differential/incremental backup script with service splitting #640

Open mdop opened 1 year ago

mdop commented 1 year ago

The current backup script always creates full dumps that are temporarily copied, taking the stack down for a potentially long time and using a lot of disk space. This backup script aims to reduce stack downtime for most services while also reducing disk space requirements. It does so by letting the user choose incremental/differential backups instead of full dumps, and by writing separate backup files for individual services, which allows the other services to be brought back up in the meantime. The disk overhead is only one differential backup. Different usage scenarios for automated backups are described in the backup script file.

The script was tested in a mock-up folder structure and on a live system; further review and validation are highly welcome to ensure data integrity for users!

Paraphraser commented 1 year ago

Perhaps look at IOTstackBackup. Still "full" backups but it doesn't need the stack to be taken down to run the backup.

Slyke commented 1 year ago

As @Paraphraser said, we should try to avoid bringing the stack down during backups (especially if they are triggered by a cron job). But services like InfluxDB and Postgres cannot simply be "copied" while they are running, as the backed-up version may have corrupted data. These services usually have their own internal backup command that can be executed from the host machine to produce a safe backup.
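For illustration, the sort of engine-native backup commands being described might look like this. The container names (`influxdb`, `postgres`), database name, user and paths are assumptions, not IOTstack defaults - adjust them to your compose file:

```shell
#!/usr/bin/env bash
# Sketch: safe "online" backups using each engine's own dump tool,
# executed inside the running container.

# InfluxDB 1.x: portable backup written inside the container, then copied out
docker exec influxdb influxd backup -portable /tmp/influx_backup
docker cp influxdb:/tmp/influx_backup ./backups/influx_backup

# PostgreSQL: pg_dump streams a consistent snapshot to the host
docker exec postgres pg_dump -U postgres mydb > ./backups/mydb.sql
```

Both commands produce a consistent snapshot while the engine keeps serving clients; nothing has to be stopped.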

mdop commented 1 year ago

Thank you @Paraphraser for the suggestion, there are some very interesting ideas in the repo. My issue with it is that it heavily relies on IOTstack-specific assumptions. If, for example, you run another service that has a MariaDB container as a backend, it will potentially get corrupted. I am also not sure whether running a backup of, say, Node-RED while changes are being made would result in a backup the user would expect. Can someone point me to the significant difference between stopping a database, tarballing its storage, and restarting it, versus using its internal backup function? Would the database still be responsive? Also, what is the difference between stopping Nextcloud and putting it into maintenance mode (as in @Paraphraser's repo)? The service is certainly not usable for the backup duration.

I think having the choice of incremental backups is very interesting when running Nextcloud, as backups can quickly become very time- and disk-space-consuming. It took 15/45 minutes to create an uncompressed/compressed ~24 GB backup on an RPi4. At 1 TB and above, doing regular backups can become disruptive.

Paraphraser commented 1 year ago

I might be misreading your response but I'm left feeling that you might've misunderstood my intention when I pointed you at IOTstackBackup. I wasn't trying to say either "don't submit PR" or that there was anything wrong with your approach. Neither was I suggesting that IOTstackBackup already does what you're proposing (it doesn't). I just wanted to make sure you were aware of the existence of IOTstackBackup in case you wanted to borrow anything from its approach. No more, no less.

My issue with it is that it heavily relies on IOTstack-specific assumptions.

Yes. But that's true of all scripts on this repo or any of the satellite repos around IOTstack. We're focused on the problem at hand. No apologies for that.

If, for example, you run another service that has a MariaDB container as a backend, it will potentially get corrupted.

I assume you mean the situation like NextCloud where there's a dedicated instance of MariaDB, rather than the situation where container X just happens to use the MariaDB instance you get if you select MariaDB in the menu.

If I were a user of IOTstackBackup, I'd either fork the IOTstackBackup repo and add my own custom script to deal with that NextCloud-like situation, or open an issue or propose a PR for IOTstackBackup to deal with it.

As the maintainer, if I see another service definition get added to IOTstack that includes a dedicated MariaDB instance, I'll react to that.

On the other hand, if you mean arbitrary container X using the MariaDB instance you get if you select MariaDB in the menu, that doesn't pose a problem. X is just a user of the service. More on this below.

I am also not sure whether running a backup of, say, Node-RED while changes are being made would result in a backup the user would expect.

Flows are just JSON files so edits are either in memory (and won't be seen by a concurrent backup) or are flushed to disk on a "Deploy" (and will be seen).

If a flow writes to an SQLite database stored inside the Node-RED persistent store, that doesn't matter because SQLite databases are already copy-safe.

If a flow writes to a database in another container, that's either an instance of something where there's an existing solution (InfluxDB or MariaDB) or a case on the to-do list (like PostgreSQL where the only safe approach at the moment is to down the stack).

Can someone point me to the significant difference between stopping a database, tarballing its storage, and restarting it, versus using its internal backup function?

In practice, no difference. As far as I'm aware, if you stop any database engine then it should be safe to copy its persistent store. The engine just has to be stopped for the duration of the copy.

When you invoke an engine's internal backup function, you're asking the engine to provide you with a snapshot where it (the engine) takes responsibility for assuring that what is backed-up is both coherent and restorable. My understanding is that the basic mechanism is to wrap the backup request into a transaction via which mutual exclusion is assured. It's a read-only operation so other read-only operations can proceed in parallel while writes will be queued. But I don't believe it's as simplistic as "all writes queued for the duration of the backup". It's writes that might affect what the backup request has already read. Something like that. Bottom line: I've never seen a write rejected because of contention with a backup so, in practice, backups cause zero interference.
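As a concrete sketch of the two approaches being compared, using MariaDB as the example (the service/container name, credentials and paths are assumptions):

```shell
#!/usr/bin/env bash
# Two roads to the same destination, sketched for MariaDB.

# (a) Cold copy: stop the engine, tar the persistent store, restart.
#     Safe, but the database is unavailable for the whole copy.
docker-compose stop mariadb
tar -czf ./backups/mariadb_store.tar.gz ./volumes/mariadb
docker-compose start mariadb

# (b) Online dump: --single-transaction wraps the dump in a
#     consistent-read transaction, so concurrent writers are
#     barely affected and the service stays responsive.
docker exec mariadb mysqldump --single-transaction -uroot -psecret \
  --all-databases > ./backups/mariadb_dump.sql
```

The result is equivalent in safety; the difference is downtime during (a) versus none during (b).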

Most RDBMS of my experience write backups as the SQL commands needed to recreate the schema and import the data. The files can be massive but they compress well. Influx is different in that it exports its shards (or, at least, that's what it looks like to me).

My own Influx databases are the better part of 5 years old. My busiest database is ingesting a new row every 10 seconds and is coming up to 15 million rows. Others are far less busy, typically acquiring a row every 5 minutes. The "raw" size of the persistent store is a bit under 1GB. The time for iotstack_backup_influxdb to run is about 90 seconds and the final .tar is just shy of 400MB.

So, yes, I "get" that this is large but not huge. And I also "get" that, at some point, it will make sense to investigate the Influx internal backup mechanism's ability to produce incremental backups. I just haven't had the need to do that. Yet.

Would the database still be responsive?

A simple, unequivocal "Yes!"

Also, what is the difference between stopping Nextcloud and putting it into maintenance mode (as in @Paraphraser's repo)? The service is certainly not usable for the backup duration.

I'm not sure I can give an answer that will actually address what is probably your underlying question.

The best I can come up with is the "just following orders" excuse of what you see in iotstack_backup_nextcloud being how NextCloud recommends backups should be handled.

In principle, I agree that there is little difference between terminating the container, grabbing the persistent store, and starting the container again - vs - putting the container into maintenance mode, telling the database engine to take a self-backup, then grabbing that plus the rest of the persistent store, and taking the container out of maintenance mode.
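The maintenance-mode sequence described here could be sketched like this. The container names (`nextcloud`, `nextcloud_db`), the `www-data` user, credentials and paths are assumptions; `occ` is NextCloud's standard admin CLI:

```shell
#!/usr/bin/env bash
# Sketch of the maintenance-mode variant of a NextCloud backup.

# Freeze NextCloud's view of its persistent store
docker exec -u www-data nextcloud php occ maintenance:mode --on

# Ask the companion database for a self-backup, then grab everything
docker exec nextcloud_db mysqldump --single-transaction -uroot -psecret \
  nextcloud > ./backups/nextcloud_db.sql
tar -czf ./backups/nextcloud_store.tar.gz ./volumes/nextcloud

# Back to normal service
docker exec -u www-data nextcloud php occ maintenance:mode --off
```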

Given the overall size of the NextCloud persistent store, any actual timing difference between down/up and maintenance mode on/off likely just disappears into the woodwork.

Probably the most that can be said is that at least NextCloud will tell the user that it is in maintenance mode - if it's down, there's nothing to provide that response.

I think having the choice of incremental backups is very interesting when running Nextcloud, as backups can quickly become very time- and disk-space-consuming. It took 15/45 minutes to create an uncompressed/compressed ~24 GB backup on an RPi4. At 1 TB and above, doing regular backups can become disruptive.

Yes. Agreed.

On the topic of NextCloud, a lot of things about it bug me - to the point where, having considered it as something I might use, I decided it was a waste of space. I mean that literally. A clean install with default apps consumes 1.3GB - before I add a single byte of my own data. I look at things like that and it fair screams "bloatware". I could not see why so much of what had come down from the web on first install could not come down again, automatically, on a "bare metal" restore of a backup and actually had to be included in the backup. Ideally, it should only be my data that gets backed up, not a significant chunk of stuff that comes down from the web. That made me question whether its internal self-repair was up to snuff. The whole thing gave me the heebies so I decided it was a headache I really didn't need.

I realise that isn't really much of an answer either.

But - and this is just thinking out loud - it may well be the case that you could get the best of both worlds. Putting NextCloud in maintenance mode stops it from changing its persistent store. I can imagine doing that, making an incremental backup, then taking it out of maintenance mode. Although Googling whether MariaDB can or can't take its own incremental backups turns up conflicting information (unlike InfluxDB, where it's clear that it can), so that would have to be sorted out first. There's also no reason why the tar and zip steps have to occur while in maintenance mode - that's what multiple cores are for! But, as I said, just thinking out loud.


To change the focus, slightly, here are some odds and ends of things that have occurred to me:

  1. You will probably want to add ~/IOTstack/.env to your list of inclusions because that's likely to get a bit more use as we evolve IOTstack. That's the file that docker-compose uses as a source of environment variables for substitution in compose files.
  2. You might want to read IOTstack and "override" files so you get both kinds of override file, and perhaps consider wild-card matches so you get things like docker-compose.yml.bak etc.
  3. If I'm reading it right, the differential backups are stored in ~/IOTstack/backups/diffincbackup. In the case of IOTstackBackup, I'll react to that by excluding diffincbackup from the scope of RSYNC and RCLONE. The flip side, though, is that I haven't spotted anything (other than, perhaps, ./post_backup.sh) which handles the problem of getting differential backups off the local system. Speaking personally, I'm far more worried about the prospect of a Pi's storage becoming unreadable than I am about having backups readily available on the Pi. That's why the end-game for IOTstackBackup is getting the backup files off the local machine via SCP, RSYNC or RCLONE. Do you intend to address that or is it the user's responsibility?
  4. My own view is that scripts without documentation in the Wiki are not playing fair with the large crop of IOTstack users who aren't IT gurus and who are using IOTstack as just a means to an end without wanting to understand the details. That one is intended as a hint. 😎

Hope this helps.

mdop commented 1 year ago

@Paraphraser Perhaps we were talking past each other; no antagonism intended. At first I thought I'd write a quick script for myself, but it ballooned, which made me think it could be an addition to the project. Your scripts didn't quite fit my needs because I have already modified the menu-generated docker-compose.yml quite a lot. I think a (somewhat) generalized approach to backup production not only helps advanced users but also eases the implementation of new services and removes sources of error. To summarize what I took away about containers and backups from the post:

Perhaps the synthesis for an optimal generalized script looks something like this:

That would be quite a different beast than the current script and I am not sure I can tackle it in a reasonable time (sleep(?); familyMembers++;). I'll leave the PR up and let the admins decide whether it is still a useful addition.

I intend to address the trailing points in the post in upcoming commits and, if merged, in wiki edits.

mdop commented 1 year ago

RSYNC and SCP functionality for exporting and importing backups has been added, which should allow 3-2-1 backups out of the box. I don't have Dropbox, so testing rclone would be a bit more work; if there is interest, I can do it. The backup list has been updated to include .env and wildcarded docker-compose and compose-override file extensions. Note that I took this list from the old backup script, which I haven't updated (yet).

I guess the best way to maintain the script would be to include the additional files in the backup script only, and not in the rm section of the restore script, as this would preserve those files when restoring from old backups. On the other hand, that would not be the rollback users may expect. Does anyone have an opinion on that?
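The two restore behaviours being weighed up might be sketched like this (paths and archive name are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: "merge" restore vs "rollback" restore.

# (a) Merge restore: extract over the top. Files absent from the (old)
#     backup - e.g. a newer .env - survive untouched.
tar -xzf ./backups/backup.tar.gz -C ~/IOTstack

# (b) Rollback restore: remove the tracked files first, so the result
#     matches the backup exactly, even against an old archive.
rm -f ~/IOTstack/.env ~/IOTstack/docker-compose.yml*
tar -xzf ./backups/backup.tar.gz -C ~/IOTstack
```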