bit-team / backintime

Back In Time - An easy-to-use backup tool for GNU Linux using rsync in the back
https://backintime.readthedocs.io
GNU General Public License v2.0
2.01k stars 198 forks source link

Weekly snapshots not being retained #1094

Open daveTheOldCoder opened 4 years ago

daveTheOldCoder commented 4 years ago

I'm using the default values for Smart Remove. Hourly and daily snapshots are kept or deleted as expected, but no weekly snapshots are kept.

This seems to be the relevant code in /usr/share/backintime/common/snapshots.py:

        #keep one per week for the last keep_one_per_week weeks
        if keep_one_per_week > 0:
            d = now - datetime.timedelta( days = now.weekday() + 1 )
            for i in range( 0, keep_one_per_week ):
                keep_snapshots = self._smart_remove_keep_first_( snapshots, keep_snapshots, d, d + datetime.timedelta(days=8) )
                d = d - datetime.timedelta(days=7)

I'm trying to figure it out, but I thought I'd ask first if there's a known issue.

daveTheOldCoder commented 4 years ago

I found a potential fix for this problem.

I added some "logger.debug" calls, and changed the cron table entry to create log files so that I could see what's happening.

The function "_smart_remove_keepfirst" keeps the newest snapshot in the specified date/time range, since the snapshots are sorted in descending order of date/time. I added a function "_smart_remove_keeplast" that reverses the order of the snapshots and thus keeps the oldest snapshot.

In the code in the previous post, I replaced the call to "_smart_remove_keepfirst" with a call to "_smart_remove_keeplast". So far it seems to be working. But it needs further testing and analysis.

If this is actually a bug, I don't understand why it hasn't been reported before. Can someone else review the code/algorithm and provide an opinion?

buhtz commented 1 year ago

@daveTheOldCoder There is a new forming maintaining team and we do review all issues. Is this problem still relevant for you or did you find a solution?

Tag: Feedback, Bug

daveTheOldCoder commented 1 year ago

I'm still using the fix I described above, and it works properly. Please review my posts above. Do you agree with my analysis of the problem?

I can provide the exact changes I made, if that's helpful.

And thanks for taking on the job of maintaining this product. :)

daveTheOldCoder commented 1 year ago

Also: I'm using BackInTime version 1.2.1 (the latest version available with the package manager on Pop!_OS 20.04), with my changes added. I'll look at the latest files in the github repository and see if there are relevant changes. Since this issue had no replies prior to yours, I assumed that the issue had not been fixed.

daveTheOldCoder commented 1 year ago

I just looked at BackInTime version 1.3.2, and the relevant code in common/snapshots.py is unchanged from version 1.2.1.

daveTheOldCoder commented 1 year ago

This can be fixed with a one-line change in common/snapshots.py:

In def smartRemoveKeepFirst, change: for sid in snapshots: to: for sid in reversed(snapshots):

But it would probably be "cleaner" to rename smartRemoveKeepFirst to smartRemoveKeepLast or simply smartRemove, and then change the calls to it.

buhtz commented 5 months ago

Dear Dave, sorry for let you wait on this and thanks for your effort and your suggested fix.

The problem is not only resources but the test coverage. The smart remove behavior is not covered by unit tests. Most of the tests in BIT are system or integration tests. And I assume the related code is not well isolated to create a real unit test for it. Because I am not the founder of BIT and not deep enough into all details I am worried to modify this part of the code without test coverage. Introducing bugs into this behavior can be a huge problem and effect a lot of people.

Would you like to jump in? We take you by hand and mentoring you if you need it. So the first job would be to analyze the testability of the smart-remove code. Second would be to carefully refactor and modify that code, hopefully without modifying its behavior, to make it testable. Third job would be to create real and isolated unit tests to fully cover the current behavior. Forth job then is to introduce your fix, running tests on it etc pp.

EDIT: This is how the relevant code section (in common/snapshots.py::Snapshots.smartRemoveList) looks today: https://github.com/bit-team/backintime/blob/e818aeb2b009e2be052685c70dae202c88a00aee/common/snapshots.py#L1674-L1684

I couldn't resist and took a closer look about testabillity. As I thought it is nearly untestable without heavy mocking and hacking. Beside the following points I assume there will come up even more when looking close and longer. Based on my short observation the following points need refactoring to make the behavior testable:

It is much and I just added this points for documentation. I don't think that all aspects need to be full filled to make the target behavior testable. I would say that "decouple from SID" would be a good start. Implementing this will take multiple PRs.

daveTheOldCoder commented 5 months ago

I still don't understand why no one else has reported this problem. For me, it's a critical problem. I wouldn't be able to use BackInTime without my fix.

buhtz commented 5 months ago

Good question. Can you send us a screenshot of your smart remove tab please.

daveTheOldCoder commented 5 months ago

screenshot

buhtz commented 5 months ago

Your screenshot looks OK for me.

I checked my own setup. But my machine do not run every day. So I can not be sure. grafik

buhtz commented 5 months ago

This can be fixed with a one-line change in common/snapshots.py:

In def smartRemoveKeepFirst, change: for sid in snapshots: to: for sid in reversed(snapshots):

But it would probably be "cleaner" to rename smartRemoveKeepFirst to smartRemoveKeepLast or simply smartRemove, and then change the calls to it.

I don't get how this solve your problem. Why does it matter if the first or the last snapshot in a week-range is kept? To my understanding this does not explain why you lose the weekly snapshots.

Maybe you can explain it better with same example dates?

daveTheOldCoder commented 5 months ago

It's been almost four years since I submitted the issue. I'll try to find my notes.

emtiu commented 1 month ago

A way to efficiently test this issue might be to manipulate the date/time of a VM to simulate the passing of weeks and months quickly.

emtiu commented 1 month ago

I have created a testbed with the following shell code in a VM: for N in {1..60}; do sudo date -s '+7 days'; echo date > backupfile; backintime backup; done

In extreme cases, it appears to work as expected: Keeping "one snapshot per week for the last 52 weeks" really does leave 1 years' worth of weekly snapshots. However, I'm not so sure about shorter timespans. Calculating dates is hard!

My current guess is that we may be dealing with an off-by-1 problem here, where a setting of "one snapshot per week for the last n weeks" only actually keeps one for the last n–1 weeks. This would be especially significant for the default configuration of:

buhtz commented 1 month ago

I once had a tool called faketime I used for things like this. Worked well.

emtiu commented 1 month ago

I think I'm very close to cracking this case. In this function: https://github.com/bit-team/backintime/blob/22f468c32732be6c821686e0c504b4dfb920487a/common/snapshots.py#L1686-L1696smartRemoveKeepFirst is called with a max_date of d + datetime.timedelta(days=8) — not days=7!

Therefore, in a routine that says "keep only the youngest snapshot of these 8 days to consider", it will also delete the youngest snapshot of the previous week — but only if it's caught by the last in a series of "keep one per week" calls.

In effect, of "one snapshot per week for the last n weeks", the n-th week has its youngest snapshot thrown away, even though it shouldn't.

I'll need to investigate if changing the call of smartRemoveKeepFirst to a max_date of d + datetime.timedelta(days=7) fixes the problem. There must have been some reason to set an 8 there.

P.S.: The official soundtrack for this bug is Only For The Week.

buhtz commented 1 month ago

There are some unit tests in test_snapshot.py about that smart remove thing. I touched some once long ago but then gave up. A full refactoring of that tests would be the best. But a first step would be to see if this use case is covered by that tests and add such a test if not.

daveTheOldCoder commented 1 month ago

It occurred to me that what makes the "smart remove" behavior confusing is that the snapshots are named using only the timestamp. The names don't contain "hourly", "daily", "weekly", etc.

I wonder if adding an index file that identifies daily, weekly, monthly and yearly snapshots would help.

Or maybe I'm wrong. :)

emtiu commented 1 month ago

There are some unit tests in test_snapshot.py about that smart remove thing. I touched some once long ago but then gave up. A full refactoring of that tests would be the best. But a first step would be to see if this use case is covered by that tests and add such a test if not.

I'm confident that I can fix this bug, and to make sure I don't introduce any new ones, I'm doing some manual testing in a VM.

I'm totally unfamiliar with automatic testing, though. I'll take a look, but I'd rather not put this bugfix on the "waiting list" until I've cracked the automatic testing of Smart Remove, because that would probably take a very long time ;)

emtiu commented 1 month ago

It occurred to me that what makes the "smart remove" behavior confusing is that the snapshots are named using only the timestamp. The names don't contain "hourly", "daily", "weekly", etc.

I wonder if adding an index file that identifies daily, weekly, monthly and yearly snapshots would help.

Or maybe I'm wrong. :)

I see what you mean, but BiT doesn't work like that. Snapshots are not created as "weekly" or "monthly" snapshots. In fact, they are only characterized by their date+time of saving (plus a random three-digit number, which is called the "tag"). Together, this makes a "Snapshot ID" (SID for short): 20241117-115358-562

The only time when a snapshot might be characterized as "weekly" or "monthly" is when the Smart Remove routine determines: "this snapshot should be kept, because it's the only one from that week/month, and it should be kept according to the removal rules".

Therefore, any single snapshot can start out as an "hourly" snapshot, then become a "weekly" one, and maybe even a "monthly" one later.

daveTheOldCoder commented 1 month ago

I see what you mean, but BiT doesn't work like that.

I know that. To figure out the workaround I posted above, I added logging code in the smart-remove processing to log the "kept" and "removed" snapshots, so I could see what was happening.

emtiu commented 1 month ago

I know that. To figure out the workaround I posted above, I added logging code in the smart-remove processing to log the "kept" and "removed" snapshots, so I could see what was happening.

I see, and I've also understood why your workaround is successful. It's hard to put into words ;)

What happens is that the code considers snapshots from 8 days (instead of 7) when determining which to keep for a specific week. But in a one-snapshot-per-week environment, 8 days will likely contain two snapshots. Since smartRemoveKeepFirst keeps only the newer one, the older one is lost.

This has no consequences as long as the previous week is also considered before deletion. But it fails for the last of the weeks to be considered (resulting in the n–1 situation I described above).

Your workaround reverses the list of snapshots, so in the "8-day-week" that is the last to be considered, the older of the two snapshots is kept, and your problem disappears.

However, it is my understanding that the problem disappears completely when days=8 is changed into days=7 :)