Venom1991 / refind-btrfs

Generate rEFInd manual boot stanzas from Btrfs snapshots
GNU General Public License v3.0
144 stars 8 forks source link

Systemd service doesnt trigger on snapshot creation, errors on install #14

Open Meister1593 opened 2 years ago

Meister1593 commented 2 years ago

Because of this, i need to manually run refind-btrfs each time i want to populate refind with snapshots. All journalctl -u refind-btrs -b says is this:

-- Journal begins at Tue 2021-08-10 23:52:30 +04, ends at Thu 2021-08-12 00:35:19 +04. --
Aug 12 00:20:29 plyshkahost systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 00:20:29 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[635]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/2, /.snapshots/3, /.snapshots/4, /.snapshots/5, /.snapshots/6.
Aug 12 00:20:29 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[635]: Starting the observer with PID 635.
Aug 12 00:20:29 plyshkahost systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.

And also when i run this, it gives me such error.

configs: refind-btrfs.conf refind.conf

Venom1991 commented 2 years ago

This is a fresh installation?
I'd try deleting the "local_db" file in the /var/lib/refind-btrfs directory, deleting created snapshots in /root/.refind-btrfs (btrfs subvolume delete ) and running the tool manually again.

Meister1593 commented 2 years ago

This is a fresh installation? I'd try deleting the "local_db" file in the /var/lib/refind-btrfs directory and running the tool manually again.

Yes, it's fresh installation Removing local db and files in /root/.refind-btrfs didnt help, same error /root/.refind-btrfs isnt a subvolume, its just a folder My real snapshots are stored in /.snapshots subvol

Venom1991 commented 2 years ago

Generated writable snapshots are stored in the /root/.refind-btrfs directory because you've set the "modify_read_only_flag" option to false (default value).
Can you print the contents of the btrfs-snapshot-stanzas/root_vmlinuz-linux.conf file? It's a directory relative to rEFInd's root directory.

I think this has something to do with having a separate /boot partition - the logic of matching boot stanzas with snapshots and the root device is different in this case. Recent changes might have broken it, somehow.

I'll have to try and reproduce it locally, later today. The absence of unit tests is starting to become a real nuisance...

Meister1593 commented 2 years ago

Generated writable snapshots are stored in the /root/.refind-btrfs directory because you've set the "modify_read_only_flag" option to false (default value). Can you print the contents of the btrfs-snapshot-stanzas/root_vmlinuz-linux.conf file? It's a directory relative to rEFInd's root directory.

I think this has something to do with having a separate /boot partition - the logic of matching boot stanzas with snapshots and the root device is different in this case. Recent changes might have broken it, somehow.

I'll have to try and reproduce it locally, later today. The absence of unit tests is starting to become a real nuisance...

Here: root_vmlinuz-linux.conf

Yes, i do have separate /boot partition (for luks2 to work properly) this is my layout:

NAME                 MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                    8:0    0 931.5G  0 disk  
├─sda1                 8:1    0   512M  0 part  /boot
└─sda2                 8:2    0   931G  0 part  
  └─crypt            254:0    0   931G  0 crypt /var/tmp
                                                /var/cache/pacman/pkg
                                                /srv
                                                /home
                                                /btrfs
                                                /.snapshots
                                                /.swapvol
                                                /
Venom1991 commented 2 years ago

I think that this line might have something to do with the issue you're experiencing.

Would it be possible for you to comment it out (or delete it altogether) and try running the refind-btrfs script? It's just so I can narrow down the cause of the problem. I'm not entirely sure if that's going to work on its own though, you should probably also delete these two files:

I'd also stop the systemd service although it probably just exits on its own (unexpected exception).

Meister1593 commented 2 years ago

I deleted these two python files, stopped service and commented these lines. After running refind-btrfs, it didnt fail and seems to work fine

Venom1991 commented 2 years ago

Thank you, that's really helpful information. Fixing it should be easier now.

You could try running the service now and seeing whether or not it works after a snapshot is automatically taken by Snapper.

Meister1593 commented 2 years ago

Doesn't seem like its generating new stanza after creating snapshot... (the service)

Venom1991 commented 2 years ago

Hm, an error should be visible in the journal via journalctl -u refind-btrs -b.
Did you create a snapshot manually by running snapper -c root create -c number?

Meister1593 commented 2 years ago

Hm, an error should be visible in the journal via journalctl -u refind-btrs -b. Did you create a snapshot manually by running snapper -c root create -c number?

No errors I did, yes snapper create -t single

Venom1991 commented 2 years ago

No errors I did, yes snapper create -t single

And a new snapshot was created under /.snapshots? I don't understand why there's no output in the journal at all.

Meister1593 commented 2 years ago

No errors I did, yes snapper create -t single

And a new snapshot was created under /.snapshots? I don't understand why there's no output in the journal at all.

Yes, they are created in that folder and visible by snapper list Pre-post snapshots doesnt seem to be triggering service too

Venom1991 commented 2 years ago

What exactly does journalctl -u refind-btrs -b print now? Same as in your original post?

Meister1593 commented 2 years ago

What exactly does journalctl -u refind-btrs -b print now? Same as in your original post?

-- Journal begins at Tue 2021-08-10 23:52:30 +04, ends at Thu 2021-08-12 15:17:51 +04. --
Aug 12 10:40:55 plyshkahost systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 10:40:56 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[1068]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/10, /.snapshots/11, /.snapshots/12, /.snapshots/13, /.snapshots/14, /.snapshots/15, /.snapshots/16, /.snapshots/17, /.snapshots/18, /.snapshots/19, /.snapshots/2, /.snapshots/20, /.snapshots/21, /.snapshots/22, /.snapshots/23, /.snapshots/24, /.snapshots/25, /.snapshots/26, /.snapshots/27, /.snapshots/28, /.snapshots/29, /.snapshots/3, /.snapshots/30, /.snapshots/31, /.snapshots/32, /.snapshots/33, /.snapshots/34, /.snapshots/35, /.snapshots/36, /.snapshots/37, /.snapshots/38, /.snapshots/39, /.snapshots/4, /.snapshots/40, /.snapshots/41, /.snapshots/5, /.snapshots/6, /.snapshots/7, /.snapshots/8, /.snapshots/9.
Aug 12 10:40:56 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[1068]: Starting the observer with PID 1068.
Aug 12 10:40:56 plyshkahost systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.
Aug 12 14:45:58 plyshkahost systemd[1]: Stopping Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 14:45:58 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[1068]: Received terminating signal 15, stopping the observer...
Aug 12 14:45:59 plyshkahost systemd[1]: refind-btrfs.service: Deactivated successfully.
Aug 12 14:45:59 plyshkahost systemd[1]: Stopped Generate rEFInd manual boot stanzas from Btrfs snapshots.
Aug 12 14:45:59 plyshkahost systemd[1]: refind-btrfs.service: Consumed 1.861s CPU time.
Aug 12 14:54:40 plyshkahost systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 14:54:40 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[19295]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/10, /.snapshots/11, /.snapshots/12, /.snapshots/13, /.snapshots/14, /.snapshots/15, /.snapshots/16, /.snapshots/17, /.snapshots/18, /.snapshots/19, /.snapshots/2, /.snapshots/20, /.snapshots/21, /.snapshots/22, /.snapshots/23, /.snapshots/24, /.snapshots/25, /.snapshots/26, /.snapshots/27, /.snapshots/28, /.snapshots/29, /.snapshots/3, /.snapshots/30, /.snapshots/31, /.snapshots/32, /.snapshots/33, /.snapshots/34, /.snapshots/35, /.snapshots/36, /.snapshots/37, /.snapshots/38, /.snapshots/39, /.snapshots/4, /.snapshots/40, /.snapshots/41, /.snapshots/42, /.snapshots/43, /.snapshots/44, /.snapshots/45, /.snapshots/46, /.snapshots/47, /.snapshots/48, /.snapshots/5, /.snapshots/6, /.snapshots/7, /.snapshots/8, /.snapshots/9.
Aug 12 14:54:40 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[19295]: Starting the observer with PID 19295.
Aug 12 14:54:40 plyshkahost systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.
Aug 12 15:00:30 plyshkahost systemd[1]: Stopping Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 15:00:30 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[19295]: Received terminating signal 15, stopping the observer...
Aug 12 15:00:31 plyshkahost systemd[1]: refind-btrfs.service: Deactivated successfully.
Aug 12 15:00:31 plyshkahost systemd[1]: Stopped Generate rEFInd manual boot stanzas from Btrfs snapshots.
Aug 12 15:04:07 plyshkahost systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 15:04:07 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[20301]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/10, /.snapshots/11, /.snapshots/12, /.snapshots/13, /.snapshots/14, /.snapshots/15, /.snapshots/16, /.snapshots/17, /.snapshots/18, /.snapshots/19, /.snapshots/2, /.snapshots/20, /.snapshots/21, /.snapshots/22, /.snapshots/23, /.snapshots/24, /.snapshots/25, /.snapshots/26, /.snapshots/27, /.snapshots/28, /.snapshots/29, /.snapshots/3, /.snapshots/30, /.snapshots/31, /.snapshots/32, /.snapshots/33, /.snapshots/34, /.snapshots/35, /.snapshots/36, /.snapshots/37, /.snapshots/38, /.snapshots/39, /.snapshots/4, /.snapshots/40, /.snapshots/41, /.snapshots/42, /.snapshots/43, /.snapshots/44, /.snapshots/45, /.snapshots/46, /.snapshots/47, /.snapshots/48, /.snapshots/49, /.snapshots/5, /.snapshots/6, /.snapshots/7, /.snapshots/8, /.snapshots/9.
Aug 12 15:04:07 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[20301]: Starting the observer with PID 20301.
Aug 12 15:04:07 plyshkahost systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.

After last entry, I did create snapshots 2-3 times

Venom1991 commented 2 years ago

If it does work it should output various information to the journal, not just that it started.

Btw, I also noticed on my own machine that the filesystem events are not raised and/or caught by (not sure) the Watchdog library which I'm using for this feature. It happens exactly at midnight but not every day, pretty weird.

The official package repo now has a pretty outdated version of this library. I'm the one who flagged it, if I remember correctly.

Meister1593 commented 2 years ago

If it does work it should output various information to the journal, not just that it started.

Btw, I also noticed on my own machine that the filesystem events are not raised and/or caught by (not sure) the Watchdog library which I'm using for this feature. It happens exactly at midnight but not every day, pretty weird.

The official package repo now has a pretty outdated version of this library. I'm the one who flagged it, if I remember correctly.

Gonna install git version ASAP)

Venom1991 commented 2 years ago

You mean this one? Hm, maybe that does the trick but I'm not too sure...

Meister1593 commented 2 years ago

Got this error after making snapshot

Aug 12 15:49:54 plyshkahost systemd[1]: Starting Generate rEFInd manual boot stanzas from Btrfs snapshots...
Aug 12 15:49:54 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[27571]: Scheduling watch for directories: /.snapshots, /.snapshots/1, /.snapshots/10, /.snapshots/11, /.snapshots/12, /.snapshots/13, /.snapshots/14, /.snapshots/15, /.snapshots/16, /.snapshots/17, /.snapshots/18, /.snapshots/19, /.snapshots/2, /.snapshots/20, /.snapshots/21, /.snapshots/22, /.snapshots/23, /.snapshots/24, /.snapshots/25, /.snapshots/26, /.snapshots/27, /.snapshots/28, /.snapshots/29, /.snapshots/3, /.snapshots/30, /.snapshots/31, /.snapshots/32, /.snapshots/33, /.snapshots/34, /.snapshots/35, /.snapshots/36, /.snapshots/37, /.snapshots/38, /.snapshots/39, /.snapshots/4, /.snapshots/40, /.snapshots/41, /.snapshots/42, /.snapshots/43, /.snapshots/44, /.snapshots/45, /.snapshots/46, /.snapshots/47, /.snapshots/48, /.snapshots/49, /.snapshots/5, /.snapshots/50, /.snapshots/51, /.snapshots/52, /.snapshots/53, /.snapshots/54, /.snapshots/6, /.snapshots/7, /.snapshots/8, /.snapshots/9.
Aug 12 15:49:54 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[27571]: Starting the observer with PID 27571.
Aug 12 15:49:54 plyshkahost systemd[1]: Started Generate rEFInd manual boot stanzas from Btrfs snapshots.
Aug 12 15:50:50 plyshkahost /usr/lib/python3.9/site-packages/refind_btrfs/__main__.py[27571]: ERROR (refind_btrfs.service.snapshot_observer/snapshot_observer.py/run): An unexpected error happened, exiting...
                                                                                              Traceback (most recent call last):
                                                                                                File "/usr/lib/python3.9/site-packages/refind_btrfs/service/snapshot_observer.py", line 45, in run
                                                                                                  self.dispatch_events(self.event_queue, self.timeout)
                                                                                                File "/usr/lib/python3.9/site-packages/watchdog/observers/api.py", line 372, in dispatch_events
                                                                                                  handler.dispatch(event)
                                                                                                File "/usr/lib/python3.9/site-packages/watchdog/events.py", line 282, in dispatch
                                                                                                  self.event_dispatch_map[event.event_type](event)
                                                                                              AttributeError: 'SnapshotEventHandler' object has no attribute 'event_dispatch_map'
Aug 12 15:50:50 plyshkahost systemd[1]: refind-btrfs.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 15:50:50 plyshkahost systemd[1]: refind-btrfs.service: Failed with result 'exit-code'.
Venom1991 commented 2 years ago

That version is even more outdated, it seems (breaking API changes) but at least something happened, I guess.

Meister1593 commented 2 years ago

So, whats next thing to do, how can I help?

Venom1991 commented 2 years ago

Well, currently nothing. I'll try to fix the issue regarding the error you've seen (script's output) and then we'll just have to see when I make a new release.
I've never asked anyone to delete CPython's cache before (nor did I try that myself), there could be some residue left of which I'm not aware. It's best to overwrite everything with a new version (both source and bytecode files).

I'll also check the service's output on my own machine today even though it worked just fine yesterday.

Venom1991 commented 2 years ago

I've found the culprit (I hope so), pushed the fix and created a new release. Please check it out when you can and report back here.

Meister1593 commented 2 years ago

I've found the culprit (I hope so), pushed the fix and created a new release. Please check it out when you can and report back here.

I just installed new version, got warning about not found cpython file but after installation it got installed just fine checked that file where i commented out code, it got uncommented (which indicates it was replaced by new version) checked refind-btrfs - no errors and it worked just fine service, of course, still doesnt work

Venom1991 commented 2 years ago

This is the output I get with a freshly installed version and a snapshot created at 20:00 PM (18:00 UTC):

refind-btrfs-log.txt

I really have no idea why it doesn't work on your machine. What is so special about your setup, I wonder? You're using Snapper as am I but that shouldn't really matter.

Damn, I've just noticed that Watchdog currently has 142 open issues... I wish there was an alternative library, at this point.

Meister1593 commented 2 years ago

I use luks2, i have two additional read-only disks (old disks that i currently use as migration storage to new), they are luks2 too both and one of them have 2 efistubs on fat32 boot partition, which i can boot into. I mostly followed this guide for setting up btrfs on arch, since i barely knew about btrfs and wanted some basic guide to make system with btrfs (i did make other luks setup mostly by myself)

Venom1991 commented 2 years ago

The same author has another guide, as well. I don't know why he felt the need to have more than one.
He also had a couple of issues with refind-btrfs which were fixed a few months ago.

I mean it's entirely possible to use inotify directly and run the script when needed but I didn't want to go with that option because Watchdog already utilizes it on Linux and it's been working more or less fine for me and other users, it seems.
You are the first person who's experiencing these issues, unfortunately, so I'm somewhat in the dark here.

Venom1991 commented 2 years ago

Hm, I've found this interesting piece of information. There is apparently a maximum amount of watchers allowed by inotify. Perhaps you could try increasing it and see what happens then? Mine is set at 524288, try setting it to 1048576 and rebooting the system. You can check the current value with cat /proc/sys/fs/inotify/max_user_watches.

EDIT: This setting can be changed temporarily, as well (the first link).

Meister1593 commented 2 years ago

Hm, I've found this interesting piece of information. There is apparently a maximum amount of watchers allowed by inotify. Perhaps you could try increasing it and see what happens then? Mine is set at 524288, try setting it to 1048576 and rebooting the system. You can check the current value with cat /proc/sys/fs/inotify/max_user_watches.

EDIT: This setting can be changed temporarily, as well (the first link).

I increased that watchdog limit via sysctl and it didn't work either, same 4 lines in status...

As of now i decided to make at least temporary hack to run refind-btrfs on startup and shutdown:

refind-btrfs-create.service

[Unit]
Description=Populate snapshots on startup and shutdown

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/refind-btrfs
ExecStop=/usr/bin/refind-btrfs

[Install]
WantedBy=multi-user.target
Venom1991 commented 2 years ago

Pretty weird, this issue definitely requires further investigation.

wcasanova commented 2 years ago

I have the same problem, increase the inotifywait, and check the ones in use. 1701 WATCHES TOTAL COUNT https://github.com/fatso83/dotfiles/blob/master/utils/scripts/inotify-consumers installation similar to Th3Whit3Wolf

ERROR (refind_btrfs/__init__.py/main): An unexpected error happened, exiting...
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/refind_btrfs/__init__.py", line 80, in main
    exit_code = runner.run()
File "/usr/lib/python3.10/site-packages/refind_btrfs/service/watchdog_runner.py", line 93, in run
    observer.start()
File "/usr/lib/python3.10/site-packages/watchdog/observers/api.py", line 259, in start
    emitter.start()
File "/usr/lib/python3.10/site-packages/watchdog/utils/__init__.py", line 116, in start
    self.on_thread_start()
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify.py", line 120, in on_thread_start
    self._inotify = InotifyBuffer(path, self.watch.is_recursive)
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify_buffer.py", line 35, in __init__
    self._inotify = Inotify(path, recursive)
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify_c.py", line 189, in __init__
    Inotify._raise_error()
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify_c.py", line 426, in _raise_error
    raise OSError(errno.EMFILE, "inotify instance limit reached")
OSError: [Errno 24] inotify instance limit reached
Venom1991 commented 2 years ago

I have the same problem, increase the inotifywait, and check the ones in use. 1701 WATCHES TOTAL COUNT https://github.com/fatso83/dotfiles/blob/master/utils/scripts/inotify-consumers installation similar to Th3Whit3Wolf

ERROR (refind_btrfs/__init__.py/main): An unexpected error happened, exiting...
Traceback (most recent call last):
File "/usr/lib/python3.10/site-packages/refind_btrfs/__init__.py", line 80, in main
  exit_code = runner.run()
File "/usr/lib/python3.10/site-packages/refind_btrfs/service/watchdog_runner.py", line 93, in run
  observer.start()
File "/usr/lib/python3.10/site-packages/watchdog/observers/api.py", line 259, in start
  emitter.start()
File "/usr/lib/python3.10/site-packages/watchdog/utils/__init__.py", line 116, in start
  self.on_thread_start()
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify.py", line 120, in on_thread_start
  self._inotify = InotifyBuffer(path, self.watch.is_recursive)
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify_buffer.py", line 35, in __init__
  self._inotify = Inotify(path, recursive)
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify_c.py", line 189, in __init__
  Inotify._raise_error()
File "/usr/lib/python3.10/site-packages/watchdog/observers/inotify_c.py", line 426, in _raise_error
  raise OSError(errno.EMFILE, "inotify instance limit reached")
OSError: [Errno 24] inotify instance limit reached

Interesting, it could be caused by either one of these:

Arch is still using the old 0.10.x watchdog's branch intended for what is now a prehistoric Python version instead of the newer 1.0.x branch intended for Python 3.6 and newer.
I've personally flagged the package as outdated, months ago.

wcasanova commented 2 years ago

Interesting, it could be caused by either one of these:

* a pretty huge number of snapshots

if I had too many snapshots, the cleanup service had stopped. delete the local_db, and the confs in refind now works