framps / raspiBackup

Create and keep multiple backup versions of your running Raspberries
https://raspibackup.linux-tips-and-tricks.de
GNU General Public License v3.0
809 stars 74 forks source link

raspiBackup tar backup operation hangs in ls on Bookworm using cifs #760

Closed madbrain76 closed 3 months ago

madbrain76 commented 5 months ago

I did a new install of Bookworm on my Raspberry Pi 4B. I am backing up onto a ZFS file system on my Ubuntu 22.04 NAS, shared via Samba, and connected via cifs on the Pi side. The file system is mounted in fstab like this šŸ‘

//higgs.local/Network\040Backups /higgs_backup cifs _netdev,credentials=/etc/creds,uid=madbrain,gid=madbrain,x-systemd.automount 0 0

The problem I see is that the backup operation hangs at the end with this message : --- RBK0033I: Please wait until cleanup has finished.

I have waited over 10 hours, and it didn't change.

I reviewed the process list with pstree and found that raspiBackup had an ls child process. ps -ef showed the command was "ls -la /backup/pi64/pi64-tar-backup-20240525-173558". I tried to run that command in a terminal, and it hung as well. However, I could list older backups just fine in /backup/pi64 . Something is happening with the specific directory. I tried to use strace -p to see where ls was stuck, but it hung also. The ls process cannot be killed even with kill -9. I logged on to the NAS and could list the backup directory just fine. I am unable to unmount /backup even with umount -f or umount -l . It just says the source is busy. If I try to reboot Phe pi after the backup hangs, it just hangs and doesn't actually reboot. I have to unplug the power supply phyiscally for it to come back up.

This is most likely some kind of cifs model kernel bug, but I have so far only seen it with raspiBackup, so I'm filing the issue here to start.

I will be attaching log files shortly - just need to run one more backup.

madbrain76 commented 5 months ago

Here is the log. I'm really not too sure what to make of it, but one of the operations is triggering the bug. My OS is entirely up to date at this moment.

madbrain@pi64:/tmp $ uname -a Linux pi64 6.6.28+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.28-1+rpt1 (2024-04-22) aarch64 GNU/Linux

raspiBackup.log

madbrain76 commented 5 months ago

More diagnostic - looks like this is related to the img file.

madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707
^C
madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707
^C
madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707/*
^C
madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707/*.tar
-rwxr-xr-x 1 madbrain madbrain 22035210240 May 25 18:07 /backup/pi64/pi64-tar-backup-20240525-180707/pi64-tar-backup-20240525-180707.tar
madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707/*.mbr
-rwxr-xr-x 1 madbrain madbrain 512 May 25 18:07 /backup/pi64/pi64-tar-backup-20240525-180707/pi64-backup.mbr
madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707/*.sfdisk
-rwxr-xr-x 1 madbrain madbrain 198 May 25 18:07 /backup/pi64/pi64-tar-backup-20240525-180707/pi64-backup.sfdisk
madbrain@pi64:~ $ ls -al /backup/pi64/pi64-tar-backup-20240525-180707/*.img
madbrain76 commented 5 months ago

trailing output of ls with that file .

close(3)                                = 0
futex(0x7fb41918cc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
statx(AT_FDCWD, "/backup/pi64/pi64-tar-backup-20240525-180707/pi64-backup.img", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW|AT_NO_AUTOMOUNT, STATX_MODE|STATX_NLINK|STATX_UID|STATX_GID|STATX_MTIME|STATX_SIZE, 

Looks like it's hanging in statx. Probably more helpful for the CIFS kernel people. But I still wonder what's different about that img file.

framps commented 5 months ago

Starting with Bookworm I get more and more reports about raspiBackup hanging in the ls command when the tar backup type is used and the backup partition is a mounted cifs partition.

I had one guy who used raspiBackup successfully for a long time on Bullseye. When he then setup a new system with Bookworm and configured everything identical to the Bullseye systems he got the strange ls hang. He finally decided to go back to Bullseye and raspiBackup didn't hang any more.

Looks like it's hanging in statx. Probably more helpful for the CIFS kernel people. But I still wonder what's different about that img file.

I agree with you and it's a CIFS and/or kernel issue in Bookworm.

I frankly don't know how to debug this issue. You're much more experienced than I because you nailed down the issue to the img file :wink:

I tried to use strace -p to see where ls was stuck, but it hung also. The ls process cannot be killed even with kill -9. I logged on to the NAS and could list the backup directory just fine. I am unable to unmount /backup even with umount -f or umount -l . It just says the source is busy. If I try to reboot Phe pi after the backup hangs, it just hangs and doesn't actually reboot. I have to unplug the power supply phyiscally for it to come back up.

That's another indication for me there is some issue with CIFS and/or the kernel.

If you have and idea how to debug this issue and further more to report the issue to the CIFS and/ore kernel folks please let me know.

madbrain76 commented 5 months ago

Ha ! Yeah, I have been coding since 1989. That was some time before Linux. My DOS & OS/2 days. I'm still mainly a C/C++ programmer nowadays.

I actually found a workaround, mounting the share differently :

//server10g.local/zfs/Backups /backup cifs _netdev,credentials=/etc/creds,uid=madbrain,gid=madbrain,x-systemd.automount 0 0

//server10g.localdomain/zfs/Backups /backup cifs credentials=/etc/creds,vers=3.0,sec=ntlmssp,uid=madbrain,gid=madbrain,iocharset=utf8,soft,noserverino,cache=none,actimeo=30 0 0

The first way causes the hang. This is the syntax I was using with Bullseye.

The second way doesn't hang. However, the backup takes 7 times as long ! 20 minutes vs 3, for about 9GB. If the Pi4 was purely IO-bound, it should take only about 1m16s on Gigabit Ethernet. I also have a 2.5 Gbps USB NIC I'm not using because it's not stable enough. NAS is 10 Gbps with over 100TB.

I'm going to try to reduce the test case to just 3 commands - mount, dd and ls . Hopefully I can succeed and hand that off to cifs kernel guys.

framps commented 5 months ago

I'm going to try to reduce the test case to just 3 commands - mount, dd and ls . Hopefully I can succeed and hand that off to cifs kernel guys.

That's great. Thank you very much.

I actually found a workaround, mounting the share differently :

Thank you very much for sharing. As far as I can see you turn off anything which will speed up the data transfer and thus it takes much longer. But at least there is no hang any more :+1:

Would be great if you keep me updated on the progress you make on this issue. As I already wrote: You're not the first one who reports this issue and I would be happy if I can tell future folks how to fix the issue. Not sure if everybody is willing to increase the backup time by factor 7 with your workaround :smirk:

madbrain76 commented 5 months ago

I figured I wouldn't be the first one. Unfortunately, so far,

I left the problem-causing entry in fstab for /backup, and this happened.

madbrain@pi64:~ $ sudo dd if=/dev/sda1 of="/backup/test.img" bs=1M
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.6404 s, 80.8 MB/s
madbrain@pi64:~ $ ls -al /backup/test.img
-rwxr-xr-x 1 madbrain madbrain 536870912 May 26 06:13 /backup/test.img
madbrain@pi64:~ $ 

ie. no issue. So, I failed to reproduce the problem outside of invoking raspiBackup. There has got to be something else the program does that triggers it. Can't imagine what it is.

framps commented 5 months ago

I failed to reproduce the problem outside of invoking raspiBackup.

:cry: There is nothing special: raspiBackup just uses the mounted partition. Maybe it's the tar command used which causes the issue? You can see the command used in the debug log:

-- RBK0085I: Backup of type tar started. Please be patient.
20240525-180720 DBG 2589:                  --> executeTar 
20240525-180720 DBG 2592:                      --- LC_ALL=C tar         -cpi --one-file-system                      -v      -f "/backup/pi64/pi64-tar-backup-20240525-180707/pi64-tar-backup-20240525-180707.tar"       --warning=no-xdev       --numeric-owner         --exclude="/backup/*"       --exclude="/tmp/raspiBackup.log"        --exclude="/tmp/raspiBackup.msg"        --exclude='.gvfs'       --exclude=/proc/*       --exclude=/lost+found/*         --exclude=/sys/*        --exclude=/dev/*        --exclude=/tmp/*        --exclude=/swapfile         --exclude=/run/*        --exclude=/media/*              /

There has got to be something else the program does that triggers it.

One of the other folks also reported ls just works fine when invoked outside of raspiBackup. But for me it seems to be an intermittent issue and not related to raspiBackup.

I created a dedicated label for this issue so I can track this issue better in the future. I unfortunately don't find the other places where this issue was reported. Unfortunately I also allow to report issues in a forum and on my website. As far as I remember the guy who went back from Bookworm even reported the issue in German.

Anyhow: I will now keep an eye on this issue. If for some reasons you find anything else which may help to locate the root cause please let me know.

Question now is: How do you handle this issue now? Use your workaround?

framps commented 5 months ago

I found two other issues here in github which also report an ls hang and added the label. One guy uses Buster and not Bookworm. I've never had any issue with nfs ...

madbrain76 commented 5 months ago

I always used CIFS/Samba for my backups. I never saw any hang on Buster, only Bookworm. Not sure if it's the tar command - that sounds unlikely, but perhaps it is the trigger.

As to what I'm going to personally do about it, I'm not sure yet. I may go back to Buster. I have two separate SSDs, one with my old Buster setup and another with the new Bookworm setup.

I also found another annoying a bug with the clipboard on wayvnc . I don't think I ran into that before Bookworm.

madbrain76 commented 5 months ago

Some progress. I wrote the following script to simulate raspiBackup .

#!/bin/bash
echo "clean up previous test"
rm /backup/test/*
echo "stopping services"
systemctl stop smokeping && systemctl stop apache2 && systemctl stop cron && systemctl stop cups-browsed && systemctl stop cups && systemctl stop nmbd && systemctl stop smbd
echo "running tar"
tar -cpi --one-file-system -f "/backup/test/pi64.tar" --warning=no-xdev --numeric-owner --exclude="/backup/*" --exclude="/tmp/raspiBackup.log" --exclude="/tmp/raspiBackup.msg" --exclude='.gvfs' --exclude=/proc/* --exclude=/lost+found/* --exclude=/sys/* --exclude=/dev/* --exclude=/tmp/* --exclude=/swapfile --exclude=/run/* --exclude=/media/* /
echo "running dd for backup.mbr"
dd if=/dev/sda of="/backup/test/backup.mbr" bs=512 count=1
echo "running dd for backup.img"
dd if=/dev/sda1 of="/backup/test/backup.img" bs=1M
echo "starting services"
systemctl start smbd && systemctl start nmbd && systemctl start cups && systemctl start cups-browsed && systemctl start cron && systemctl start apache2 && systemctl start smokeping
echo "running ls"
ls -al /backup/test
echo "success"

I ran it a few times, once under strace as follows. Here was the output :

root@pi64:~# strace --follow-forks --output-separately --output=logs/raspi ./simbackup.sh 
clean up previous test
rm: cannot remove '/backup/test/pi64.tar': No such file or directory
stopping services
running tar
tar: /backup/test/pi64.tar: Cannot open: No such file or directory
tar: Error is not recoverable: exiting now
running dd for backup.mbr
1+0 records in
1+0 records out
512 bytes copied, 0.00888068 s, 57.7 kB/s
running dd for backup.img
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 6.24561 s, 86.0 MB/s
starting services
running ls
total 5182011
drwxr-xr-x 2 madbrain madbrain           0 May 26 16:17 .
drwxr-xr-x 2 madbrain madbrain           0 May 26 15:49 ..
-rwxr-xr-x 1 madbrain madbrain   536870912 May 26 16:17 backup.img
-rwxr-xr-x 1 madbrain madbrain         512 May 26 16:17 backup.mbr
-rwxr-xr-x 0 madbrain madbrain 12348989440 May 26 16:08 pi64.tar
success
root@pi64:~# ls -al /backup/test/pi64.tar
ls: cannot access '/backup/test/pi64.tar': No such file or directory
root@pi64:~# ls -al /backup/test/*
-rwxr-xr-x 1 madbrain madbrain   536870912 May 26 16:17 /backup/test/backup.img
-rwxr-xr-x 1 madbrain madbrain         512 May 26 16:17 /backup/test/backup.mbr
-rwxr-xr-x 0 madbrain madbrain 12693237760 May 26 16:08 /backup/test/pi64.tar
root@pi64:~# rm /backup/test/pi64.tar
rm: cannot remove '/backup/test/pi64.tar': No such

pi64.tar cannot be listed, or removed. Something is FUBAR. I logged in to the NAS itself, and the file can indeed be listed and removed. I did so.

And then this happened on the pi :


oot@pi64:~# ls -al /backup/test/*
ls: cannot access '/backup/test/*': No such file or directory
root@pi64:~# tar: /var/lib/smokeping/Wifi/cca-bar.rrd: file changed as we read it
tar: /var/lib/smokeping/Wifi/ccultra.rrd: file changed as we read it

root@pi64:~# history | grep -i^C
root@pi64:~# ps -ef | grep -i tar
madbrain    1440    1020  0 06:01 ?        00:00:00 /usr/bin/gnome-keyring-daemon --start --foreground --components=secrets
root       24003       1  0 15:46 ?        00:00:00 /usr/bin/gnome-keyring-daemon --start --foreground --components=secrets
root       25227   25204 17 16:08 pts/2    00:02:02 tar -cpi --one-file-system -f /backup/test/pi64.tar --warning=no-xdev --numeric-owner --exclude=/backup/* --exclude=/tmp/raspiBackup.log --exclude=/tmp/raspiBackup.msg --exclude=.gvfs --exclude=/proc/* --exclude=/lost+found/* --exclude=/sys/* --exclude=/dev/* --exclude=/tmp/* --exclude=/swapfile --exclude=/run/* --exclude=/media/* /
root       26177       1  0 16:17 ?        00:00:00 /usr/sbin/apache2 -k start
www-data   26178   26177  0 16:17 ?        00:00:00 /usr/sbin/apache2 -k start
www-data   26179   26177  0 16:17 ?        00:00:00 /usr/sbin/apache2 -k start
www-data   26180   26177  0 16:17 ?        00:00:00 /usr/sbin/apache2 -k start
root       26351   24296  0 16:20 pts/2    00:00:00 grep -i tar

Nor really sure why tar is running in the background. Might be because I pressed CTRL-C before in one of my attempts.

In any case, this might be a different bug since it isn't a hang. I have yet to figure out how raspiBackup creates the .sfdisk file. Need to run strace on it since I couldn't make sense of it from the debug log.

madbrain76 commented 5 months ago

After a reboot, had no issue with the following script :

root@pi64:~# cat simbackup.sh 
#!/bin/bash
echo "clean up previous test"
rm /backup/test/*
echo "stopping services"
systemctl stop smokeping && systemctl stop apache2 && systemctl stop cron && systemctl stop cups-browsed && systemctl stop cups && systemctl stop nmbd && systemctl stop smbd
echo "running sfdisk"
sfdisk -d /dev/sda > /backup/test/part.sfdisk
echo "running tar"
tar -cpi --one-file-system -f "/backup/test/pi64.tar" --warning=no-xdev --numeric-owner --exclude="/backup/*" --exclude="/tmp/raspiBackup.log" --exclude="/tmp/raspiBackup.msg" --exclude='.gvfs' --exclude=/proc/* --exclude=/lost+found/* --exclude=/sys/* --exclude=/dev/* --exclude=/tmp/* --exclude=/swapfile --exclude=/run/* --exclude=/media/* --exclude=/root/logs/* /
echo "running dd for backup.mbr"
dd if=/dev/sda of="/backup/test/backup.mbr" bs=512 count=1
echo "running dd for backup.img"
dd if=/dev/sda1 of="/backup/test/backup.img" bs=1M
echo "starting services"
systemctl start smbd && systemctl start nmbd && systemctl start cups && systemctl start cups-browsed && systemctl start cron && systemctl start apache2 && systemctl start smokeping
echo "running ls"
ls -al /backup/test
echo "success"

Can you think of any operation that raspiBackup is performing that I'm missing ? I could find out with strace --follow-forks, but there are tons of log files to review, so I figured I would ask you first.

framps commented 5 months ago

Thank you very much you're still digging into the issue. I'm unfortunately unable to reproduce the issue.

I see the following list of possible tests and workarounds which may be useful:

1) I had a look into the code and found

function cleanupBackupDirectory() {

    logEntry

    logCommand "ls -la "$BACKUPTARGET_DIR""

    if (( $rc != 0 )); then

        if [[ -d "$BACKUPTARGET_DIR" ]]; then
            if [[ -z "$BACKUPPATH" || -z "$BACKUPFILE" || -z "$BACKUPTARGET_DIR" || "$BACKUPFILE" == *"*"* || "$BACKUPPATH" == *"*"* || "$BACKUPTARGET_DIR" == *"*"* ]]; then
                assertionFailed $LINENO "Invalid backup path detected. BP: $BACKUPPATH - BTD: $BACKUPTARGET_DIR - BF: $BACKUPFILE"
            fi
            writeToConsole $MSG_LEVEL_MINIMAL $MSG_REMOVING_BACKUP "$BACKUPTARGET_DIR"
            rm -rfd $BACKUPTARGET_DIR # delete incomplete backupdir
            local rmrc=$?
            if (( $rmrc != 0 )); then
                writeToConsole $MSG_LEVEL_MINIMAL $MSG_REMOVING_BACKUP_FAILED "$BACKUPTARGET_DIR" "$rmrc"
            fi
        fi
    fi

    logExit
}

As you can see the ls command is just a debugging statement to see which backups exist already and will be recycled. I have some setup to log all the stuff in /tmp/raspiBackup.log

    exec 1> >(stdbuf -i0 -o0 -e0 tee -ia "$LOG_FILE")
    exec 2> >(stdbuf -i0 -o0 -e0 tee -ia "$LOG_FILE" >&2)

Not sure - maybe the usage of logCommand together with the IO redirection is the issue? You may comment out the log statement and try to recreate the issue.

2) I also noticed you use systemd mount. Actually you don't have to do this. In a later release raspiBackup got an option dynamicMount which causes raspiBackup to mount the backup partition during start and umount at the end if the backupPartition is not already mounted. So another try would be to remove the systemd automount options, umount the backup partition and use option dynamicMount instead. There can be a memory leak or buffer overflow in the CIFS code and an unmount between the backup calls should flush all buffers and free any allocated memory. Maybe it's just the huge number of files which are xferred? My systems usually have around 10GB to backup. You have about 22GB which is not much more. But the devil is a squirrel :smile:

3) I don't think the issue is related to the sfdisk file. It's just a random file which is actually processed when the issue happened. That's why I think there is some memory leak or buffer overflow. In your workaround you used noserverino. I also discussed the issue in a forum and there one guy told me he had had already issues with CIFS and this statement solved the issue. Maybe use this option but don't use cache=none to get rid of factor 7 in your workaround.

4) raspiBackup is usually started in the background by systemd. When you execute your tests you start raspiBackup from the console. I don't know whether this makes a difference.

5) raspiBackup uses a function

function logSystemDiskState() {
    logEntry
    logCommand "blkid"
    logCommand "fdisk -l"
    logCommand "mount"
    logCommand "df -h -l"
    logExit
}

before and after the backup. I remember there were raspiBackup users who got a hang when logCommand "df -h" was used. That's why I added -llater on in a fix. But this was before this CIFS issue popped up. You may try to remove -l and check whether it now hangs there. It's one of the early calls executed but also executed at the end.

6) As another workaround I suggest to use rsync instead of tar. You mentioned you own a NAS which should be able to export a backup partition via nfs. You even will gain a performance boost because the rsync backup is an incremental backup and no full backup like tar.

madbrain76 commented 5 months ago

Yes, it's possible that redirection has something to do with it. However, I see the problem even without verbose/debug logging enabled, so I'm not sure that it does. What would be more helpful would be a list of all the commands I could grep in the log from. Say "exec: tar xxx". As it stands, there is no such common string and I have to sift through it manually. And some things like sfdisk are not logged at all.

The reason I don't use dynamic mount is that my NAS is not powered on 24/7 due to it consuming significant power and adding noise with the 8 x 14TB HDDs. I wake it up on demand with WOL. I normally invoke raspiBackup manually, not from any scheduled job. As I recall, I put in a feature request way back for a "pre-action" that I could have used to call my "wakeupnas" script. But alas, that feature request was rejected.

Would it help if I provided you with a system backup that reproduced the problem ? My Raspberry Pi is a 4B with 8GB, BTW. You would still need to have a NAS with a shared drive. I'm not sure the failure is specific to the Ubuntu NAS. I also have a Win10 box with a share on it that I could try with Raspibackup.

madbrain76 commented 5 months ago

Tried it against a Windows 10 share - same problem, hangs in ls -al .

framps commented 5 months ago

However, I see the problem even without verbose/debug logging enabled, so I'm not sure that it does.

Log level verbose just adds the -v flag in dd, tar and rsync. The debug log can be turned off with -l off (default is on) but even then the exec is used to redirect the terminal output and the messages into files. You can use -L varlog to write the log into /var/log/raspiBackup. But I don't think this makes any difference.

What would be more helpful would be a list of all the commands I could grep in the log from. Say "exec: tar xxx"

I'm not sure I get your point: There exist functions in raspiBackup called executeDD, executeTar and executeRsync. They write a debuglog of the command exeited in next step. Is that what you're asking for?

As I recall, I put in a feature request way back for a "pre-action" that I could have used to call my "wakeupnas" script. But alas, that feature request was rejected.

Looks like you use raspiBackup for a long time already. In latest release there exist a lot of extensionpoints where you can plug in custom code

Would it help if I provided you with a system backup that reproduced the problem ? My Raspberry Pi is a 4B with 8GB, BTW.

Unfortunately not. I will just see the ls call and that's it. I have to be able to reproduce it reliably. My 3 Raspberries create a tar backup instead of a rsync backup since weeks - but no hang.

You would still need to have a NAS with a shared drive. I'm not sure the failure is specific to the Ubuntu NAS. I also have a Win10 box with a share on it that I could try with Raspibackup.

I have a Synology NAS.

Are you able to reproduce the issue reliably now? And do you reproduces it via cron or on commandline?

Would you please do me a favor? Please comment out the ls command in cleanupBackupDirectory() by inserting # in front of the command or just delete the line and check whether the issue is gone? If that's true I'll remove the debug statement and we have a workaround without a factor 7 :wink:

madbrain76 commented 5 months ago

re: the commands, I just want to know the exact syntax that is made for each call, so I can isolate them like the shell script I posted earlier.

Yes, I have used raspiBackup for a long time. Thank you for making it.

As far as your 3 Raspberries go, are any of them running Bookworm ? Even if they aren't, I am willing to bet that if I gave you a backup to restore, you could reproduce the problem.

Yes, I can reproduce the issue reliably. I always use the command-line so far, not cron.

I can try your suggestion next time.

I have actually had a bit of catastrophe last night. The file system on my Bookworm SSD became corrupt, to the point that it would no longer boot. This is likely due to so many abrupt power offs while debugging this hang issue.

So, I needed to actually restore a backup made with raspiBackup. I tried to restore it from my Ubuntu system using a USB to SATA adapter. I let the restore operation proceed for over an hour. It never made any progress. iostat showed no activity on /dev/sdk which was the device. I started to inspect the backup directories manually. That's when I found out that many of the backups were incomplete. In particular, my home directory was nowhere to be found in the tar file. And there was a bunch of scripts I wrote that were only stored there. Also, there were no logs written, or msg file either.

Fortunately, I had a few backups that had been done with the slower mount method. Those were actually complete, and I was able to restore one without any issue in just 3 minutes.

framps commented 5 months ago

re: the commands, I just want to know the exact syntax that is made for each call, so I can isolate them like the shell script I posted earlier.

I'm sorry. I still don't understand what you need.

As far as your 3 Raspberries go, are any of them running Bookworm ? Even if they aren't, I am willing to bet that if I gave you a backup to restore, you could reproduce the problem.

All of them are running Bookworm. I recently migrated them from Buster to Bookworm. You're talking about a backup to restore. The ls command is executed at the end of the backup processing :thinking:

I can try your suggestion next time.

Thank you.

Fortunately, I had a few backups that had been done with the slower mount method. Those were actually complete, and I was able to restore one without any issue in just 3 minutes.

Thank goodness you were able to recover. Looks like you canceled the active tar command. But then raspiBackup should have deleted the incomplete backup. And if you killed hanging raspiBackup then the tar command is complete and no data should be missing :thinking:

madbrain76 commented 5 months ago

I tried your suggestion, and it succeeded. Very odd. Now running it again the way it was to confirm the hang. Edit: yes, it definitely hangs again with this call.

I may have killed tar manually once or twice, but if that was the case, I would expect the archive to be incomplete or corrupt. And that didn't seem to be the case. I can certainly go over all the failed backup that don't have logs.

framps commented 5 months ago

I tried your suggestion, and it succeeded. Very odd. Now running it again the way it was to confirm the hang. Edit: yes, it definitely hangs again with this call.

I agree with you. That's strange. I just commented out this statement. If you execute sudo raspiBackup -U -S you will grab the new code.

I may have killed tar manually once or twice, but if that was the case, I would expect the archive to be incomplete or corrupt. And that didn't seem to be the case. I can certainly go over all the failed backup that don't have logs.

Fortunately the cleanup algorithm has changed in next upcoming release and all kinds of aborts or kills of raspiBackup will no longer leave a new incomplete backup in the backup directory. So from my point of view there is no root cause analysis required from your side :wink:

My question is now whether you want to continue to find the root cause of the issue or just accept and use my workaround. If the workaround is a true workaround I'd be ok to close the issue.

Would be great if you monitor now your backups for a few weeks with the updated code and reopen the issue if it happens again for you.

madbrain76 commented 5 months ago

Thanks !

I just updated raspiBackup and did a backup. Here is the output : ... --- RBK0010I: pi64: raspiBackup.sh V0.6.9.1 - 2024-05-29 (dcfab2f) stopped at Wed May 29 12:27:36 PM PDT 2024 with rc 0. --- RBK0026I: Debug logfile saved in /backup/pi64/pi64-tar-backup-20240529-122512/raspiBackup.log.

real    2m24.814s
user    0m13.031s
sys 0m52.176s
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240529-122512/raspiBackup.log
-rwxr-xr-x 1 madbrain madbrain 90050 May 29 12:27 /backup/pi64/pi64-tar-backup-20240529-122512/raspiBackup.log
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240529-122512

As you can see, the backup completed, which is great. Unfortunately, I still experienced a hang doing the second ls manually :-(

I think you were right about me having pressed CTRL-C a few too many times, which explained the truncated tar files.

I believe it's worth finding the root cause of this problem, even if the bug does not lie in raspiBackup itself. A hang problem still remains subsequently even after the successful backup. And that will cause the Pi4 to hang when trying to reboot it. I actually put my Pi4 on smartplug yesterday so I can power cycle it without putting wear on the USB-C connector.

One thing I could still try, after the next power cycling, would be to try to unmount and remount /backup .

framps commented 5 months ago

Unfortunately, I still experienced a hang doing the second ls manually :-(

You have isolated the trigger for the hang in the ls command :+1:

I believe it's worth finding the root cause of this problem, even if the bug does not lie in raspiBackup itself.

Agree. I'm glad you already found a workaround which will work for future folks who will face the same issue. But let's continue to find the root cause :wink:

Glad you're also keen to find the root cause. You're the one who is able to reproduce the issue. I'm not - unfortunately.

would be to try to unmount and remount /backup .

This will cleanup anything on the client and server side. But given you didn't change anything on the server side this will prove the issue is on the client side.

madbrain76 commented 5 months ago

I may have isolated the trigger, but the root cause is still unknown.

I tried umount /backup and then mount -a . Then, ls worked fine on the most recent backup directory. I even listed other files, including the .img and .tar file. No problem.

IMO, it's been quite clear from the start that this was a client side problem. The server isn't experiencing any hang. Even if the server implemented the protocol incorrectly, the client should still fail gracefully. It should never result in a total system hang. There is likely something running in ring 0 that's problematic.

And I haven't changed anything on my server except running update/patches. There is not much running on it other than Samba, VNC, and a JRiver server. I still have my Bullseye SSD, and can verify that raspiBackup doesn't hang there, but I see no need to do that at this time.

madbrain76 commented 5 months ago

FYI, my latest attempt - return early in executeTar. The problem was not reproduced. So, running tar is part of the root cause. But it isn't the only one. I still don't know why my simple 5-line shell script above didn't reproduce it, since it also uses tar. Does github really nto have comment numbers to create links from ? Apparently not. So, I was talking about this script :

https://github.com/framps/raspiBackup/issues/760#issuecomment-2132429827


root@pi64:/home/madbrain# diff -u5 /usr/local/bin/raspiBackup /usr/local/bin/raspiBackup.sh.sav
--- /usr/local/bin/raspiBackup  2024-05-30 05:21:28.958125856 -0700
+++ /usr/local/bin/raspiBackup.sh.sav   2024-05-30 04:22:25.623293328 -0700
@@ -2582,11 +2582,10 @@

 # Removing leading `/' from member names message is annoying. Use grep -v "Removing" to remove the message
 # and use $PIPESTATUS and catch and return the tar RC

 function executeTar() { # cmd flagsToIgnore
-        return 0
    logEntry
    local rc cmd
    cmd="LC_ALL=C $1"
    logItem "$cmd"
    ( eval "$cmd" 2>&1 1>&5 | grep -iv " Removing" | tee -a $MSG_FILE $LOG_FILE; exit ${PIPESTATUS[0]} ) 5>&1
root@pi64:/home/madbrain# vi /usr/local/bin/raspiBackup
root@pi64:/home/madbrain# time raspiBackup
--- RBK0009I: pi64: raspiBackup.sh V0.6.9.1 - 2024-05-29 (dcfab2f) started at Thu May 30 05:16:35 AM PDT 2024.
--- RBK0031I: Checking whether a new version of raspiBackup.sh is available.
--- RBK0151I: Using backuppath /backup with partition type cifs.
--- RBK0044I: Creating backup of boot partition in /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.img.
--- RBK0158I: Creating native tar backup "/backup/pi64/pi64-tar-backup-20240530-051634/pi64-tar-backup-20240530-051634.tar".
--- RBK0085I: Backup of type tar started. Please be patient.
--- RBK0078I: Backup time: 00:00:07.
--- RBK0033I: Please wait until cleanup has finished.
--- RBK0159I: 30 backups kept for tar backup type. Please be patient.
--- RBK0017I: Backup finished successfully.
--- RBK0010I: pi64: raspiBackup.sh V0.6.9.1 - 2024-05-29 (dcfab2f) stopped at Thu May 30 05:16:49 AM PDT 2024 with rc 0.
--- RBK0026I: Debug logfile saved in /backup/pi64/pi64-tar-backup-20240530-051634/raspiBackup.log.

real    0m16.174s
user    0m2.895s
sys 0m5.533s
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240530-051634
total 206723
drwxr-xr-x 2 madbrain madbrain         0 May 30 05:16 .
drwxr-xr-x 2 madbrain madbrain         0 May 30 05:16 ..
-rwxr-xr-x 1 madbrain madbrain 536870912 May 30 05:16 pi64-backup.img
-rwxr-xr-x 1 madbrain madbrain       512 May 30 05:16 pi64-backup.mbr
-rwxr-xr-x 1 madbrain madbrain       198 May 30 05:16 pi64-backup.sfdisk
-rwxr-xr-x 1 madbrain madbrain     89269 May 30 05:16 raspiBackup.log
-rwxr-xr-x 1 madbrain madbrain       977 May 30 05:16 raspiBackup.msg
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.img
-rwxr-xr-x 1 madbrain madbrain 536870912 May 30 05:16 /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.img
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.mbr
-rwxr-xr-x 1 madbrain madbrain 512 May 30 05:16 /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.mbr
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.sfdisk 
-rwxr-xr-x 1 madbrain madbrain 198 May 30 05:16 /backup/pi64/pi64-tar-backup-20240530-051634/pi64-backup.sfdisk
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240530-051634/raspiBackup.log 
-rwxr-xr-x 1 madbrain madbrain 89269 May 30 05:16 /backup/pi64/pi64-tar-backup-20240530-051634/raspiBackup.log
root@pi64:/home/madbrain# ls -al /backup/pi64/pi64-tar-backup-20240530-051634/raspiBackup.msg
-rwxr-xr-x 1 madbrain madbrain 977 May 30 05:16 /backup/pi64/pi64-tar-backup-20240530-051634/raspiBackup.msg
root@pi64:/home/madbrain# 
framps commented 5 months ago

We talked about the redirection stuff of raspiBackup. Maybe you should modify executeTar to

function executeTar() { # cmd flagsToIgnore
    logEntry
    logItem "$cmd"
       ( eval "$cmd" 2>&1 1>&5 | tee -a $MSG_FILE $LOG_FILE) 5>&1
       ignoreErrorRC $? "$2"
    rc=$?
    logExit $rc
    return $rc
}

You may use similar code in your script:

tar -cpi --one-file-system -f "/backup/test/pi64.tar" --warning=no-xdev --numeric-owner --exclude="/backup/*" --exclude="/tmp/raspiBackup.log" --exclude="/tmp/raspiBackup.msg" --exclude='.gvfs' --exclude=/proc/* --exclude=/lost+found/* --exclude=/sys/* --exclude=/dev/* --exclude=/tmp/* --exclude=/swapfile --exclude=/run/* --exclude=/media/* /
 2>&1 1>&5 | tee -a /tmp/raspiBackup.MSG_FILE /tmp/raspiBackup.LOG) 5>&1

If it's still not possible to reproduce the issue with your script we may add the exec calls to spawn addtl threads to log everything of your script as raspiBackup does. That's under the assumption the redirection stuff is the root cause of the issue.

madbrain76 commented 5 months ago

I couldn't get that line to work in my script. There is an extraneous parenthesis. I tried to remove it, but it doesn't work either. I just don't know what syntax to use. And ChatGPT failed me too :)

I changed the function in raspiBackup, and it still hung after ls in the backup directory.

framps commented 5 months ago

I added the stdout/stderr redirection stuff in your script. Please check if the issue now can be reproduced.

madbrain.sh

madbrain76 commented 5 months ago

Unfortunately, the updated script you provided does not reproduce the hang issue. It must be a specific sequence of things that does.

framps commented 5 months ago

:cry: Looks like the only way to get an idea about the root cause is to strip down raspiBackup step by step until the issue is gone.

Do you see any other alternative?

madbrain76 commented 5 months ago

Not sure. I was hoping to get some useful output from "strace -ff -o raspi". But it looks like there are way too many processes involved - no less than 454 ! I tried to write a script to make a pstree-like view of the whole thing and see if anything stands out, but this is a complex task.

madbrain76 commented 5 months ago

I have got good news and bad news.

The good news is that, after updating to the latest Bookworm patches on my system earlier today, raspiBackup succeeds with your latest release, as well as subsequent ls, and even "cat * > /dev/null" to read all the files.

Th bad news is also that it succeeds, and I can no longer reproduce the problem.

framps commented 5 months ago

The good news is that, after updating to the latest Bookworm patches on my system earlier today, raspiBackup succeeds with your latest release,

Interesting latest kernel updates fix the issue

apt list --upgradable
Listing... Done
libdav1d6/stable 1.0.0-2+rpi1+deb12u1 armhf [upgradable from: 1.0.0-2+rpi1]
linux-headers-rpi-v6/stable 1:6.6.31-1+rpt1 armhf [upgradable from: 1:6.1.63-1+rpt1]
linux-headers-rpi-v7/stable 1:6.6.31-1+rpt1 armhf [upgradable from: 1:6.1.63-1+rpt1]
linux-headers-rpi-v7l/stable 1:6.6.31-1+rpt1 armhf [upgradable from: 1:6.1.63-1+rpt1]
linux-image-rpi-v6/stable 1:6.6.31-1+rpt1 armhf [upgradable from: 1:6.1.63-1+rpt1]
linux-image-rpi-v8/stable 1:6.6.31-1+rpt1 arm64 [upgradable from: 1:6.1.63-1+rpt1]
linux-libc-dev/stable,stable 1:6.6.31-1+rpt1 all [upgradable from: 1:6.6.28-1+rpt1]
raspberrypi-sys-mods/stable 20240527+1 armhf [upgradable from: 20231127]
raspi-firmware/stable,stable 1:1.20240529-1 all [upgradable from: 1:1.20240424-1]
raspi-utils/stable 20240402-3 armhf [upgradable from: 20231017-1]
rpi-eeprom/stable,stable 23.1-1 all [upgradable from: 21.4-1]

Th bad news is also that it succeeds, and I can no longer reproduce the problem.

Well, do you think it's bad news? I think that's good news because the issue seems to be fixed and there is no need any more to find the root cause of the issue.

But nevertheless I will not add the ls debug command again.

framps commented 5 months ago

@madbrain76 It's disapointing not to know the root cause of the issue you detected. But I'm frankly happy the issue disappeared with a kernel update.

Do you agree the issue is fixed now - by magic? If yes, please close the issue :wink:

madbrain76 commented 5 months ago

I think technically I could restore one of the backups made before the system patches that fixed it to narrow down the issue. But I'm not sure I want to spend any more time on it.

madbrain76 commented 5 months ago

Actually, I just got it to happen again even with the latest patches :-( so now it seems to be random. Sigh.

framps commented 5 months ago

:cry: But it didn't happen without the ls command - did it?

madbrain76 commented 5 months ago

No, not without ls.

madbrain76 commented 5 months ago

A failed backup does hang, though.

root@pi64:/etc# !time time raspiBackup --- RBK0009I: pi64: raspiBackup.sh V0.6.9.1 - 2024-05-29 (dcfab2f) started at Wed Jun 5 03:25:40 AM PDT 2024. --- RBK0151I: Using backuppath /backup with partition type cifs. --- RBK0044I: Creating backup of boot partition in /backup/pi64/pi64-tar-backup-20240605-032539/pi64-backup.img. --- RBK0158I: Creating native tar backup "/backup/pi64/pi64-tar-backup-20240605-032539/pi64-tar-backup-20240605-032539.tar". --- RBK0085I: Backup of type tar started. Please be patient. tar: /home/madbrain/.cache/mozilla/firefox/u9caw3w7.default-release/cache2/entries: file changed as we read it ??? RBK0021E: Backupprogram for type tar failed with RC 1. --- RBK0033I: Please wait until cleanup has finished. --- RBK0043I: Removing incomplete backup in /backup/pi64/pi64-tar-backup-20240605-032539. This may take some time. Please be patient.

framps commented 5 months ago

:cry: That's bad. It's most probably this command which deletes the incomplete backup directory.

madbrain76 commented 5 months ago

I agree. I think we need to pass this on to kernel and/or CIFS developers. Just not sure where to file the issue. I will go to the Raspberry Pi forums and inquire..

framps commented 5 months ago

I will go to the Raspberry Pi forums and inquire..

Great idea. I also have no clue how to address this isssue.I subscribed to your thread and hope somebody will be able to help on this. It's really a nasty issue.

madbrain76 commented 5 months ago

No response yet in the Pi forum. But I brought it to the attention of one of their engineers today in a different thread. Perhaps there will be a followup.

In the meantime, I am installing Debian bookworm on Virtualbox on my PC. I will try to mount my NAS share the same way as I did on the Pi. I won't be able to run raspibackup as-is, though, since the partitioning scheme will differ. Can you think of any modification that could be made to the raspiBackup script that would allow it to run on Debian x64 ? This may help determine whether the issue is generic to bookworm, or whether it's specific to bookworm on the Pi.

madbrain76 commented 5 months ago

So far, I'm unable to reproduce on Bookworm VM. Any idea how to backup the correct partitions with tar ? These would be /dev/sda1 and /dev/sda3 .

root@osboxes:~# fdisk -l
Disk /dev/sda: 500 GiB, 536870912000 bytes, 1048576000 sectors
Disk model: VBOX HARDDISK   
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x2725cbb3

Device     Boot     Start        End   Sectors   Size Id Type
/dev/sda1  *         2048  462694399 462692352 220.6G 83 Linux
/dev/sda2       462694400  480272383  17577984   8.4G 82 Linux swap / Solaris
/dev/sda3       480272384 1048573951 568301568   271G 83 Linux

root@osboxes:~# time raspiBackup --unsupportedEnvironment --ignoreAdditionalPartitions
!!! RBK0269W: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
!!! RBK0268W: @@@ Unsupported environment @@@
!!! RBK0269W: @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
--- RBK0009I: osboxes: raspiBackup.sh V0.6.9.1 - 2024-05-29 (dcfab2f) started at Wed Jun 12 07:16:54 PM EDT 2024.
--- RBK0200I: /boot and / located on same partition /dev/sda.
!!! RBK0210W: More than two partitions detected. Only first two partitions are saved.
--- RBK0151I: Using backuppath /backup with partition type cifs.
!!! RBK0157W: No services to stop.
--- RBK0158I: Creating native tar backup "/backup/osboxes/osboxes-tar-backup-20240612-191654/osboxes-tar-backup-20240612-191654.tar".
--- RBK0085I: Backup of type tar started. Please be patient.
--- RBK0078I: Backup time: 00:00:37.
--- RBK0033I: Please wait until cleanup has finished.
--- RBK0159I: 3 backups kept for tar backup type. Please be patient.
--- RBK0017I: Backup finished successfully.
--- RBK0010I: osboxes: raspiBackup.sh V0.6.9.1 - 2024-05-29 (dcfab2f) stopped at Wed Jun 12 07:17:32 PM EDT 2024 with rc 0.
--- RBK0026I: Debug logfile saved in /backup/osboxes/osboxes-tar-backup-20240612-191654/raspiBackup.log.

real    0m39.193s
user    0m2.251s
sys 0m8.840s
root@osboxes:~# ls -al /backup/osboxes/osboxes-tar-backup-20240612-191654/osboxes-tar-backup-20240612-191654.tar
-rwxr-xr-x 1 osboxes osboxes 4934983680 Jun 12 19:16 /backup/osboxes/osboxes-tar-backup-20240612-191654/osboxes-tar-backup-20240612-191654.tar
root@osboxes:~# ls -al /backup/osboxes/osboxes-tar-backup-20240612-191654/
total 2687695
drwxr-xr-x 2 osboxes osboxes          0 Jun 12 19:17 .
drwxr-xr-x 2 osboxes osboxes          0 Jun 12 19:16 ..
-rwxr-xr-x 1 osboxes osboxes 4934983680 Jun 12 19:16 osboxes-tar-backup-20240612-191654.tar
-rwxr-xr-x 1 osboxes osboxes      54479 Jun 12 19:17 raspiBackup.log
-rwxr-xr-x 1 osboxes osboxes       1128 Jun 12 19:17 raspiBackup.msg
root@osboxes:~# uname -a
Linux osboxes 6.1.0-15-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.66-1 (2023-12-09) x86_64 GNU/Linux
root@osboxes:~# 

Looks like this only had kernel 6.1, whereas the Pi was on 6.6 . I'll take a snapshot of the VM and upgrade it now.

madbrain76 commented 5 months ago

It appears Debian has not moved to kernel 6.6 series yet. The CIFS regression may well be between 6.1 and 6.6 kernels. I wonder if there is a way to rollback the kernel on the Pi. Unfortunately, there doesn't seem to be an easy way to do that, other than pulling the old kernel source and building it myself, which is more than I want to do for this investigation. I may however still have a tar backup made shortly after I installed Bookworm, possibly before the kernel was upgraded. I'll look for that. Edit: unfortunately, I don't.

madbrain76 commented 5 months ago

So, I upgraded to kernel 6.6.13 from the bookworm backports repository.

echo "deb http://deb.debian.org/debian bookworm-backports main" >> /etc/apt/sources.list
apt install linux-image-6.6.13+bpo-amd64

The issue with raspiBackup could not be reproduced. That doesn't mean it doesn't exist on Debian x64, just that I can't confirm it is common to both x64 and Raspberry Pi.

framps commented 5 months ago

But I brought it to the attention of one of their engineers today in a different thread. Perhaps there will be a followup.

Thank you very much. Crossing fingers ... I appreciate you spend your expertise to locate the root cause of the issue.

Can you think of any modification that could be made to the raspiBackup script that would allow it to run on Debian x64 ?

I think you don't need raspiBackup to create a full backup. As far as I understand you need raspiBackup just to create a backup of the root partition and no functioning restore. I then will create a dedicated raspiBackup version which skips any Raspberry tests to get raspiBackup to create root partition backup. This may require some iterations - invoke raspiBackup - send me the log - I update raspiBackup to skip tests or something else - and next try ... Let me know if you want me to create this special version.

Any idea how to backup the correct partitions with tar ?

According the messages and result in the backup directory raspiBackup created a backup of the root partition and you can just ignore RBK0021W

The issue with raspiBackup could not be reproduced. That doesn't mean it doesn't exist on Debian x64, just that I can't confirm it is common to both x64 and Raspberry Pi.

:cry:

madbrain76 commented 5 months ago

I think you don't need raspiBackup to create a full backup. As far as I understand you need raspiBackup just to create a backup of the root partition and no functioning restore.

The home directory is stored in the third partition, and that didn't get backed up, only the first one.

framps commented 5 months ago

Then switch to partition oriented mode and use

sudo raspiBackup -P -T "*" -t tar

Then all partitions will be saved. If you want to save only specific partitions, for example 1,2 and 3 and not 4 and 5 use `-T "1 2 3".

madbrain76 commented 5 months ago

Thanks. I saved partitions 1 and 3. But I still couldn't reproduce the problem.

MASTERDR1VE commented 4 months ago

Hi @ all, unfortunately, I need to tag along. I found this thread, because I saw an increase in IOWAIT by 25% each time raspiBackup was scheduled to run. Checking the processes with ps -auxf I got root 16310 0.0 0.0 7284 1664 ? D Jun16 0:01 ls -la /mnt/PI_Backup/PI4/PI4-tar-backup-20240616-003500 which identified the ls command as the culprit. Currently, I'm running 0.6.9.1. I saw in the previous comments, that @framps created a "no ls version", but I also saw comments stating upgrading the kernel to 6.6.13 fixing the issue. I'm running 6.6.20+rpt-rpi-v8, and can reliably reproduce (100%) the "hang", regardless of raspiBackup being started via systemd or directly via shell. Does this help with further investigating the issue? Unfortunately, I can't kill the process, and even umount/mount -a won't stop the IOWAIT, so I need to reboot every now and then to keep the system stable. If further investigation support might not be required: what would be the most reliable "workaround"? The "no ls version", the changed mount as mentioned in https://github.com/framps/raspiBackup/issues/760#issuecomment-2132168844, both?