framps / raspiBackup

Create and keep multiple backup versions of your running Raspberries
https://raspibackup.linux-tips-and-tricks.de
GNU General Public License v3.0

Out of memory crash #620

Closed · p1r473 closed 1 year ago

p1r473 commented 1 year ago

Hi Framps! You might remember me from #604. I have 3 Pis and they all have the exact same config for raspiBackup. However, I noticed that one of my healthchecks has been down every day for the last few days, on my primary Pi. This started ever since I ran a raspiBackup -d restoration on this Pi, I think last week. As far as I can tell, this memory crashing happens only during raspiBackup. The other 2 Pis with identical config work great, and this one used to work great too. I am also using the same new SD card I unpackaged in #604. Every time I run raspiBackup on this Pi, I eventually run out of memory and the Pi starts crashing and shutting down programs.

I am watching it with glances and watch top, and it will be running fine, smooth as butter, then all of a sudden my Pi will just start ravaging and destroying itself.

[screenshot]

[Logs + config]
Syslog from right before to during the terrible crash
My config
raspiBackup.log

If you can help me fix this, I will be happy to donate, same as last time- this is my primary production Pi.

p1r473 commented 1 year ago

Worth noting: just this week I added the following line to my config: exclude="/var/www/apache/public/BrownNoise.mp3". This is a 1.3GB file. But it is excluded, and I am running an 8GB Pi. You can see in the screenshots I have nearly 6.6GB free. This is the ONLY difference between my 3 raspiBackup configs: this file exists on this Pi only.

I have REMOVED this line from the config and just tried raspiBackup again. The result: same issue. Darn, I was hoping that was the issue.

framps commented 1 year ago

That's bad. Unfortunately I get a 403 (Forbidden) when I try to access your logs.

You're the first one to report an OOM with raspiBackup. I'm sure it's one of the Linux tools used by raspiBackup that causes the OOM.

p1r473 commented 1 year ago

@framps sorry, I fixed the log access. I had it set to private but it was supposed to be set to unlisted. Oops!

framps commented 1 year ago

I fixed the log access.

I just downloaded the files. You can now either delete them or protect them again. I did a quick scan and there are some sensitive details you most probably don't want to share with the public :wink:

framps commented 1 year ago

I merged the two logs - syslog and raspiBackup log and that's what I see:

2:27:13 - raspiBackup starts (syslog)
2:27:24 - tar starts to backup the root partition (raspiBackup log)
2:37:37 - First OOM activities (syslog)
2:37:46 - tar terminates unsuccessfully with RC 42 (raspiBackup log)
2:37:49 - raspiBackup finishes (raspiBackup log)
2:37:51 - Home Assistant and other services start up again (syslog)

As you can see it took 13 seconds from tar start to OOM start. It may be tar is the root cause ... but not sure.

RC 42 should not normally be returned by raspiBackup. If tar terminates, the return code is set to the tar return code. Looks like this did not happen because raspiBackup was killed by the OOM killer.

The question now is why the OOM happened :flushed:

You have a system with a lot of active services: pihole, apache, chromium, nginx, ... Until now nobody has reported backing up a system with this load, and you don't stop any running services. As you wrote, there is enough free memory, which suddenly vanishes.

There is something which may be the root cause of your issue: you use tar and zip the resulting archive. You also wrote you now have a huge file of about 1.3GB, and the issue popped up around the time you excluded the file. Maybe it actually popped up because you added this file?

Please execute the following tests and let me know the outcome:
1) Temporarily remove this huge file
2) Don't zip the backup and just create a tar file
3) Create a dd backup on some spare disk
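For illustration, the three tests could be invoked roughly like this. The -t (backup type) flag is an assumption about the raspiBackup CLI - please check sudo raspiBackup.sh -h on your version - and /dev/sda1 for the spare disk is a placeholder; the backup path /mnt/Harbormaster/Harbormaster is taken from your config:

# Test 1: temporarily move the huge file out of the way, run the usual backup, then move it back
sudo mv /var/www/apache/public/BrownNoise.mp3 /root/BrownNoise.mp3.bak
sudo raspiBackup.sh
sudo mv /root/BrownNoise.mp3.bak /var/www/apache/public/BrownNoise.mp3

# Test 2: plain tar backup without compression (assumed: -t selects the backup type)
sudo raspiBackup.sh -t tar /mnt/Harbormaster/Harbormaster

# Test 3: dd backup onto a locally attached spare disk (assumed device /dev/sda1)
sudo mkdir -p /mnt/spare && sudo mount /dev/sda1 /mnt/spare
sudo raspiBackup.sh -t dd /mnt/spare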

framps commented 1 year ago

Regarding high load: I am running Apache, Pihole, and Homeassistant, yes, but I have 6.6gb free of RAM (see above screenshot).

I see. It's nasty. A detailed analysis of the OOM log messages may help to locate the root cause - but I have never analyzed an OOM log. I did some investigation on this ... but it's hard stuff ...

p1r473 commented 1 year ago

Something is wrong for sure, the system is barely under any load during raspiBackup. [screenshots]

p1r473 commented 1 year ago

Test 1: Temporarily remove this huge file
rm /var/www/apache/public/BrownNoise.mp3
raspiBackup
Results: SUCCESS

What? Success? I will try again to be safe. But, I excluded this file?? And it is only 1.3gb??

framps commented 1 year ago

What? Success? I will try again to be safe. But, I excluded this file?? And it is only 1.3gb??

Try again to confirm it's this file which causes the issue. Why? As of now I have no idea ...

p1r473 commented 1 year ago

Just tried again and it crashed again :( Maybe it was a fluke success.

Test 2: Don't zip the backup and just create a tar file Results: PENDING

Test 3: Create a dd backup on some spare disk Results: PENDING

framps commented 1 year ago

:-( You may now either skip to step 3 to test whether it's not an issue with raspiBackup/tar, or continue with step 2.

And please note: you are testing right now at 4 PM, but raspiBackup ran last night at 2 AM ...

p1r473 commented 1 year ago

The only thing that runs in the morning on the Pi is raspiBackup, every night at 3AM. I'll try steps 3 and 4 now. Thanks!

framps commented 1 year ago

every night at 3AM

According to the logs it ran at 2AM - you may want to adjust your timezone :wink:

p1r473 commented 1 year ago

Oh, the logs I gave you were from ad hoc raspiBackup runs (not the cronjob), as I was trying to figure out the issue.

framps commented 1 year ago

Extract from your syslog:

Feb 28 02:27:13 Harbormaster raspiBackup: Started 0.6.8 (138a72b)
p1r473 commented 1 year ago

Test 2: Don't zip the backup and just create a tar file
Results: CRASH [screenshot]

As you can see it took 13 seconds from tar start to OOM start. It may be tar is the root cause ... but not sure.

It usually lasts for 15-20 min and fails near the end, though it literally just failed within 1 minute on this last attempt.

I'm wondering if I should redo the logs, as the ones from last night were from 4am runs, so they could have errors? The cronjob is supposed to happen at 3; I also did a bunch of manual attempts.

WRT the timezone, I checked raspi-config and my city is correctly set, so it was a manual raspiBackup log and not the cron job.

p1r473 commented 1 year ago

I will gather a new set of logs

I will donate heavily for your help ;) with a big bonus if you help solve the issue ;) Worst case (I HOPE) is I just keep restoring raspiBackup backups until I find an old backup that doesn't crash.

p1r473 commented 1 year ago

🚧 Will make logs public tomorrow :) 🚧

New config Changes

[Attempt 1 - tar with compression] Status: ❌
raspilog Syslog

[Attempt 2 - tar withOUT compression] Status: ❌ raspilog Syslog

[Attempt 3 - dd with compression] Status: 🟢

[Attempt 4 - RBK0085I] Status: ❌ Syslog

mkdir Harbormaster-tgz-backup-20230228-184200
tar -cpi --one-file-system -z -f "/mnt/Harbormaster/Harbormaster/Harbormaster-tgz-backup-20230228-184200/Harbormaster-tgz-backup-20230228-184200.tgz" --warning=no-xdev --numeric-owner --exclude="/mnt/Harbormaster/*" --exclude="/tmp/raspiBackup.log" --exclude="/tmp/raspiBackup.msg" --exclude='.gvfs' --exclude=/proc/* --exclude=/lost+found/* --exclude=/sys/* --exclude=/dev/* --exclude=/tmp/* --exclude=/boot/* --exclude=/run/* --exclude=/media/* --exclude=/mnt/* --exclude=/var/lib/samba/private/msg.sock/* --exclude=/etc/pihole/* --exclude=/var/log/* --exclude=/var/lib/php/sessions/* --exclude=/var/cache/* --exclude=/var/spool/postfix/dev/log --exclude=/home/pi/.local/share/Trash/* --exclude=/root/.local/share/Trash* --exclude=/.Trash/* --exclude=/var/www/apache/public/BrownNoise.mp3 /
framps commented 1 year ago

It usually lasts for 15-20 min and fails near the end, though it literally just failed within 1 minute on this last attempt.

Ok. According to the first log you provided it took just seconds, and that's why I thought it's not raspiBackup. Please make sure the syslog and the raspiBackup log cover the same time slot so I can merge them :wink:

Worst case (I HOPE) is I just keep restoring raspiBackup backups until I find an old backup that doesn't crash.

This will only work if the root cause is not something in your external environment. But let's be optimistic. Given the fact that raspiBackup succeeds with dd, it sounds like there is an issue with tar.

I'm busy today and will be able to check the logs at around 8 pm my time (CET).

If you find time you may also execute a plain tar backup. Search for RBK0085I in the debug log; the debug line which includes the --exclude statements is the command executed by raspiBackup. For example, this is the rsync command executed in my environment:

--- RBK0085I: Backup of type rsync started. Please be patient.
20221204-202706 DBG 2489: --> executeRsync
20221204-202706 DBG 2492: --- rsync --exclude="/backup/*" --exclude="/tmp/raspiBackup.log" --exclude="/tmp/raspiBackup.msg" --exclude='.gvfs' --exclude=/proc/* --exclude=/lost+found/* --exclude=/sys/* --exclude=/dev/* --exclude=/boot/* --exclude=/tmp/* --exclude=/run/* --exclude=/media/* --exclude=/@HOSTNAME@-backup.img --exclude=/@HOSTNAME@-backup.tmg --exclude=/@HOSTNAME@-backup.sfdisk --exclude=/@HOSTNAME@-backup.mbr --exclude=/raspiBackup.log --exclude=/raspiBackup.msg --exclude /var/log/journal --numeric-ids -aHAx --delete --info=NAME0 / "/backup/@HOSTNAME@/@HOSTNAME@-rsync-backup-20221204-202630"
p1r473 commented 1 year ago

Sure I will make the logs private around 8PM CET

p1r473 commented 1 year ago

Looks like the issue is with tar: the tar crashed with an OOM. I turned on a swap file (size 1024 MB) and it still crashes:

sudo dphys-swapfile swapoff        # disable the current swap file
sudo nano /etc/dphys-swapfile      # edit the swap configuration ...
->CONF_SWAPSIZE=1024               # ... and set the size to 1024 MB
sudo dphys-swapfile setup          # recreate the swap file with the new size
sudo dphys-swapfile swapon         # enable swap again
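For reference, a quick way to verify the new swap file is actually active (standard Linux commands, nothing raspiBackup-specific):

free -h           # the "Swap:" row should now show roughly 1.0Gi total
swapon --show     # lists the active swap file/partition and its size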
framps commented 1 year ago

I just downloaded the logs. As you wrote, even if you start tar alone you get an OOM :cry: tar is a very, very old Linux backup tool and I'm surprised it causes the OOM. Maybe it's some change in your environment which causes tar to crash? But nevertheless, tar shouldn't cause an OOM.

I will check the logs now.

p1r473 commented 1 year ago

I did do lots of tinkering on the Pi this week and I'm not sure what I've changed that caused this 😞 This is exactly why I love and use your tool so heavily: I can always just roll back the entire image. Hopefully something in the logs stands out, and if not I will start using a second SD card and keep rolling back until I find an image that doesn't crash on tar.

Usually Linux is so stable...

At least it isn't a typo this time

What tool do you use to piece the logs together?

framps commented 1 year ago
20230228-191150 DBG 2505:                  --> executeTar 
20230228-191150 DBG 2508:                      --- tar      -cpi --one-file-
20230228-191819 DBG 4626:                      --> cleanup 

According to the logs it took 6:29 minutes for the plain tar backup to fail

20230228-184213 DBG 2505:                  --> executeTar 
20230228-184213 DBG 2508:                      --- tar      -cpi --one-file-system
20230228-185359 DBG 4626:                      --> cleanup

and 11:46 minutes for the tgz backup to fail. That's interesting, because I expected the plain tar backup to fail later than the tgz backup, since gzip uses additional memory.

Size of tar file when tar aborted was 20230228-191821 DBG 4391: -rwxr-xr-x 1 root root 6617088000 Feb 28 19:11 @HOSTNAME@-tar-backup-20230228-191138.tar

and size of tgz when tar aborted was 20230228-185402 DBG 4391: -rwxr-xr-x 1 root root 3553509376 Feb 28 18:42 @HOSTNAME@-tgz-backup-20230228-184200.tgz

Root file system size is 20230228-184205 DBG 5566: /dev/root 117G 15G 97G 13% /

I found some comments about tar and OOM (see here). Not sure whether this is applicable to you. Splitting tar backups would require a feature request for raspiBackup. You may try these options when you use the plain tar command to check whether this helps (a rough sketch of what splitting could look like follows below).

15G is nothing for tar and tar should be able to save this amount of data without any issues.
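Just to illustrate the splitting idea (not necessarily what the linked thread suggests): a plain tar backup can be streamed through split so that no single archive file grows too large. The target path is taken from your config; the 2G chunk size is arbitrary.

tar -cpz --one-file-system -f - / | split -b 2G - /mnt/Harbormaster/Harbormaster/root-backup.tgz.part-
# restore later by concatenating the chunks back into one stream:
# cat /mnt/Harbormaster/Harbormaster/root-backup.tgz.part-* | tar -xpz -C /mnt/restore-target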

I see the following options for you to get rid of the issue:

1) Use rsync instead of tar. I see you use cifs, which forces you to use tar. Can't you use nfs, which would allow you to use rsync? In addition, rsync creates incremental backups, in contrast to tar, which creates a full backup every time.

2) Execute some additional tests just with the tar command (without raspiBackup involvement):
2a) Back up to a spare disk instead of the cifs partition. Maybe cifs runs out of memory because of some kind of buffering.
2b) Reduce the number of exclude options. It's just a guess, but you use a lot of exclude statements.

3) I noticed something in the logs. Maybe you know what's causing this:
3a) Check for swap usage

Mar  1 12:24:03 Harbormaster kernel: [ 1114.758132] 0 pages in swap cache
Mar  1 12:24:03 Harbormaster kernel: [ 1114.758135] Free swap  = 102396kB
Mar  1 12:24:03 Harbormaster kernel: [ 1114.758137] Total swap = 102396kB
Mar  1 12:24:03 Harbormaster kernel: [ 1114.758140] 1949696 pages RAM

No swap space is used. Did you disable swap usage?

3b) Disable RAM logging

20230228-184204 DBG 6201:              /dev/mmcblk0p1 on /boot type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=437,iocharset=ascii,shortname=mixed,errors=remount-ro)
20230228-184204 DBG 6201:              /dev/mmcblk0p2 on /var/hdd.log type ext4 (rw,noatime)
20230228-184204 DBG 6201:              log2ram on /var/log type tmpfs (rw,nosuid,nodev,noexec,noatime,size=5120000k,mode=755)

You use RAM for logging. I use plain Raspberry Pi OS Lite on my servers and they don't do any RAM caching. Maybe tar is just the straw that breaks the camel's back? (See the diagnostic sketch after this list.)

In addition, I will spend some time to understand the OOM system messages and hopefully find something which helps to locate the root cause.

4) Revert your system from existing backups until raspiBackup with tar works again.
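A minimal diagnostic sketch for 3a) and 3b), assuming syslog is at /var/log/syslog and log2ram uses its default config file /etc/log2ram.conf with a SIZE variable - adjust if your install differs:

grep -iE 'out of memory|oom-killer|killed process' /var/log/syslog | tail -20   # which process the OOM killer picked and whether swap was in use
df -h /var/log                                                                   # how large the log2ram tmpfs is and how full it gets
grep -i '^SIZE' /etc/log2ram.conf                                                # log2ram's configured tmpfs size (the mount above shows size=5120000k, i.e. ~5GB of RAM)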

framps commented 1 year ago

did do lots of tinkering on the pi this week and I'm not sure what I've changed that caused this

I saw in the logs that you now use a lot of additional features of raspiBackup. Great that you love them. But as we can see, that's not the root cause of your issue :cry:

What tool do you use to piece the logs together?

Just a plain editor and copy and paste :smile: Given the fact that a plain tar command also fails for you, a merge of the two logs is no longer needed. I thought maybe raspiBackup causes an OOM in a later step (backup cleanup).

p1r473 commented 1 year ago

I see you use cifs, which forces you to use tar. Can't you use nfs, which would allow you to use rsync? In addition, rsync creates incremental backups, in contrast to tar, which creates a full backup every time.

I would much prefer this actually. Especially as you implemented my feature request here https://github.com/framps/raspiBackup/issues/366

2) Execute some additional tests just with the tar command (without raspiBackup involvement): 2a) Back up to a spare disk instead of the cifs partition. Maybe cifs runs out of memory because of some kind of buffering.

Could you elaborate? Right now I am hosting my backups on my Windows PC server as it has a lot of hard drive space. Can you point me in the direction of the command?

2b) Reduce the number of exclude Options. It's just a guess but you use a lot of exclude statements.

I actually just added 4 of those excludes yesterday. My other Pis have them too with no issues.

No swap space is used. Did you disable swap usage?

I just noticed this today too. I enabled the swap and still crashes https://github.com/framps/raspiBackup/issues/620#issuecomment-1450543979

3b) Disable RAM logging

I really like log2ram, it increases the life of my SD cards. Other 2 Pis work with it very well and this one used to too!

In addition, I will spend some time to understand the OOM system messages and hopefully find something which helps to locate the root cause.

I am happy to compensate you for your time doing this! But this is up to you, as I will start on 4 - reverting. I'm going to take out this SD card, and I can plug it back in for testing if you find anything.

4) Revert your system from existing backups until raspiBackup with tar works again.

I am going to see if I can find a backup that works. I'll restore each backup until I find one that works.

p1r473 commented 1 year ago

The reason I am using CIFS (I think) is because my file share is on Windows. It looks like NFS is for Linux mounts, I think?

framps commented 1 year ago

Yes, nfs is a Linux feature. I thought you might use some NAS; they usually run on Linux and support nfs. Maybe there is nfs support on Windows, but I don't know. I'm not familiar with Windows.

p1r473 commented 1 year ago

I see you use cifs, which forces you to use tar. Can't you use nfs, which would allow you to use rsync? In addition, rsync creates incremental backups, in contrast to tar, which creates a full backup every time.

Based on this, I guess I can't use rsync with CIFS then: https://www.linux-tips-and-tricks.de/images/decisiontree_en.dia.jpg

That's too bad, I love the idea of incremental backups. I have been looking at Synology!

@framps preliminary googling suggests rsync does indeed work to a CIFS mount? https://serverfault.com/questions/877519/rsync-how-to-handle-backup-to-cifs-volume https://bobcares.com/blog/rsync-from-linux-to-windows-share/

Is the issue with hard links? Based on these articles showing rsync to CIFS, perhaps it doesn't require ext3/4 but could work on NTFS?
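For what it's worth, a quick way to check whether the CIFS-mounted backup target supports hardlinks at all (the mount point is taken from my config above; the test file names are arbitrary):

cd /mnt/Harbormaster/Harbormaster
touch hardlink-test-a
ln hardlink-test-a hardlink-test-b && echo "hardlinks supported" || echo "hardlinks NOT supported"
rm -f hardlink-test-a hardlink-test-b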

framps commented 1 year ago

I would much prefer this actually. Especially as you implemented my feature request here #366

It's not implemented. It's on my backlog list. As of now you are the only one who has asked for this feature. I don't start work on any feature where the quotient (#number of users / #effort) is nearly zero.

Could you elaborate? Right now I am hosting my backups on my Windows PC server as it has a lot of hard drive space. Can you point me in the direction of the command?

It's just a guess, so it's up to you to decide whether you want to test this: attach a spare disk, mount it, and then create the backup on the disk instead of the cifs-mounted partition.

I actually just added 4 of those excludes yesterday. My other Pis have them too with no issues.

As I wrote: It's just a guess. I think you can ignore this.

I just noticed this today too. I enabled the swap and still crashes #620 (comment)

Did you verify it's used in the OOM messages? Frankly, I think adding additional swap space is the wrong approach. If a system starts to swap, there is a memory issue.

I really like log2ram, it increases the life of my SD cards. Other 2 Pis work with it very well and this one used to too!

I see your point. Do you know you can configure an overlay filesystem with raspi-config? Not sure whether this is an option for you.

I am happy to compensate you for your time doing this! But this is up to you, as I will start on 4 - reverting. I'm going to take out this SD card, and I can plug it back in for testing if you find anything.

In one of my FAQs I wrote that I only support issues related to raspiBackup. As we found out, you have an issue which is not related to raspiBackup but to tar. OOM is an interesting topic and I will look into the OOM logs, but this will take time :cry: If it's OK with you to revert until you get a system which can be saved with raspiBackup again - that's my preference :wink: Crossing fingers you are successful. According to the logs you have about 17 backups to go back ...

p1r473 commented 1 year ago

No problems at all. Thanks for your help!

framps commented 1 year ago

Is the issue with hard links? Based on these articles showing rsync to CIFS, perhaps it doesn't require ext3/4 but could work on NTFS?

rsync can be used on a CIFS-mounted partition, but it will be a full backup every time because of the missing support for hardlinks. A user of raspiBackup had serious trouble when he restored a system which he had saved on an NTFS-formatted partition: all file ownerships had been silently changed to root, including all user files :cry: Therefore I added a check in raspiBackup, and now any rsync backup to an NTFS partition is rejected with an error message.

In former times raspiBackup had code which allowed an rsync backup even if the target filesystem doesn't support hardlinks; a full backup was created every time. This is no longer supported. With raspiBackup, a tar backup is always a full backup and an rsync backup is always incremental.
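To illustrate why the hardlink support matters, here is a generic sketch of hardlink-based incremental backups with rsync's --link-dest - not the exact command raspiBackup runs, and the /backup/snap-* paths are made up:

# first run: full copy
rsync -aHAx --numeric-ids / /backup/snap-2023-03-01/
# next run: unchanged files become hardlinks into the previous snapshot,
# so only changed files consume additional space on the target
rsync -aHAx --numeric-ids --link-dest=/backup/snap-2023-03-01/ / /backup/snap-2023-03-02/
# on CIFS/NTFS these hardlinks cannot be created, so every run would effectively be a full backup again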

p1r473 commented 1 year ago

Thank you, very informative!

framps commented 1 year ago

Just let me know if you succeeded to revert :wink:

framps commented 1 year ago

It would be interesting to see whether you can back up your system with rsync. If you have a spare disk handy, just create an ext4 partition on it and create an rsync backup :wink:

framps commented 1 year ago

Thank you very much for your donation, but I would have helped you with this issue anyhow. I try to fix any issue with raspiBackup if possible. Unfortunately it looks like there is some issue with tar :cry:

p1r473 commented 1 year ago

Just let me know if you succeeded to revert 😉

Succeeded, had to go all the way back to the 24th of last month. Thank goodness for raspiBackup!!!!

framps commented 1 year ago

Do you remember what you changed/updated? It would be interesting to know the root cause of your issue and whether it's a tar issue or something else.

p1r473 commented 1 year ago

I did a lot of tinkering the last 2 weeks 🤓 I played around with a lot of new things and I'm not sure which one broke it.

framps commented 1 year ago

According to the logs, your Raspberry seems to be a production system which should run 24/7.

Whenever I have major updates on one of my production Raspberries, I first create a raspiBackup snapshot with a meaningful comment, kept for a longer period, just in case I have to revert the update. Then I clone the system with raspiBackup (restore the latest backup onto another SD card) and update the cloned system running on a spare Raspberry. When all regression tests are successful, I shut down the production server, insert the SD card with the updated system, and start the server again.

This reduces the downtime of the server to a minimum and protects against issues caused by the update :wink: