datto / dattobd

kernel module for taking block-level snapshots and incremental backups of Linux block devices
GNU General Public License v2.0

Ubuntu 20.04 Memory Leak #294

Closed Sir-Alex-L closed 10 months ago

Sir-Alex-L commented 1 year ago

It seems that dattobd is not releasing RAM upon backup completion. RAM usage jumps up with every backup taken. The VM is running on Hyper-V. The only fix is to reboot the VM. I tried reloading the kernel module, but it made no difference.

Journalctl entry

Jan 13 10:20:08 dlad[1016]: Beginning a dla backup: 8f2e65de-4f5f-4978-a5a8-5927c92b74e8
Jan 13 10:20:15 sudo[51371]: pam_unix(sudo:session): session closed for user root
Jan 13 10:20:18 multipathd[597]: datto0: HDIO_GETGEO failed with 25
Jan 13 10:20:18 multipathd[597]: datto0: failed to get udev uid: Invalid argument
Jan 13 10:20:18 multipathd[597]: datto0: failed to get unknown uid: Invalid argument
Jan 13 10:20:21 multipathd[597]: datto1: HDIO_GETGEO failed with 25
Jan 13 10:20:21 multipathd[597]: datto1: failed to get udev uid: Invalid argument
Jan 13 10:20:21 multipathd[597]: datto1: failed to get unknown uid: Invalid argument
Jan 13 10:20:21 kernel: EXT4-fs (datto1): write access unavailable, skipping orphan cleanup
Jan 13 10:20:21 kernel: EXT4-fs (datto1): mounted filesystem without journal. Opts: norecovery
Jan 13 10:20:26 sudo[51737]:  itadmin : TTY=pts/1 ; PWD=/home/itadmin ; USER=root ; COMMAND=/usr/sbin/swapon -a
Jan 13 10:20:26 sudo[51737]: pam_unix(sudo:session): session opened for user root by itadmin(uid=0)
Jan 13 10:20:26 sudo[51737]: pam_unix(sudo:session): session closed for user root
Jan 13 10:20:26 kernel: Adding 2097148k swap on /swap.img.  Priority:-2 extents:4 across:2244604k FS
Jan 13 10:20:28 sudo[51751]:  itadmin : TTY=pts/1 ; PWD=/home/itadmin ; USER=root ; COMMAND=/usr/bin/htop
Jan 13 10:20:28 sudo[51751]: pam_unix(sudo:session): session opened for user root by itadmin(uid=0)
Jan 13 10:20:30 systemd[1]: tmp-dattoMountRoot-6919c795\x2d02ff\x2d4dcd\x2db6e8\x2d4c87d7709ba0.mount: Succeeded.
Jan 13 10:20:30 systemd[5759]: tmp-dattoMountRoot-6919c795\x2d02ff\x2d4dcd\x2db6e8\x2d4c87d7709ba0.mount: Succeeded.
Jan 13 10:20:30 systemd[5759]: tmp-dattoMountRoot-D682\x2d1F98.mount: Succeeded.
Jan 13 10:21:34 multipathd[597]: datto0: path already removed
Jan 13 10:20:30 systemd[1]: tmp-dattoMountRoot-D682\x2d1F98.mount: Succeeded.
Jan 13 10:21:34 multipathd[597]: datto1: path already removed
Jan 13 10:21:34 [1016]: Ending a dla backup: 8f2e65de-4f5f-4978-a5a8-5927c92b74e8. Result:  SUCCESSFUL!

RAM usage before backup

itadmin@flex:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       4.5Gi       1.9Gi       3.0Mi       1.4Gi       2.9Gi
Swap:         2.0Gi          0B       2.0Gi

RAM usage after the backup

itadmin@flex:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.7Gi       6.4Gi       134Mi       3.0Mi       1.2Gi       1.1Gi
Swap:         2.0Gi       1.0Mi       2.0Gi
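A minimal sketch for making the before/after comparison above repeatable, assuming a Linux host where /proc/meminfo is readable. It diffs a few fields between two snapshots. The choice of fields (MemAvailable, Slab, SUnreclaim) is an assumption about where a kernel-side leak might become visible; allocations made directly from the page allocator may only show up as a drop in MemAvailable.

```python
from typing import Dict, Iterable


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: numeric value}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are counts.
    """
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Take one snapshot from the running system."""
    with open(path) as handle:
        return parse_meminfo(handle.read())


def meminfo_delta(
    before: Dict[str, int],
    after: Dict[str, int],
    keys: Iterable[str] = ("MemAvailable", "Slab", "SUnreclaim"),
) -> Dict[str, int]:
    """Return after - before for each key present in both snapshots."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}
```

Typical use would be to call read_meminfo() before a backup, call it again once the backup has finished, and print meminfo_delta(before, after). A steadily negative MemAvailable delta across several backups would match the behavior reported here.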

Kernel: Linux version 5.4.0-136-generic (buildd@lcy02-amd64-068) (gcc version 9.4.0 (Ubuntu 9.4.0-1ubuntu1~20.04.1)) #153-Ubuntu SMP Thu Nov 24 15:56:58 UTC 2022

Kernel log

Jan 13 10:20:21 flex kernel: [43067.243895] EXT4-fs (datto1): write access unavailable, skipping orphan cleanup
Jan 13 10:20:21 flex kernel: [43067.243931] EXT4-fs (datto1): mounted filesystem without journal. Opts: norecovery
Jan 13 10:38:12 flex kernel: [44134.709523] datto: failed to locate system call table, persistence disabled
Jan 13 10:45:37 flex kernel: [44580.047702] datto: failed to locate system call table, persistence disabled
Jan 13 10:54:53 flex kernel: [45135.956321] datto: failed to locate system call table, persistence disabled
Jan 13 11:09:16 flex kernel: [45999.051663] EXT4-fs (datto1): write access unavailable, skipping orphan cleanup
Jan 13 11:09:16 flex kernel: [45999.051699] EXT4-fs (datto1): mounted filesystem without journal. Opts: norecovery
Jan 13 11:19:16 flex kernel: [46599.064570] EXT4-fs (datto1): write access unavailable, skipping orphan cleanup
Jan 13 11:19:16 flex kernel: [46599.064609] EXT4-fs (datto1): mounted filesystem without journal. Opts: norecovery

dba_2023-01-13.log

Installed package version dlad/focal,now 2.7.1.1-1.1ubuntu20.04 amd64 [installed]

The kernel module that gets installed with the above package is:

Module:  dattobd
Version: 0.10.15
Kernel:  5.4.0-136-generic (x86_64)

adnanshaheen commented 1 year ago

Thank you for posting this. This is a known issue, and we are working on a fix.

ScottMonolith commented 1 year ago

@adnanshaheen any updates? Has this issue been resolved but is still marked 'open'?

I am running the latest dlad agent, although the version shows something odd:

dpkg -l dlad 3.0.16.0-91.5ubuntu18.04

Odd because it shows ubuntu18.04, but I am running Ubuntu 22.04.2.

adnanshaheen commented 1 year ago

I don't have any insight any longer. I am not part of the project anymore.

Sir-Alex-L commented 1 year ago

According to this Datto article, it's supposed to be supported. https://continuity.datto.com/help/Content/kb/unified-continuity/siris-alto-nas/360040893811.html

Although with Kaseya in charge, I wouldn't be surprised if they moved all development in-house and abandoned this GitHub page, which would explain why the issue remains open.

ScottMonolith commented 1 year ago

Yes, Ubuntu 22.04 should be supported based on that article. The agent does run and does back up successfully, but it causes a lot of memory consumption.

After keeping the dlad service stopped I can confirm the huge amount of memory usage has also stopped.

I've already had the MSP that manages Datto for us open a ticket. I found this bug and it's definitely a problem still.

Sir-Alex-L commented 1 year ago

So I tried Debian 11 and it crashed on me within 1 day! OOM killed it with over 8GB of RAM usage. I only had 8GB on the machine.

Sir-Alex-L commented 1 year ago

Well this was never fixed....

We are aware of an issue where DLA running on Ubuntu 20.04 and Debian 11 causes memory usage to bloat, which may result in performance issues on the protected machine.

To mitigate these symptoms you can reboot the protected machine to free the memory usage.

Currently there is no permanent workaround that our Support team can apply to avoid the original behavior, but we have made our Engineering team aware of this issue and are working together so that a permanent fix can be applied in the future.

ScottMonolith commented 1 year ago

Indeed, I can confirm that Ubuntu 22.04 also suffers from this memory leak bug, and Debian 12 likely does too (although I haven't tested that personally). The bug is likely at the kernel level, so any Linux distro that is using a 5.x+ kernel...

Did the support rep provide a place for you to track progress on the bug? Or should we just wait for a new version of the Datto Linux agent? Hopefully it won't be another 3 years for the next release...

Sir-Alex-L commented 1 year ago

They told me they'll let me know when the fix is in place, with no ETA.

Funny thing: Elastio already fixed the issue months ago, and they already have support for kernel 6.0. I wonder why Datto won't just merge that fork. https://github.com/elastio/elastio-snap

I tried compiling that driver after renaming every function call back to the dattobd name. It compiles and runs. The issue is that the dlad service does some driver verification, which doesn't pass.

Fri 28/04/23 11:51:31 am - Handling backup start call
Fri 28/04/23 11:51:31 am - Backup transport interface: mercuryftp
Fri 28/04/23 11:51:31 am - Generated new backup ID: 4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2
Fri 28/04/23 11:51:31 am - Backup engine beginning creation of backup 4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2
Fri 28/04/23 11:51:31 am - Driver version: 0.12.2.0, Agent version: 3.0.16.0
Fri 28/04/23 11:51:31 am - Creating backup 4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2
Fri 28/04/23 11:51:31 am - Launching backup thread...
Fri 28/04/23 11:51:31 am - Backup run beginning...
Fri 28/04/23 11:51:31 am - SleepPolicyNever is controlling sleep state.
Fri 28/04/23 11:51:31 am - Beginning running backup phases...
Fri 28/04/23 11:51:31 am - Beginning execution of the Populate Volumes phase.
Fri 28/04/23 11:51:31 am - Reporting device /dev/sda1 is not a dattobd device because: Cannot find "datto" in /proc/devices. Dattobd is not loaded.
Fri 28/04/23 11:51:31 am - Reporting device /dev/sda2 is not a dattobd device because: Cannot find "datto" in /proc/devices. Dattobd is not loaded.
Fri 28/04/23 11:51:31 am - Finished execution of the Populate Volumes phase.
Fri 28/04/23 11:51:31 am - Beginning execution of the Restore Resume State phase.
Fri 28/04/23 11:51:31 am - Not opening resume file due to differing context. File: /datto.rsm | Context: Backup ID 'cb135537-cd18-4880-ba46-2afb87e26ddd' (Actual) '4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2' (Expected)
Fri 28/04/23 11:51:31 am - Finished execution of the Restore Resume State phase.
Fri 28/04/23 11:51:31 am - Beginning execution of the Create Transports phase.
Fri 28/04/23 11:51:31 am - Backup transport interface: MercuryFTP (TLS)
Fri 28/04/23 11:51:31 am - Finished execution of the Create Transports phase.
Fri 28/04/23 11:51:31 am - Beginning execution of the Validate Local Targets phase.
Fri 28/04/23 11:51:31 am - Not opening resume file due to differing context. File: /datto.rsm | Context: Backup ID 'cb135537-cd18-4880-ba46-2afb87e26ddd' (Actual) '4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2' (Expected)
Fri 28/04/23 11:51:31 am - Finished execution of the Validate Local Targets phase.
Fri 28/04/23 11:51:31 am - Beginning execution of the Validate Remote Targets phase.
Fri 28/04/23 11:51:31 am - Not opening resume file due to differing context. File: /datto.rsm | Context: Backup ID 'cb135537-cd18-4880-ba46-2afb87e26ddd' (Actual) '4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2' (Expected)
Fri 28/04/23 11:51:31 am - Finished execution of the Validate Remote Targets phase.
Fri 28/04/23 11:51:31 am - Preloading done, starting final pass for backup id 4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2
Fri 28/04/23 11:51:31 am - Not opening resume file due to differing context. File: /datto.rsm | Context: Backup ID 'cb135537-cd18-4880-ba46-2afb87e26ddd' (Actual) '4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2' (Expected)
Fri 28/04/23 11:51:31 am - Beginning execution of the Start Snapshotting phase.
Fri 28/04/23 11:51:31 am - Quiescing volumes...
Fri 28/04/23 11:51:31 am - Volumes Quiesced
Fri 28/04/23 11:51:31 am - Unable to parse dattobd info from : exception: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Fri 28/04/23 11:51:31 am - Failed to validate driver!
Fri 28/04/23 11:51:31 am - Unquiescing volumes...
Fri 28/04/23 11:51:31 am - Volumes Unquiesced
Fri 28/04/23 11:51:31 am - Exception caught trying to create snapshots: Failed to validate driver!
Fri 28/04/23 11:51:31 am - Beginning execution of the Backup Pass Finalization phase.
Fri 28/04/23 11:51:31 am - Finished execution of the Backup Pass Finalization phase.
Fri 28/04/23 11:51:31 am - Backup error occurred during backup run: An unexpected error occurred while attempting to snapshot: Failed to validate driver! (SNAPSHOT)
Fri 28/04/23 11:51:31 am - Beginning execution of the Transition To Incremental phase.
Fri 28/04/23 11:51:31 am - Unable to parse dattobd info from : exception: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Fri 28/04/23 11:51:31 am - Cannot create incremental history, device does not exist
Fri 28/04/23 11:51:31 am - Could not find starting state for volume: /dev/sda2
Fri 28/04/23 11:51:31 am - Failed to revert device /dev/sda2
Fri 28/04/23 11:51:31 am - Unable to parse dattobd info from : exception: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Fri 28/04/23 11:51:31 am - Unable to find existing device /dev/sda2 to transition.
Fri 28/04/23 11:51:31 am - Failed to transition volume 9aa5cab2-6485-4de2-b2a9-761f99864171 (/) to incremental. This volume may diff-merge next backup.
Fri 28/04/23 11:51:31 am - Unable to parse dattobd info from : exception: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Fri 28/04/23 11:51:31 am - Unable to get device info for: /dev/sda2
Fri 28/04/23 11:51:31 am - Unable to parse dattobd info from : exception: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Fri 28/04/23 11:51:31 am - Unable to find device for getSnapshotPath: 9aa5cab2-6485-4de2-b2a9-761f99864171 (/)
Fri 28/04/23 11:51:31 am - Unable to parse dattobd info from : exception: [json.exception.parse_error.101] parse error at line 1, column 1: syntax error while parsing value - unexpected end of input; expected '[', '{', or a literal
Fri 28/04/23 11:51:31 am - Finished execution of the Transition To Incremental phase.
Fri 28/04/23 11:51:31 am - Backup 4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2 FAILED!
Fri 28/04/23 11:51:31 am - Backup run finished
Fri 28/04/23 11:51:31 am - SleepPolicyNever is no longer controlling sleep state.
Fri 28/04/23 11:51:33 am - Requesting cancelation of backup ID: 4682b5f1-c4ef-4dbe-a024-e4ef945a1ab2 

ScottMonolith commented 1 year ago

Wow, I had no idea Datto's agent was forked by another company... that repo looks to be much more active than Datto's too. I doubt they'll be able to just merge all of the changes, but that does give me hope they'll be able to fix it sooner rather than later.

Sir-Alex-L commented 1 year ago

Eh, not sure. That fork has been active for 2 years now. For some reason Datto declined to clone or merge it and relied on reinventing the wheel. Either they were unaware of the fork, even though it was mentioned a few times in the past (#278 and #265), or they haven't touched it for some philosophical or legal reason, even though both projects are licensed under GPL 2.0.

Swistusmen commented 11 months ago

@Sir-Alex-L We have implemented some improvements for this issue in commit 2ad52cb2f0cf8c347083ca84254e89908385ff9e. You can check it out and share how it works for you.

madgamer98 commented 11 months ago

I'm running an Ubuntu 20.04.6 LTS (Focal Fossa) server and wanted to give my experience compiling and running https://github.com/datto/dattobd/commit/2ad52cb2f0cf8c347083ca84254e89908385ff9e.

Kernel: Linux ubuntu 5.4.0-155-generic #172-Ubuntu SMP Fri Jul 7 16:10:02 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

When I run make, I get an error because bio_free_pages is declared static in src/bio_helper.c, whereas its declaration in the Linux development headers (include/linux/bio.h) is not static. This error prevented make from finishing.
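A rough way to check in advance whether the installed kernel headers already declare a non-static bio_free_pages is sketched below. This is a hypothetical standalone check, not the project's own genconfig.sh feature tests, and the header location under /lib/modules/<release>/build is an assumption based on the paths in the make output.

```python
import os
import platform
import re
from typing import Optional

# Matches a non-static prototype at the start of a line, for example:
#   extern void bio_free_pages(struct bio *bio);
_PROTOTYPE = re.compile(r"^\s*(?:extern\s+)?void\s+bio_free_pages\s*\(", re.MULTILINE)


def kernel_declares_bio_free_pages(header_text: str) -> bool:
    """True if the header text contains a non-static bio_free_pages prototype."""
    return _PROTOTYPE.search(header_text) is not None


def default_bio_header(release: Optional[str] = None) -> str:
    """Guess the path of bio.h for a kernel release (assumed header layout)."""
    release = release or platform.release()
    return os.path.join("/lib/modules", release, "build", "include", "linux", "bio.h")
```

If kernel_declares_bio_free_pages() returns True for the contents of default_bio_header(), a module that defines its own static function with the same name would be expected to hit the conflict described here.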

Here's the output of make for that error:

root@ubuntu:~/dattobd# make
make -C src
make[1]: Entering directory '/root/dattobd/src'
if [ ! -f kernel-config.h ] || tail -1 kernel-config.h | grep -qv '#endif'; then mkdir configure-tests/feature-tests/build; ./genconfig.sh "5.4.0-155-generic" "-w"; fi;
make -C /lib/modules/5.4.0-155-generic/build M=/root/dattobd/src modules
make[2]: Entering directory '/usr/src/linux-headers-5.4.0-155-generic'
  CC [M]  /root/dattobd/src/bio_helper.o
In file included from /root/dattobd/src/bio_helper.c:9:
/root/dattobd/src/bio_helper.h:185: warning: "bio_for_each_segment_all" redefined
  185 |         #define bio_for_each_segment_all(bvl, bio, i)    \
      |
In file included from ./include/linux/blkdev.h:21,
                 from /root/dattobd/src/includes.h:11,
                 from /root/dattobd/src/bio_helper.c:7:
./include/linux/bio.h:135: note: this is the location of the previous definition
  135 | #define bio_for_each_segment_all(bvl, bio, iter) \
      |
/root/dattobd/src/bio_helper.c:632:13: error: static declaration of ‘bio_free_pages’ follows non-static declaration
  632 | static void bio_free_pages(struct bio *bio)
      |             ^~~~~~~~~~~~~~
In file included from ./include/linux/blkdev.h:21,
                 from /root/dattobd/src/includes.h:11,
                 from /root/dattobd/src/bio_helper.c:7:
./include/linux/bio.h:461:13: note: previous declaration of ‘bio_free_pages’ was here
  461 | extern void bio_free_pages(struct bio *bio);
      |             ^~~~~~~~~~~~~~
make[3]: *** [scripts/Makefile.build:270: /root/dattobd/src/bio_helper.o] Error 1
make[2]: *** [Makefile:1774: /root/dattobd/src] Error 2
make[2]: Leaving directory '/usr/src/linux-headers-5.4.0-155-generic'
make[1]: *** [Makefile:17: default] Error 2
make[1]: Leaving directory '/root/dattobd/src'
make: *** [Makefile:24: driver] Error 2

By removing the static qualifier from the bio_free_pages function in src/bio_helper.c, I was able to compile it on my machine and confirm through /proc/datto_info that the kernel module was running version "0.11.3".
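For anyone scripting the version check, here is a minimal sketch. It assumes the driver's info file contains a JSON object with a top-level "version" string. That layout is an assumption (the agent's JSON parse errors earlier in this thread hint at JSON, nothing more), so inspect the file by hand first. The file path is left to the caller.

```python
import json
from typing import Optional


def module_version(info_text: str) -> Optional[str]:
    """Return the "version" string from the driver's info JSON, or None.

    None is returned for empty or malformed input, or when the document
    is not an object with a string "version" field.
    """
    try:
        info = json.loads(info_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(info, dict):
        return None
    version = info.get("version")
    return version if isinstance(version, str) else None
```

Comparing the returned value against the expected version after each reboot or kernel upgrade would catch an older driver being loaded again without anyone noticing.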

However, I'm still seeing a memory leak similar to before. The module still seems to slowly accumulate memory that doesn't get released as it continues to run backups.

Previously it would continue to allocate memory until the OOM Killer was invoked and eventually killed the Datto agent. So far the "0.11.3" version of the module seems to stop allocating memory right before that point, leaving a small amount of memory available. I will provide a further update if the memory usage does eventually trigger the OOM Killer.

Edit: Over the weekend it did eventually trigger the OOM Killer killing the Datto Agent.

Swistusmen commented 10 months ago

@madgamer98 thanks for the verification. Are you able to share the steps you used to verify this behavior so we can reproduce it? We were just running backups with the Datto Linux Agent and watching how much memory was taken; it never went as far as OOM.

madgamer98 commented 10 months ago

@Swistusmen Just wanted to give an update. I'll say up front that after a reinstall of 0.11.3 (https://github.com/datto/dattobd/commit/2ad52cb2f0cf8c347083ca84254e89908385ff9e), the memory leak is fixed on my machines.

Further detail: The server in question silently upgraded its kernel to 5.4.0-156-generic about a week ago. As I had not set up the module with DKMS, it was no longer loaded once the kernel was upgraded. I took this opportunity to try a full reinstall and also made sure to set up the 0.11.3 driver through DKMS.

After doing a full reinstall and building the driver with DKMS, I am able to report that the memory issues are fixed on my end. I went on to install the driver the same way on another server that was having issues, and its memory issue is fixed as well.

I wish I could give some further insight into why my initial install didn't work. I have two ideas about what could have caused it. The first is simply the difference in how I went about installing the driver. Before, I simply did a direct make and make install on the driver instead of setting up DKMS. While I did remove and delete the old driver, I never cleared out the old driver's DKMS configuration that time. Perhaps that caused it to rebuild and reinstall the old driver after a reboot? My understanding is that DKMS should only run when a kernel is upgraded, though.

The only other thing I could think of is that the issue was that the agent continued to use an existing local recovery snapshot / COW file from the previous driver version and perhaps that is what caused my issue? Either way I made sure to clean out the existing snapshots and DKMS configs before installing the fixed version this time and it worked.
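To check for the leftover-DKMS scenario described above, the output of `dkms status` can be scanned for registered versions other than the one you expect. The sketch below assumes the DKMS module name is dattobd and handles two output layouts ("name, version, kernel, arch: state" and "name/version, kernel, arch: state"). The format differs between DKMS releases, so the pattern may need adjusting.

```python
import re
from typing import List, NamedTuple, Optional


class DkmsEntry(NamedTuple):
    module: str
    version: str
    kernel: Optional[str]  # None for entries that are only "added"
    state: str


# name and version are separated by "," (older DKMS) or "/" (newer DKMS);
# kernel and arch are optional; the state follows the colon.
_LINE = re.compile(
    r"^(?P<module>[^,/\s]+)[,/]\s*(?P<version>[^,:\s]+)"
    r"(?:,\s*(?P<kernel>[^,:\s]+))?(?:,\s*[^,:\s]+)?\s*:\s*(?P<state>.+)$"
)


def parse_dkms_status(text: str, module: str = "dattobd") -> List[DkmsEntry]:
    """Parse `dkms status` output, keeping only entries for one module."""
    entries = []
    for line in text.splitlines():
        match = _LINE.match(line.strip())
        if match and match.group("module") == module:
            entries.append(
                DkmsEntry(
                    match.group("module"),
                    match.group("version"),
                    match.group("kernel"),
                    match.group("state").strip(),
                )
            )
    return entries


def stale_versions(entries: List[DkmsEntry], wanted: str) -> List[str]:
    """Versions registered with DKMS other than the wanted one, sorted."""
    return sorted({entry.version for entry in entries if entry.version != wanted})
```

The input text could come from subprocess.run(["dkms", "status"], capture_output=True, text=True).stdout. Any version returned by stale_versions(entries, "0.11.3") is a candidate for cleanup before reinstalling.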

Sorry for any trouble or concern I caused over what seemingly ended up being user error on my end. I do want to give a big thanks for fixing the memory leak, as it's been a huge thorn in my side!

Swistusmen commented 10 months ago

It's great to hear everything works for you. Given this, I am closing this issue. Thanks for your cooperation.

ScottMonolith commented 10 months ago

@Swistusmen - do you know when a new version of the agent will be released?

Seems I need at least 0.11.3, but on the Datto website the agent version is very different - I show 3.0.16.0 as the agent version.

I need an official release to install on my production systems, but I would love to get Datto back up and running for my Linux hosts!

Thanks!

Swistusmen commented 10 months ago

@ScottMonolith Hi, I would prefer not to talk about official agent release dates because I may not be allowed to, I'm sorry. May I ask why you need a new DLA release? Is it about an OS that is not supported yet, or are you waiting for some improvement?

ScottMonolith commented 10 months ago

@Swistusmen that is fair that you cannot talk about official agent release dates.

But my problem is this memory leak! If I leave Datto running for more than a week, my 20.04 and 22.04 servers start swapping because physical memory is completely consumed. The only solution I've found is a reboot, which is not ideal.

This has a massive negative impact on our production servers, and hence I have disabled Datto on these servers until this bug is resolved. Sounds like the bug is resolved, but I need an official agent release to be able to deploy it to my production servers.