Jarli01 / xenorchestra_installer

A simple install script for Xen Orchestra
GNU General Public License v3.0

Systemd service creation causes memory leak #66

Closed ghost closed 4 years ago

ghost commented 4 years ago

Describe the bug By creating a systemd service for xo-server there is a memory leak which causes the OOM killer to kill the NodeJS instance running xo-server once all available memory and swap space is consumed. This causes first-run Delta and Continuous Replication jobs to fail with "Interrupted" on VM disks larger than the memory and swap space combined.

To Reproduce Steps to reproduce the behavior:

  1. Create a VM with a 500GB disk, 4GB RAM, and 2 vCPUs
  2. Fill the disk with at least 250GB of data
  3. Create a Delta Backup task for the VM
  4. Run the backup task
  5. Task fails with "Interrupted"

Expected behavior Backup task completes without OOM killer killing the xo-server service and restarting it.
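
One way to check whether the OOM killer really is what stops xo-server is to look for its messages in the kernel log. The following is a minimal sketch, not part of the installer: the function name `count_oom_lines` is hypothetical, and the patterns assume the kernel's usual "Out of memory" and "oom-kill" wording, which can differ between kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: count kernel log lines that look like OOM-killer output.
# Reads log text on stdin so it can be fed from dmesg, journalctl, or a file.
# Note: a single kill usually logs several matching lines, so this counts
# lines, not individual kill events.
count_oom_lines() {
    # -E: extended regex, -i: ignore case, -c: print the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the status clean.
    grep -Eic 'out of memory|oom-kill|oom_reaper' || true
}

# Usage (needs permission to read the kernel log):
#   dmesg | count_oom_lines
#   journalctl -k --no-pager | count_oom_lines
```

A result of 0 after a failed backup would suggest the "Interrupted" status has some other cause.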

I have found the resolution to this issue and I am only creating this bug report so that you can make edits to your xo_install.sh script to resolve this issue for other users using your installer.

The solution is to replace the service creation section of your installer with the following:

    yarn global add forever
    yarn global add forever-service
    cd ///xo-server/bin/
    forever-service install orchestra -r -s xo-server

Jarli01 commented 4 years ago

Can you create a PR that can be merged and tested?

Danp2 commented 4 years ago

@Jarli01 Need to first prove that using systemd is truly an issue

Jarli01 commented 4 years ago

Exactly, this is the first time I've heard of this issue.

Jarli01 commented 4 years ago

@jonmike992001

Since you have what appears to be a working solution, please submit a PR so we can test it with your recommended changes.

I've not encountered this issue in the past (with VM's much larger than described) so we need to be very thorough in our testing of this and make sure we're pinpointing the correct problem.

ghost commented 4 years ago

Let me see if I can get a PR for you. Been some time since I've used git to do that.

LPJon commented 4 years ago

Hello, I believe I have two GitHub accounts. I was using the wrong one, "jonmike992001". Please know that I will be submitting the pull request from this account: LPJon

LPJon commented 4 years ago

Ok....I have submitted a pull request. Please advise me if it was incorrectly done.

Danp2 commented 4 years ago

FWIW, I'm trying to replicate this issue with my existing, fully updated, XO VM (Ubuntu 19.10, Node v8.16.0, 2 vCPUs, 2GB). I created a VM with a 500GB drive and filled it with 294GB of data. The export is 85% done, and I haven't observed / encountered any issues. I'll retest with a brand new XO install and report back.

Danp2 commented 4 years ago

Built a new XO VM using the Debian ISO you linked above. Still not seeing the behavior that you've described. Please tell us more about your setup, such as --

Also, how are you ending up with Node 8.17? I got 8.16.2 when I ran the script. Have you tried using this earlier version to see if the issue can be reproduced with it?

LPJon commented 4 years ago

I've installed node 8.17.0 by changing the script to install it, but I have also used 8.16.0, 8.16.1, and 8.16.2. It didn't matter which version was installed; I still had the same result.

Servers (both sets of servers have Broadcom NICs):

  - 3 Dell M710HD's / Intel Xeon CPU E5645 @ 2.40GHz with a 4GB/s Fiber Data Backend (Fiber Channel HBA LVM Storage Repository)
  - 2 HP Proliant DL360 G6 / Intel Xeon CPU E5530 @ 2.40GHz with Chelsio 10GBe NICs directly attached to NFS v4 server (NFS v4 Repository)

Backup Remote: NFS v4

OS: XCP-ng 8.0 on all servers

VM: 4 vCPUs, 8GB RAM, 55GB hard drive. Running with and without guest tools installed produced the same result.

The build with "forever-service" works with both of these setups perfectly.

Please let me know if there is any specific information you would like. I will surely get it for you.

Tip: Watch the memory in terminal by using "systemctl status xo-server" (Your Current xo_install.sh Script Installation). I have also had this happen with an HP Proliant ML350p Gen 8
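
To turn that tip into something that can be logged over the length of a backup job, the memory figure can be sampled on a timer instead of read by eye. This is only a sketch under assumptions: the helper names are hypothetical, it relies on systemd's `MemoryCurrent` property, and it assumes memory accounting is enabled for the unit (otherwise systemd reports "[not set]", or the maximum 64-bit value on older versions).

```shell
#!/bin/sh
# Extract the byte count from a line such as "MemoryCurrent=123456789".
# Returns non-zero when the value is not a usable number, e.g. "[not set]"
# or the all-ones 64-bit value that older systemd prints when unset.
parse_memory_current() {
    value=${1#MemoryCurrent=}
    case $value in
        ''|*[!0-9]*|18446744073709551615) return 1 ;;
        *) printf '%s\n' "$value" ;;
    esac
}

# Convert bytes to whole MiB using integer arithmetic.
bytes_to_mib() {
    echo $(( $1 / 1048576 ))
}

# Print one timestamped sample of the unit's memory use.
# The default unit name "xo-server" is taken from the tip above.
sample_unit_memory() {
    unit=${1:-xo-server}
    line=$(systemctl show "$unit" --property=MemoryCurrent)
    if bytes=$(parse_memory_current "$line"); then
        printf '%s %s MiB\n' "$(date '+%H:%M:%S')" "$(bytes_to_mib "$bytes")"
    else
        printf '%s memory accounting unavailable\n' "$(date '+%H:%M:%S')"
    fi
}

# Usage: log a sample every 10 seconds while a backup job runs.
#   while true; do sample_unit_memory xo-server; sleep 10; done
```

A steadily climbing figure during a backup would support the leak theory, while a flat one would point elsewhere.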

LPJon commented 4 years ago

I also wanted to ask: what's the reasoning behind building the service file? Why not use what's in the instructions for installing XOCE? Was there a specific reason for deciding not to?

Jarli01 commented 4 years ago

IIRC the always running portion wasn't a part of the instructions back when I started this project and we needed a way to make sure XO started at boot etc.

We can certainly re-evaluate it now that it is in the documentation.

Danp2 commented 4 years ago

I wrote this portion of the script and it was designed to mimic the way this was implemented in XOA. Changing it would mean that we would have to also change the update script, and I wouldn't recommend doing this until we can confirm the source of the memory leak.

@LPJon Have you tried a different backup target?

LPJon commented 4 years ago

@Danp2 Yes, I have also tried a local backup target, as well as a local backup target that is actually mounted as NFS v4/v3 by the OS and not by XOCE. I also changed NICs from Broadcom to Intel in several tests to be sure it wasn't NIC related. I can also say that this hardware is not the latest and greatest, so it may be related to older hardware. I have literally spent from November 2019 until February 2020 troubleshooting this problem on all the hardware I've mentioned. What I don't understand is why it doesn't cause any issues for you. The installs were out of the box, nothing special (no desktop, just ssh and system utilities). I tested it with direct-to-hardware installations and VM installations. It's just very, very strange. Can I ask what your test hardware is?

@Jarli01 I see... I was only wondering what the reason was and that makes perfect sense.

Danp2 commented 4 years ago

@LPJon I have a PowerEdge R620 that I use for all my work.

LPJon commented 4 years ago

@Danp2 That's in the same classification of servers I'm testing on. Do you test your storage locally or through a network switch? My setup is making backups through a data network. I might see if I can make a diagram for you....may help?

Danp2 commented 4 years ago

@LPJon I'm backing up to an NFS share on NAS. The server and the NAS are attached to the same switch.

LPJon commented 4 years ago

That's the same setup I'm using. The switch is a Cisco 2960s 1GB switch. Ya know I'm just curious....what version of systemd are you using? There was just a new update released I believe yesterday or the day before when Debian 10.3 came out.
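
For comparing two installs where only one shows the problem, it can help to collect the relevant versions in one place. The sketch below is an assumption, not something from the installer: the function names are hypothetical, and it assumes `systemctl` and `node` are on the PATH.

```shell
#!/bin/sh
# Pull the version number out of `systemctl --version` output, whose first
# line looks like "systemd 241 (241)".
parse_systemd_version() {
    printf '%s\n' "$1" | awk 'NR == 1 { print $2 }'
}

# Print the versions most relevant to this discussion.
print_env_report() {
    printf 'systemd: %s\n' "$(parse_systemd_version "$(systemctl --version)")"
    printf 'node:    %s\n' "$(node --version)"
    printf 'kernel:  %s\n' "$(uname -r)"
}

# Usage: run on both the working and the failing XO host, then compare.
#   print_env_report
```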

Jarli01 commented 4 years ago

The individual packages are going to be system dependent. We always recommend you "stay current" not just with XO-web and XO-server but with your distribution packages as well.

LPJon commented 4 years ago

@Jarli01 That's true and I totally agree. However, I found the solution to my problem just before that update was released. It's possible it may have been fixed in that update. That's the only reason I asked. I'm really not trying to make a stink about all of this and am perfectly fine with you leaving your script alone. My situation might be an edge case for some off-the-wall reason. More than anything, I just wanted to spread awareness. I have no idea why this would only affect our server systems across so many different types of servers. There isn't anything special or complicated about our system. In fact, it is very generic.

I totally agree that staying up to date is a must. Just dumbfounded.....but I guess as long as it's working for us that's what matters.

Jarli01 commented 4 years ago

@LPJon I completely understand and wasn't looking to insinuate or anything of the like.

There could be a myriad of things that can cause issues, and certainly this script could be one of them. We just do our best to factually pinpoint that issue before we push a change. As it is, @Danp2 and I are really the only maintainers of this repo - Dan often does the heavy lifting.

Hopefully if systemd updates fix the issue we don't have to change anything (as it would require we also update the updater script.)

Please let us know.

LPJon commented 4 years ago

@Jarli01 I know that you weren't. Just feeling bad for what seems like a waste of your time and Dan's. I will try to test Debian 10.3 again with your script and see what happens. My last install was Debian 10.3, but with the changes I found to work on Debian 10.2 rather than with your current script. Debian 10.2 worked with your script but would intermittently fail the backup with "Interrupted". Debian 9 would consistently fail with "Interrupted". Ubuntu 16.04 and 18.04 would also do the same. I'm really starting to believe that either I have missed some small detail or that I'm an edge case and that's just the way it is. Please don't waste large amounts of time on this, because I have even found forums where XOA is considering taking the forever project out of the official documentation here, which was back in December of 2019.

Jarli01 commented 4 years ago

No worries, this could be a legitimate issue that needs to be fixed.

It could also be some configuration issue on your switches, servers, storage, vm package, vm network, firewall settings etc.

In short - it is worth at least looking into to see if we can't find a fix or send it upstream to fix a potential issue there.

LPJon commented 4 years ago

@Danp2 Are you the Danp from the XCP-ng forums?

Danp2 commented 4 years ago

@LPJon Yep... that's me. 😉

Jarli01 commented 4 years ago

> @LPJon Yep... that's me. 😉

He's way cuter in person, lol.

Danp2 commented 4 years ago

@Jarli01 👅