NAUbackup / VmBackup

XenServer simple backup script

Failed to find logical volume messages in SMlog and defunct processes #91

Open sniperkitten opened 5 years ago

sniperkitten commented 5 years ago

Hello!

Recently I noticed many defunct processes on XenServer after running NAUbackup:

root 31292 807 0 Jun25 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31331 807 0 Jun25 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31351 807 0 07:12 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31370 807 0 Jun25 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31401 807 0 02:40 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31523 807 0 04:07 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31574 807 0 02:52 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 31577 807 0 Jun25 ? 00:00:00 [LVMSR] &lt;defunct&gt;

They all have the same parent:

root 807 1 0 Jun25 ? 00:04:45 /usr/bin/python /opt/xensource/sm/LVMSR &lt;methodCall&gt;&lt;methodName&gt;vdi_delete&lt;/methodName&gt;&lt;params&gt;&lt;param&gt;&lt;value&gt;&lt;struct&gt;&lt;me
root 862 807 0 Jun25 ? 00:00:00 [LVMSR] &lt;defunct&gt;
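For reference, the zombies can be counted per parent with something like the following (a minimal sketch, assuming standard procps and awk in dom0; nothing NAUbackup-specific):

# group defunct LVMSR children by parent PID and count them
ps -eo pid,ppid,stat,comm | awk '$3 ~ /Z/ && $4 == "LVMSR" { n[$2]++ } END { for (p in n) print "parent PID " p ": " n[p] " zombies" }'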

I found the following errors in SMlog:

Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog"
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] '
Jun 25 13:15:28 xenserver-enginatics1 SM: [27322] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog !
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog"
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] '
Jun 25 13:20:31 xenserver-enginatics1 SM: [30686] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog !
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog"
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] '
Jun 25 15:12:23 xenserver-enginatics1 SM: [666] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/cc4d186a-1846-47eb-9dcb-ea48bccd6d85.cbtlog !
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Failed to find logical volume "VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog"
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] '
Jun 25 15:12:23 xenserver-enginatics1 SM: [855] Ignoring exception for LV check: /dev/VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c/71494603-83b2-403f-83cc-329cadce9d6f.cbtlog !

The following messages, with the same VDI IDs, are from the NAUbackup log:

2019-06-25-(15:12:22) - The following items are about to be destroyed
2019-06-25-(15:12:22) - VM : 2a0bc086-1ef0-f404-3660-a58b83573c6d (RESTORE_EBS r1213a)
2019-06-25-(15:12:22) - VDI: cc4d186a-1846-47eb-9dcb-ea48bccd6d85 (EBS r1213a upgraded OEL 6 root "/" mountpoint)
2019-06-25-(15:12:22) - VDI: 71494603-83b2-403f-83cc-329cadce9d6f (EBS r1213a upgraded OEL 6 /d01 EBS mountpoint)

XenServer and Xen SDK version: 8.0
NAUbackup version: V3.24

Please help me with the investigation.

Thank you!

NAUbackup commented 5 years ago

Did you check to see if the snapshots were correctly removed after the backup process completed? Unfortunately, I have not had a chance to test this thoroughly yet on an 8.0 installation. Did you see anything like this under XS 7.X?

sniperkitten commented 5 years ago

Hi

I checked the snapshots in XenCenter - they were removed correctly. This didn't happen in XenServer 7.6; we upgraded to 8.0 at the beginning of this month.
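For the record, the same check can also be done from the CLI with something like this (a sketch; &lt;vm-uuid&gt; is a placeholder for the VM's UUID):

# any snapshots still left attached to the VM after the backup run?
xe snapshot-list snapshot-of=&lt;vm-uuid&gt; params=uuid,name-label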


NAUbackup commented 5 years ago

Not sure how such an LVM call would have been initiated via VmBackup to result in this. I will try to get a test instance running soon under 8.0 and see if I can replicate this. Sorry this is showing up. Is everything functional otherwise?

sniperkitten commented 5 years ago

Yes, everything else seems to be working fine. Backups finish successfully regardless of this issue. I restored one of the VMs from a recent backup and it started normally.

sniperkitten commented 5 years ago

Hi

Today, after the number of zombie processes reached 10,000, I killed their parent and it worked as a workaround - all the zombies disappeared. Here's the process I killed:

root 807 1 0 Jun25 ? 00:44:32 /usr/bin/python /opt/xensource/sm/LVMSR <methodCall><methodName>vdi_delete</methodName><params><param><value><struct><member><name>host_ref</name><value>OpaqueRef:11a4f521-1bcb-4852-9218-444fadda3540</value></member><member><name>command</name><value>vdi_delete</value></member><member><name>args</name><value><array><data/></array></value></member><member><name>device_config</name><value><struct><member><name>SRmaster</name><value>true</value></member><member><name>device</name><value>/dev/disk/by-id/ata-Micron_5200_MTFDDAK3T8TDC_18531FF5401F</value></member></struct></value></member><member><name>session_ref</name><value>OpaqueRef:5aa061c5-eeda-485d-8786-820f56a6e90e</value></member><member><name>sr_ref</name><value>OpaqueRef:3c102b39-623d-4ee1-8065-4f5e4153d3af</value></member><member><name>sr_uuid</name><value>bac4ccac-5680-915a-6f84-b975379aaa1c</value></member><member><name>vdi_ref</name><value>OpaqueRef:7b5c22d8-6a75-4f55-b1db-387e967aaa3a</value></member><member><name>vdi_location</name><value>cc4d186a-1846-47eb-9dcb-ea48bccd6d85</value></member><member><name>vdi_uuid</name><value>cc4d186a-1846-47eb-9dcb-ea48bccd6d85</value></member><member><name>subtask_of</name><value>DummyRef:|7a421cce-426e-4034-a7b9-7d8f2b194932|VDI.destroy</value></member><member><name>vdi_on_boot</name><value>persist</value></member><member><name>vdi_allow_caching</name><value>false</value></member></struct></value></param></params></methodCall>
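For anyone else hitting this, the workaround amounts to something like the following (a rough sketch only - verify the PID yourself before killing anything; the zombies are reaped once the stuck parent exits):

# locate the stuck LVMSR vdi_delete parent (PPID 1, not defunct itself)
ps -eo pid,ppid,args | grep '[L]VMSR' | grep vdi_delete | grep -v defunct
# then terminate it by PID (807 in my case); start with a plain SIGTERM
kill &lt;pid-of-stuck-parent&gt;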

NAUbackup commented 5 years ago

It seems the parent process doesn't terminate normally, which exacerbates the issue. I would be curious whether the same thing happens if you run another backup. I do not see anything like this under 7.X, so it is clearly related to changes in 8.0.

sniperkitten commented 5 years ago

Hi

I reproduced the issue after getting rid of the zombie processes. The first backup completed normally, so I initiated a second one. The snapshot was deleted successfully, yet this process was left hanging along with its children:

root 18663 1 0 08:43 ? 00:00:04 /usr/bin/python /opt/xensource/sm/LVMSR &lt;methodCall&gt;&lt;methodName&gt;vdi_delete&lt;/methodName&gt;&lt;params&gt;&lt;param&gt;&lt;value&gt;&lt;struct&gt;&lt;member&gt;&lt;name&gt;host_ref&lt;/name&gt;&lt;value&gt;OpaqueRef:11a4f521-1bcb-4852-9218-444fadda3540&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;command&lt;/name&gt;&lt;value&gt;vdi_delete&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;args&lt;/name&gt;&lt;value&gt;&lt;array&gt;&lt;data/&gt;&lt;/array&gt;&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;device_config&lt;/name&gt;&lt;value&gt;&lt;struct&gt;&lt;member&gt;&lt;name&gt;SRmaster&lt;/name&gt;&lt;value&gt;true&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;device&lt;/name&gt;&lt;value&gt;/dev/disk/by-id/ata-Micron_5200_MTFDDAK3T8TDC_18531FF5401F&lt;/value&gt;&lt;/member&gt;&lt;/struct&gt;&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;session_ref&lt;/name&gt;&lt;value&gt;OpaqueRef:5e3e4027-3a05-4ff6-889d-4d1f9fb2b79b&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;sr_ref&lt;/name&gt;&lt;value&gt;OpaqueRef:3c102b39-623d-4ee1-8065-4f5e4153d3af&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;sr_uuid&lt;/name&gt;&lt;value&gt;bac4ccac-5680-915a-6f84-b975379aaa1c&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;vdi_ref&lt;/name&gt;&lt;value&gt;OpaqueRef:06fa189d-1500-4bae-a008-3d1ad0a11b57&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;vdi_location&lt;/name&gt;&lt;value&gt;03d9f8ee-43b9-4583-bf42-0194a86acbe3&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;vdi_uuid&lt;/name&gt;&lt;value&gt;03d9f8ee-43b9-4583-bf42-0194a86acbe3&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;subtask_of&lt;/name&gt;&lt;value&gt;DummyRef:|53937de2-c14b-4ed4-bb20-1a93802e371f|VDI.destroy&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;vdi_on_boot&lt;/name&gt;&lt;value&gt;persist&lt;/value&gt;&lt;/member&gt;&lt;member&gt;&lt;name&gt;vdi_allow_caching&lt;/name&gt;&lt;value&gt;false&lt;/value&gt;&lt;/member&gt;&lt;/struct&gt;&lt;/value&gt;&lt;/param&gt;&lt;/params&gt;&lt;/methodCall&gt;
root 19245 18663 0 09:01 ? 00:00:00 [LVMSR] &lt;defunct&gt;
root 21032 18663 0 09:02 ? 00:00:00 [LVMSR] &lt;defunct&gt;

I killed it again and the zombie processes disappeared as well.

sniperkitten commented 4 years ago

Hi

Is there any news?

Thank you

NAUbackup commented 4 years ago

I am having issues getting the NFS network connected to my test servers and need to work with our network operations group. Until that's rectified, I won't have a CH 8.0 system I can test with. Sorry it's taking this long.

BogdanRudas commented 4 years ago

We are using our own backup scripts to export VM snapshots, but we faced the same issue. Snapshots look deleted in XenCenter, but the LVMSR space is not freed. This applies to XenServer with XS71ECU2015 and most probably to Citrix Hypervisor 8. We have no evidence of this in XCP-ng 7.6 (but we probably need to deploy XCP-ng 8, or just update our 7.6 installations, which are not as big as our licensed Citrix installations).

NAUbackup commented 4 years ago

Make sure you don't have too many snapshots already and that there is enough room to take an additional snapshot. If the SR is too full, the coalesce process will sometimes also fail to work. You can try manually rescanning the SR, or delete snapshots manually first to ensure there is enough free space. Worst case, you may have to shuffle VMs around to free up some space and potentially force the coalescing to kick in.
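For example, roughly (a hedged sketch; substitute your own SR UUID for &lt;sr-uuid&gt;):

# how full is the SR really?
xe sr-param-get uuid=&lt;sr-uuid&gt; param-name=physical-utilisation
xe sr-param-get uuid=&lt;sr-uuid&gt; param-name=physical-size
# review/remove stale snapshots, then rescan to give the GC a chance to run
xe snapshot-list params=uuid,name-label,snapshot-time
xe sr-scan uuid=&lt;sr-uuid&gt;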

BogdanRudas commented 4 years ago

I know about those restrictions. In my case I have at least triple the VM size free on that SR. The process tree looks like this:

systemd --switched-root --system --deserialize 21
 |-LVMSR /opt/xensource/sm/LVMSR &lt;methodCall&gt;&lt;methodName&gt;vdi_delete&lt;/methodName&gt;...
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)
 |   |-(LVMSR)

Looking through our monitoring, it seems the issue was caused by some update I applied on XenServer 7.1 before July 28th. Since then it has been triggering alerts from time to time, and the maximum number of processes I've seen in monitoring was 3k+ (and I haven't finished reviewing the monitoring data yet). I submitted a case to Citrix. Once I killed that parent process, all the zombies disappeared. I'm sorry for discussing this here, but I think it is a XenServer issue and not related to NAUbackup itself.

sniperkitten commented 4 years ago

Thank you for the feedback, Bogdan! It really looks like a XenServer problem. Can you please share the solution from Citrix if they manage to resolve it?

sniperkitten commented 4 years ago

Bogdan, did you find a way to manually free the space as a workaround? If yes, please share. Thank you.

BogdanRudas commented 4 years ago

There are two ways:

  1. Gracefully move all VMs to another host using Storage XenMotion, reboot, then do an sr-scan.
  2. Check pstree, kill the parent LVMSR vdi_delete process (the zombies then disappear), and finally do an sr-scan. I'm not sure whether this approach is safe, but it worked for me a while ago.

Finally, have a look at xe vdi-list and clean up orphaned VDIs (see the sketch below). There are many discussions about this on Google and especially on the Citrix forums.
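Something along these lines (a sketch only - verify every UUID before destroying anything; &lt;sr-uuid&gt; and &lt;vdi-uuid&gt; are placeholders):

# list the VDIs on the SR with the fields that matter for spotting orphans
xe vdi-list sr-uuid=&lt;sr-uuid&gt; params=uuid,name-label,is-a-snapshot,managed
# a candidate orphan has no VBDs and is not a snapshot or template
xe vbd-list vdi-uuid=&lt;vdi-uuid&gt;
# only once you are certain it is orphaned:
xe vdi-destroy uuid=&lt;vdi-uuid&gt;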

sniperkitten commented 4 years ago

Actually, sr-scan also hung and caused zombie processes. I will try to find the orphaned VDIs.

NAUbackup commented 4 years ago

Unfortunately, the coalesce_leaf command was removed from later versions of XenServer/Citrix Hypervisor, and hence the clean-up of orphaned VDIs is now by and large a manual process. Moving VMs to other SRs and rescanning, or even reinitializing the original SR, is indeed one option. When manually deleting VDIs, make sure they're not a template or some other form of VDI that may not necessarily have an associated parent.

BogdanRudas commented 4 years ago

I've found some correlation for my case: the issue with the number of zombies started when I ran the backup for the first time after installing the following patches (installed the same day): XS71ECU2009, XS71ECU2010, XS71ECU2011.

I also found that the issue frequently reproduces on our Elasticsearch virtual machines. Some other types of VMs are affected too, but ELK, with its moderate but constant load, is a good debug sample.

I will re-image one host, apply patches up to XS71ECU2008, put an ELK VM on it, and see if the zombies appear again. Then I will start applying patches one by one until LVMSR breaks again.

Regarding Citrix support, I was told to try some product from https://citrixready.citrix.com/category-results.html?search=backup. I will go with XOA against a fully patched XS71CU2 on HCL-approved hardware. I bet any other backup software will do nearly the same as mine, so I should be able to reproduce the issue on the ELK VM too.

BogdanRudas commented 4 years ago

Well, I was able to reproduce the issue using "Full backup" from XOA and even using Storage XenMotion from one server to another. I'm going to reinstall XenServer 7.1 and then apply patches one by one.

As a quick and dirty hack, I pause all VMs on the host (much faster than suspend) and wait ~15-30 minutes until coalescing does its job. It is possible to narrow the list of VMs to be paused by reviewing the VHD tree for a particular VM; see the syntax here: https://support.citrix.com/article/CTX139224
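In shell terms the hack looks roughly like this (a sketch, assuming every running guest on the host can tolerate being frozen for the duration):

# pause all running guests (control domain excluded), let the GC coalesce, then resume
for uuid in $(xe vm-list power-state=running is-control-domain=false params=uuid --minimal | tr ',' ' '); do
  xe vm-pause uuid=$uuid
done
# wait ~15-30 minutes, watching /var/log/SMlog for the coalesce to finish
for uuid in $(xe vm-list power-state=paused is-control-domain=false params=uuid --minimal | tr ',' ' '); do
  xe vm-unpause uuid=$uuid
done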

NAUbackup commented 4 years ago

Note that with a vhd-util scan, just because no parent is found doesn't necessarily mean the VDI is an orphan; it might just be a template. Hence, one should be careful about deleting any instances.

sniperkitten commented 4 years ago

Thank you for the effort, Bogdan!

I followed your advice to get rid of the VDIs causing the issue.

1) I identified VMs that had hidden VDIs:

[root@xenserver1 EBS demo (12.2.8)]# vhd-util scan -f -m "VHD-*" -l VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c -p

vhd=VHD-34060004-645d-4a4e-84dd-832f058e5adf capacity=107374182400 size=40647000064 hidden=1 parent=none
vhd=VHD-b39ca920-73fe-499a-83dd-d3de8a8d5b67 capacity=107374182400 size=107592286208 hidden=0 parent=VHD-34060004-645d-4a4e-84dd-832f058e5adf

2) Then I stopped the VMs and copied them to the same SR, as it had enough space.

3) After starting the copied VM and a health check, I removed the source VM. The hidden VDIs were gone.
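For anyone repeating step 1: on an LVM-based SR the VHD-&lt;uuid&gt; names reported by vhd-util are the VDI UUIDs, so a leaf can be mapped back to its VM with xe (a sketch using the visible leaf from my output above):

# which VDI / VM does the visible leaf belong to?
xe vdi-list uuid=b39ca920-73fe-499a-83dd-d3de8a8d5b67 params=uuid,name-label,sr-uuid
xe vbd-list vdi-uuid=b39ca920-73fe-499a-83dd-d3de8a8d5b67 params=vm-name-label,device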

I'm not performing backups for now, as I don't want the issue to recur.

NAUbackup commented 4 years ago

Have you considered upgrading to 3.25 and trying with a very small subset of copied VMs, perhaps on a non-production SR? Some issues with snapshots can arise if the snapshot chain is too long or there are space issues on the SR on which the VMs reside.

sniperkitten commented 4 years ago

I was already on 3.25 and still having issues, so I will wait for further news from Bogdan.

Also, it looks like VmBackup.py 3.25 still carries the older version string inside; can you please check:

NAUVmBackup/VmBackup.py

V3.24 June 2018

NAUbackup commented 4 years ago

@sniperkitten - fixed in-line version number and comments in source code. Thanks for pointing that out!

sniperkitten commented 4 years ago

Hi, is there any news regarding this issue? Thank you.

NAUbackup commented 4 years ago

Sorry, I have not had any time to investigate this further. In general, orphans like this result from a lack of space in which cleanup is possible, hence the suggestion to test with a small subset of VMs and plenty of storage space to work with. This is also possibly a fundamental XS/CH issue, as orphaned VDIs have been a problem for years.

BogdanRudas commented 4 years ago

Hello!

I've checked the issue against patches up to XS71ECU2016 and will have another call with Citrix. I hope it will be with the next level of their support.

I've reproduced this with Xen Orchestra as well (because Citrix wants to discuss only "Citrix Ready" solutions).

BogdanRudas commented 4 years ago

I've just finished a call with support regarding our case. From this call I understand that they have a couple of clients with a similar problem and will escalate the cases further. I hope there will be a hotfix one day.

sniperkitten commented 4 years ago

Thank you so much Bogdan for sharing this!


BogdanRudas commented 4 years ago

I've been testing a private hotfix from Citrix since last week.

sniperkitten commented 4 years ago

Good news! Did it work for you?

BogdanRudas commented 4 years ago

Please try this: https://support.citrix.com/article/CTX262019

I was given a private hotfix that fixed the zombies, but the coalescing process that runs for many days is still an issue.

sniperkitten commented 4 years ago

Thank you for the update, Bogdan! It looks like the update is not available for my company's account, so I have to wait for the fix to be made publicly available. Also, it's for version 7.1; I hope Citrix will include it in some public update for Hypervisor 8.0.

BogdanRudas commented 4 years ago

@sniperkitten I guess they will release a fix in CH8.1

sniperkitten commented 4 years ago

We installed CH 8.1 and it seems to have fixed the issue with zombie processes. Thanks everyone for the help.

BogdanRudas commented 4 years ago

Did it improve the coalescing process itself, or just fix the zombies (like the update for XS 7.1)?

sniperkitten commented 4 years ago

I tested it, and the 8.1 update only fixed the zombies. I created test snapshots and deleted them right away. For relatively small disks (~100 GB) coalescing happened successfully. However, for one 300 GB VDI I found the following error in /var/log/SMlog:

SMGC: [11948] SR bac4 ('SDD 4TB') (14 VDIs in 12 VHD trees): showing only VHD trees that changed:
SMGC: [11948] eea5e8a0VHD
SMGC: [11948]     0b1a110dVHD
SMGC: [11948]
SMGC: [11948] ~~~~~~~~~~~~~~~~~~~~
SMGC: [11948]
SMGC: [11948] E X C E P T I O N
SMGC: [11948]
SMGC: [11948] gc: EXCEPTION &lt;class 'SR.SROSError'&gt;, Gave up on leaf coalesce after leaf grew bigger than before snapshot taken [opterr=VDI=0b1a110dVHD]
SMGC: [11948]   File "/opt/xensource/sm/cleanup.py", line 2769, in gc
SMGC: [11948]     _gc(None, srUuid, dryRun)
SMGC: [11948]   File "/opt/xensource/sm/cleanup.py", line 2654, in _gc
SMGC: [11948]     _gcLoop(sr, dryRun)
SMGC: [11948]   File "/opt/xensource/sm/cleanup.py", line 2636, in _gcLoop
SMGC: [11948]     sr.coalesceLeaf(candidate, dryRun)
SMGC: [11948]   File "/opt/xensource/sm/cleanup.py", line 1586, in coalesceLeaf
SMGC: [11948]     self._coalesceLeaf(vdi)
SMGC: [11948]   File "/opt/xensource/sm/cleanup.py", line 1789, in _coalesceLeaf
SMGC: [11948]     raise xs_errors.XenError('LeafGCSkip', opterr='VDI=%s' % vdi)
SMGC: [11948]
SMGC: [11948] ~~~~~~~~~~~~~~~~~~~~
SMGC: [11948] SR bac4ccac-5680-915a-6f84-b975379aaa1c: ERROR

I used this command to find the leaves:

vhd-util scan -f -m "VHD-*" -l VG_XenStorage-bac4ccac-5680-915a-6f84-b975379aaa1c -p

vhd=VHD-bda8bdc2-d549-4b77-b460-63bc23219d61 capacity=10737418240 size=10368319488 hidden=1 parent=none
vhd=VHD-c987bb4d-3d47-4412-b0cc-fd5e1535d505 capacity=10737418240 size=10766778368 hidden=0 parent=VHD-bda8bdc2-d549-4b77-b460-63bc23219d61
vhd=VHD-eea5e8a0-32a4-4b07-9dfa-d12a35410f66 capacity=311385128960 size=304514859008 hidden=1 parent=none
vhd=VHD-0b1a110d-b3a8-47ec-b7ec-2b4ef37cda1a capacity=311385128960 size=312001691648 hidden=0 parent=VHD-eea5e8a0-32a4-4b07-9dfa-d12a35410f66

I tried your fix, Bogdan, from https://github.com/xcp-ng/xcp/issues/298 and set the following parameters in /opt/xensource/sm/cleanup.py:

LIVE_LEAF_COALESCE_MAX_SIZE = 1024 * 1024 * 1024
LIVE_LEAF_COALESCE_TIMEOUT = 300

After rerunning sr-scan, coalescing completed successfully. Then I created and deleted a snapshot again, and coalescing completed successfully that time as well.
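For anyone else trying this, the rough procedure was (a sketch; back up the file first - these constants live in Citrix's storage-manager code, so a host update may overwrite the change):

# keep a copy of the original, locate the two constants, then edit them as shown above
cp /opt/xensource/sm/cleanup.py /opt/xensource/sm/cleanup.py.orig
grep -n 'LIVE_LEAF_COALESCE' /opt/xensource/sm/cleanup.py
# re-trigger the garbage collector on the affected SR afterwards
xe sr-scan uuid=bac4ccac-5680-915a-6f84-b975379aaa1c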

NAUbackup commented 4 years ago

I wonder if it's related to this issue: https://bugs.xenserver.org/browse/XSO-966