Virtual Router system VM is not starting

apache / cloudstack

Apache CloudStack is an open-source Infrastructure as a Service (IaaS) cloud computing platform.

https://cloudstack.apache.org/

Apache License 2.0

Virtual Router system VM is not starting #6842

Closed canghai908 closed 1 year ago

canghai908 commented 1 year ago

ISSUE TYPE

Bug Report

COMPONENT NAME

Virtual Router VM is not starting

CLOUDSTACK VERSION

OS: CentOS 7.9, CloudStack: 4.17.0

CONFIGURATION

OS / ENVIRONMENT

Ceph 15.6, CloudStack 4.17.0, libvirt-4.5.0-36.el7_9.5.x86_64

SUMMARY

STEPS TO REPRODUCE

After I rebooted the KVM hypervisor, the virtual router system VM no longer starts. The other system VMs come up fine, but the virtual router system VM hangs, as seen when connecting to its console over VNC.

weizhouapache commented 1 year ago

@canghai908 are the system VMs running on the same Ceph storage pool?

canghai908 commented 1 year ago

Yes, the system VMs are running on the same Ceph storage pool.

weizhouapache commented 1 year ago

@canghai908

Check whether cloudstack-agent and the management server have all been upgraded to the same version (4.17.0).
Then create a new network and check again.

canghai908 commented 1 year ago

@weizhouapache cloudstack-agent and the management server are the same version (4.17.0). I have now updated both to 4.17.1.0 and the virtual router system VM works, but some of my VMs hang like this. As of this morning I don't know whether it is a CloudStack problem or a Ceph problem.

canghai908 commented 1 year ago

@weizhouapache My problem is solved! It was caused by Ceph's locks; see this article: https://www.cnblogs.com/zphj1987/p/14155644.html
I would like to know whether CloudStack needs the exclusive-lock feature of RBD and, if not, whether I can disable that feature to keep this problem from happening again.

weizhouapache commented 1 year ago

@canghai908 Good! Thanks for the update and for sharing.

As for your question, maybe @wido can help.

wido commented 1 year ago

We should not disable exclusive locking, as it is an important Ceph feature that prevents data corruption.

Who is holding the lock, @canghai908? Why is another client locking this image on Ceph?
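
For readers retracing this question, here is a minimal sketch of how the lock holder could be inspected from a host with Ceph client access. It only wraps two `rbd` subcommands: `rbd lock ls`, which lists lockers, and `rbd status`, which lists watchers. The pool name `cloudstack` and the image name in the usage note are hypothetical placeholders, not values from this issue.

```python
import subprocess
from typing import List


def rbd_inspect_commands(pool: str, image: str) -> List[List[str]]:
    """Build the rbd commands that show who holds locks on, or watches, an image."""
    spec = f"{pool}/{image}"
    return [
        ["rbd", "lock", "ls", spec],  # lockers: client name, lock ID, client address
        ["rbd", "status", spec],      # watchers: clients that currently have the image open
    ]


def run_commands(commands: List[List[str]]) -> None:
    """Run each command and print its output; requires the rbd CLI and a valid keyring."""
    for argv in commands:
        print("$", " ".join(argv))
        result = subprocess.run(argv, capture_output=True, text=True, check=False)
        print(result.stdout or result.stderr)
```

Usage would be something like `run_commands(rbd_inspect_commands("cloudstack", "example-volume-uuid"))`. If the address shown next to the lock belongs to a host that is known to be down, the lock is probably stale.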

canghai908 commented 1 year ago

@wido The VM images are on Ceph. The image was locked by a CloudStack compute node. The lock was left behind because of network issues and an unexpected restart of the hypervisor host. After I removed the image lock, the virtual machine started normally.
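
The manual removal described above can be sketched roughly as follows. This is an illustration, not the exact procedure used in this issue: it builds an `rbd lock rm <image-spec> <lock-id> <locker>` command and refuses to proceed unless the caller states that the old host is confirmed down, since removing a lock from a live client risks data corruption. The function names, the `host_confirmed_down` flag, and the values in the usage note are all hypothetical; the lock ID and locker would be copied from `rbd lock ls` output.

```python
import subprocess
from typing import List


def rbd_lock_rm_command(pool: str, image: str, lock_id: str, locker: str) -> List[str]:
    """Build `rbd lock rm <pool>/<image> <lock-id> <locker>` as an argv list."""
    # The lock ID may contain a space (e.g. "auto 1234"); keeping it as one
    # argv element avoids any shell quoting problems.
    return ["rbd", "lock", "rm", f"{pool}/{image}", lock_id, locker]


def remove_stale_lock(pool: str, image: str, lock_id: str, locker: str,
                      host_confirmed_down: bool, dry_run: bool = True) -> List[str]:
    """Remove a stale lock, but only after the operator confirms the old host is down."""
    if not host_confirmed_down:
        raise ValueError("refusing to remove the lock: old host is not confirmed down")
    argv = rbd_lock_rm_command(pool, image, lock_id, locker)
    if not dry_run:
        subprocess.run(argv, check=True)  # requires the rbd CLI and a valid keyring
    return argv
```

With the default `dry_run=True`, a call such as `remove_stale_lock("cloudstack", "example-volume-uuid", "auto 1234", "client.4485", host_confirmed_down=True)` only returns the command so it can be reviewed before anything is changed.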

wido commented 1 year ago

Was the VM running on that compute node before? Did it crash?

The exclusive lock should time out after a couple of minutes, after which you can start the VM on a different host.

canghai908 commented 1 year ago

@wido

> Was the VM running on that compute node before? Did it crash?

Yes. The VM was running on that compute node; the node crashed and the Ceph cluster network also had a problem. In my case the lock did not time out. After I removed the lock manually, everything was fine. Where is the lock timeout configured?

wido commented 1 year ago

How long did you wait? And was the other node really down?

The exclusive lock should be handed over to another node if the old one goes down. If that doesn't happen, Ceph blocks because there is a potential risk of data corruption.

canghai908 commented 1 year ago

@wido

> How long did you wait? And was the other node really down?

I waited 12 hours. The other node really was down; it went down unexpectedly.

wido commented 1 year ago

That is odd; the exclusive lock should time out and be handed over to the other client.

If that doesn't work, something is wrong, but exclusive locks should not be disabled.

That said, this seems like a Ceph issue and not a CloudStack issue.

Do your Ceph clients have the proper cephx capabilities in Ceph? See: https://docs.ceph.com/en/latest/rbd/rbd-exclusive-locks/

> In order for blocklisting to work, the client must have the `osd blocklist` capability. This capability is included in the `profile rbd` capability profile, which should generally be set on all Ceph [client identities](https://docs.ceph.com/en/latest/rados/operations/user-management/#user-management) using RBD.
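
As a rough aid for the capability check suggested above, the following sketch applies a simple heuristic to a client's `mon` caps string, as shown by `ceph auth get client.<name>`. It treats `profile rbd`, `allow *`, or an explicit `allow command "osd blocklist"` (spelled `osd blacklist` on older releases such as Octopus) as sufficient. The function name and the sample caps are hypothetical, and this is a string check only, not a substitute for Ceph's own capability evaluation.

```python
from typing import Dict


def mon_caps_allow_blocklist(caps: Dict[str, str]) -> bool:
    """Heuristically decide whether the mon caps let a client blocklist a dead lock owner."""
    mon = caps.get("mon", "")
    for token in (part.strip() for part in mon.split(",")):
        # Exact match, so that "profile rbd-read-only" is not mistaken for "profile rbd".
        if token in ("profile rbd", "allow *"):
            return True
        # Explicit grant of the blocklist command (older releases call it "blacklist").
        if token.startswith("allow command") and (
            "osd blocklist" in token or "osd blacklist" in token
        ):
            return True
    return False


# Sample caps, transcribed by hand from `ceph auth get` output (hypothetical values).
good = {"mon": "profile rbd", "osd": "profile rbd pool=cloudstack"}
bad = {"mon": "allow r", "osd": "allow rwx pool=cloudstack"}
print(mon_caps_allow_blocklist(good), mon_caps_allow_blocklist(bad))
```

If the check fails, the Ceph documentation's usual form for RBD clients is `ceph auth caps client.<name> mon 'profile rbd' osd 'profile rbd pool=<pool>'`. Note that `ceph auth caps` replaces the client's existing capabilities rather than adding to them.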

canghai908 commented 1 year ago

@wido Thank you! I will check the Ceph cluster.