apache / cloudstack

Apache CloudStack is an open source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

SystemVM agent seen in Disconnected or Alert state on XenServer #3853

Closed rohityadavcloud closed 4 years ago

rohityadavcloud commented 4 years ago

On slow or resource-constrained XenServer environments, when the SSVM/CPVM starts, TLS certificates are provisioned via the default root CA provider, which may sometimes fail during the initial setup or fail because there is not enough system entropy. The agent then fails to connect and may be stuck in the Disconnected/Alert state, and this would be seen:

Screenshot from 2020-01-30 13-20-05

The management server logs would report the SSVM/CPVM client was presenting invalid certificates, for example:

2020-01-30 07:31:43,843 ERROR [c.c.u.n.Link] (AgentManager-SSLHandshakeHandler-165:null) (logid:) SSL error caught during wrap data: Empty server certificate chain, for local address=/10.2.3.131:8250, remote address=/10.2.8.51:39178.
2020-01-30 07:31:43,858 INFO  [c.c.a.m.AgentManagerImpl] (AgentManager-Handler-2:null) (logid:) Connection from /10.2.8.51 closed but no cleanup was done.
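
To check how often a given system VM hits this handshake error, the management server log can be scanned for the message above. The following is a minimal sketch, not part of CloudStack: the function name `failed_handshake_ips` is made up for illustration, and the pattern assumes the log wording is exactly as in the lines quoted here.

```python
import re

# Pattern built from the error line quoted above; the wording is assumed
# to match that line and may differ in other CloudStack versions.
HANDSHAKE_ERROR = re.compile(
    r"SSL error caught during wrap data: Empty server certificate chain, "
    r"for local address=/(?P<local>[\d.]+:\d+), "
    r"remote address=/(?P<remote_ip>[\d.]+):(?P<remote_port>\d+)"
)


def failed_handshake_ips(lines):
    """Count empty-certificate-chain handshake errors per remote IP."""
    counts = {}
    for line in lines:
        match = HANDSHAKE_ERROR.search(line)
        if match:
            ip = match.group("remote_ip")
            counts[ip] = counts.get(ip, 0) + 1
    return counts


# The two log lines from this report, used as sample input.
sample = [
    "2020-01-30 07:31:43,843 ERROR [c.c.u.n.Link] "
    "(AgentManager-SSLHandshakeHandler-165:null) (logid:) SSL error caught "
    "during wrap data: Empty server certificate chain, for local "
    "address=/10.2.3.131:8250, remote address=/10.2.8.51:39178.",
    "2020-01-30 07:31:43,858 INFO  [c.c.a.m.AgentManagerImpl] "
    "(AgentManager-Handler-2:null) (logid:) Connection from /10.2.8.51 "
    "closed but no cleanup was done.",
]

print(failed_handshake_ips(sample))  # {'10.2.8.51': 1}
```

A remote IP that shows up repeatedly here would be the SSVM/CPVM whose agent is presenting an empty certificate chain.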

Note: the issue is not always reproducible.

ISSUE TYPE
COMPONENT NAME
SSVM, CPVM
CLOUDSTACK VERSION
4.14/master with JDK11

DaanHoogland commented 4 years ago

@rhtyd this sounds like it is not JDK-specific. Have we never seen this with JDK 8?

rohityadavcloud commented 4 years ago

You're right @DaanHoogland, but a combination of things could explain why this is only sometimes reproducible, and there are no concrete facts to blame JDK 11 yet. One workaround I've done in #3601 is to stop the CloudStack agent before the key/crt is set up and start it after it is imported into the keystore, which should reduce some CPU contention.
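
The ordering of that workaround can be sketched roughly as below. This is only an illustration of the sequence described in the comment, not the actual change in #3601: the function name and the three callables are hypothetical placeholders for whatever stops the agent, sets up the key/crt and imports it into the keystore, and starts the agent again.

```python
def provision_with_agent_stopped(stop_agent, setup_certificate, start_agent):
    """Run certificate setup while the agent is stopped, then start it.

    All three arguments are caller-supplied callables (hypothetical here).
    If setup_certificate raises, the agent is deliberately left stopped so
    that it does not reconnect with an incomplete keystore.
    """
    stop_agent()
    setup_certificate()
    start_agent()


# Usage with stand-in steps that only record the order in which they ran.
steps = []
provision_with_agent_stopped(
    lambda: steps.append("stop agent"),
    lambda: steps.append("set up key/crt and import into keystore"),
    lambda: steps.append("start agent"),
)
print(steps)
```

Keeping the agent stopped while the certificate is generated means the two are not competing for CPU on a slow system VM, which is the contention the workaround is meant to reduce.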

rohityadavcloud commented 4 years ago

Not seen again.

alexandru-bagu commented 2 years ago

This is still happening every once in a while when creating system VMs (console or storage). Hypervisor: XCP-ng 8.20, CS 4.16.0.

rohityadavcloud commented 2 years ago

Hi @alexandru-bagu, is this a test env running in a nested env (for example, running XCP-ng/XenServer as VMs on some other hypervisor)? You can try tuning (increasing) the ping.interval and ping.timeout global settings.

alexandru-bagu commented 2 years ago

Not a test environment; it's my company's live environment, nothing nested. As for the ping interval/timeout, I don't believe that will help, because what usually happens is that it tries 100 times to connect to the agent (in the management logs I see something like "attempt x of 100 to connect failed", as well as the logs you presented in the initial post). After the 100 attempts fail (which takes up to 10 minutes or so), the system VM is recreated.

This issue doesn't happen all the time. Sometimes system VMs just work; other times they need to be recreated a few times.
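
The retry behaviour described in this comment (100 attempts over roughly 10 minutes, so about 6 seconds per attempt) can be sketched as a bounded retry loop. This is not CloudStack's implementation: the function name, the `try_connect` callable and the 6-second delay are assumptions derived only from the numbers mentioned above.

```python
import time


def connect_with_retries(try_connect, max_attempts=100, delay_seconds=6.0,
                         sleep=time.sleep):
    """Call try_connect until it returns True or attempts run out.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed (the point at which the system VM would be recreated).
    The delay is an assumption: 10 minutes / 100 attempts = 6 seconds.
    """
    for attempt in range(1, max_attempts + 1):
        if try_connect():
            return attempt
        if attempt < max_attempts:
            sleep(delay_seconds)  # wait before the next attempt
    return None


# Usage: a stand-in connection check that succeeds on the third call.
calls = {"n": 0}


def flaky_connect():
    calls["n"] += 1
    return calls["n"] >= 3


# sleep is replaced with a no-op so the example runs instantly.
print(connect_with_retries(flaky_connect, sleep=lambda seconds: None))  # 3
```

With these assumed numbers the worst case is 99 waits of 6 seconds, just under 10 minutes, which matches the delay reported before a system VM is recreated.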