apache / cloudstack

Apache CloudStack is an open source Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0

After upgrade from 4.18.0 to 4.18.1 cloudstack-agent not starting #8604

Closed yashi4engg closed 3 months ago

yashi4engg commented 8 months ago
SUMMARY

We are trying to upgrade from 4.18.0 to 4.18.1. We have upgraded the management node, and it is up with systemVM version 4.18.1. While upgrading the hypervisors, cloudstack-agent does not start after the package upgrade. Below are the logs:

2024-02-02 14:26:39,507 INFO [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.1.0
2024-02-02 14:26:39,508 INFO [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2024-02-02 14:26:39,546 INFO [cloud.agent.AgentShell] (main:null) (logid:) Defaulting to using properties file for storage
2024-02-02 14:26:39,546 INFO [cloud.agent.AgentShell] (main:null) (logid:) Defaulting to the constant time backoff algorithm
2024-02-02 14:26:39,580 INFO [cloud.utils.LogUtils] (main:null) (logid:) log4j configuration found at /etc/cloudstack/agent/log4j-cloud.xml
2024-02-02 14:26:39,581 INFO [cloud.agent.AgentShell] (main:null) (logid:) Using default Java settings for IPv6 preference for agent connection
2024-02-02 14:26:39,655 INFO [cloud.agent.Agent] (main:null) (logid:) id is 0
2024-02-02 14:26:39,665 ERROR [kvm.resource.LibvirtComputingResource] (main:null) (logid:) uefi properties file not found due to: Unable to find file uefi.properties.
2024-02-02 14:26:39,706 INFO [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Failed to find passphrase for keystore: cloud.jks
2024-02-02 14:26:39,709 INFO [kvm.resource.LibvirtConnection] (main:null) (logid:) No existing libvirtd connection found. Opening a new one
2024-02-02 14:26:39,799 WARN [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Ignoring libvirt error.
org.libvirt.LibvirtException: Network not found: no network with matching name 'default'
    at org.libvirt.ErrorHandler.processError(Unknown Source)
    at org.libvirt.ErrorHandler.processError(Unknown Source)
    at org.libvirt.Connect.networkLookupByName(Unknown Source)
    at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.configure(LibvirtComputingResource.java:1081)
    at com.cloud.agent.Agent.(Agent.java:190)
    at com.cloud.agent.AgentShell.launchNewAgent(AgentShell.java:452)
    at com.cloud.agent.AgentShell.launchAgentFromClassInfo(AgentShell.java:431)
    at com.cloud.agent.AgentShell.launchAgent(AgentShell.java:415)
    at com.cloud.agent.AgentShell.start(AgentShell.java:511)
    at com.cloud.agent.AgentShell.main(AgentShell.java:541)
2024-02-02 14:26:39,916 INFO [kvm.resource.LibvirtComputingResource] (main:null) (logid:) IO uring driver for Qemu: disabled
2024-02-02 14:26:39,977 INFO [kvm.storage.KVMStoragePoolManager] (main:null) (logid:) adding storage adaptor for com.cloud.hypervisor.kvm.storage.LinstorStorageAdaptor
2024-02-02 14:26:39,980 INFO [kvm.storage.KVMStoragePoolManager] (main:null) (logid:) adding storage adaptor for com.cloud.hypervisor.kvm.storage.StorPoolStorageAdaptor
2024-02-02 14:26:39,980 WARN [kvm.storage.KVMStoragePoolManager] (main:null) (logid:) Duplicate StorageAdaptor type PowerFlex, not loading com.cloud.hypervisor.kvm.storage.ScaleIOStorageAdaptor
2024-02-02 14:26:39,980 INFO [kvm.storage.KVMStoragePoolManager] (main:null) (logid:) adding storage adaptor for com.cloud.hypervisor.kvm.storage.IscsiAdmStorageAdaptor
2024-02-02 14:26:39,981 INFO [kvm.resource.LibvirtComputingResource] (main:null) (logid:) No libvirt.vif.driver specified. Defaults to BridgeVifDriver.
2024-02-02 14:26:40,116 INFO [cloud.serializer.GsonHelper] (main:null) (logid:) Default Builder inited.
2024-02-02 14:26:40,116 INFO [kvm.resource.LibvirtComputingResource] (main:null) (logid:) iscsi session clean up is disabled
2024-02-02 14:26:40,118 INFO [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Skipping the memory balloon stats period setting, since there are no VMs (active Libvirt domains) on this host.
2024-02-02 14:26:40,119 INFO [kvm.resource.LibvirtComputingResource] (main:null) (logid:) The [vm.memballoon.stats.period] property is set to '0', this prevents memory statistics from being displayed correctly. Adjust (increase) the value of this parameter to correct this.

We are using KVM native bridges for networking.

On the management server we can see this error in the exception:

2024-02-02 14:46:06,722 DEBUG [c.c.a.m.AgentManagerImpl] (AgentConnectTaskPool-1175:ctx-9a210df2) (logid:139886e2) Failed to handle host connection: java.lang.IllegalArgumentException: Can't add host: x.x.x.x with hostOS, "Red Hat Enterprise Linux"into a cluster, in which there are "Oracle Linux Server" hosts added.

weizhouapache commented 8 months ago

@yashi4engg this is the same issue as https://github.com/apache/cloudstack/issues/8026. You may find the workaround in the comments there.

yashi4engg commented 8 months ago

@weizhouapache, we tried the workaround of replacing the contents of redhat-release with those of the oracle-release file, and we are now able to add the node to the cluster. But now we are unable to create a VM, with the error below, even though we have enough resources.

2024-02-05 14:44:19,773 ERROR [c.c.a.ApiAsyncJobDispatcher] (API-Job-Executor-14:ctx-5789063c job-295587) (logid:5f922a22) Unexpected exception while executing org.apache.cloudstack.api.command.admin.vm.DeployVMCmdByAdmin
com.cloud.utils.exception.CloudRuntimeException: Unable to start a VM [5ece1bb3-22c0-4482-86b3-eff04b2b7e38] due to [Unable to create a deployment for VM instance {"id":89770,"instanceName":"xyz-VM","type":"User","uuid":"5ece1bb3-22c0-4482-86b3-eff04b2b7e38"}].
    at com.cloud.vm.VirtualMachineManagerImpl.start(VirtualMachineManagerImpl.java:841)
    at org.apache.cloudstack.engine.cloud.entity.api.VMEntityManagerImpl.deployVirtualMachine(VMEntityManagerImpl.java:246)
    at org.apache.cloudstack.engine.cloud.entity.api.VirtualMachineEntityImpl.deploy(VirtualMachineEntityImpl.java:214)
    at com.cloud.vm.UserVmManagerImpl.startVirtualMachine(UserVmManagerImpl.java:5401)
    at com.cloud.vm.UserVmManagerImpl.startVirtualMachine(UserVmManagerImpl.java:5251)
    at com.cloud.vm.UserVmManagerImpl.startVirtualMachine(UserVmManagerImpl.java:4876)
    at com.cloud.vm.UserVmManagerImpl.startVirtualMachine(UserVmManagerImpl.java:4865)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:344)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:198)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:163)
    at org.apache.cloudstack.network.contrail.management.EventUtils$EventInterceptor.invoke(EventUtils.java:107)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
    at com.cloud.event.ActionEventInterceptor.invoke(ActionEventInterceptor.java:52)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:175)
    at org.springframework.aop.interceptor.ExposeInvocationInterceptor.invoke(ExposeInvocationInterceptor.java:97)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:186)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:215)
    at com.sun.proxy.$Proxy185.startVirtualMachine(Unknown Source)
    at org.apache.cloudstack.api.command.user.vm.DeployVMCmd.execute(DeployVMCmd.java:754)
    at com.cloud.api.ApiDispatcher.dispatch(ApiDispatcher.java:163)
    at com.cloud.api.ApiAsyncJobDispatcher.runJob(ApiAsyncJobDispatcher.java:112)
    at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.runInContext(AsyncJobManagerImpl.java:620)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable$1.run(ManagedContextRunnable.java:48)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext$1.call(DefaultManagedContext.java:55)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.callWithContext(DefaultManagedContext.java:102)
    at org.apache.cloudstack.managed.context.impl.DefaultManagedContext.runWithContext(DefaultManagedContext.java:52)
    at org.apache.cloudstack.managed.context.ManagedContextRunnable.run(ManagedContextRunnable.java:45)
    at org.apache.cloudstack.framework.jobs.impl.AsyncJobManagerImpl$5.run(AsyncJobManagerImpl.java:568)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: com.cloud.exception.InsufficientServerCapacityException: Unable to create a deployment for VM instance {"id":89770,"instanceName":"xyz-VM","type":"User","uuid":"5ece1bb3-22c0-4482-86b3-eff04b2b7e38"}Scope=interface com.cloud.dc.DataCenter; id=1
    at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:1226)
    at com.cloud.vm.VirtualMachineManagerImpl.orchestrateStart(VirtualMachineManagerImpl.java:5412)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    ... 18 more
2024-02-05 14:44:19,778 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-14:ctx-5789063c job-295587) (logid:5f922a22) Complete async job-295587, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":"530","errortext":"Unable to start a VM [5ece1bb3-22c0-4482-86b3-eff04b2b7e38] due to [Unable to create a deployment for VM instance {"id":89770,"instanceName":"xyz-VM","type":"User","uuid":"5ece1bb3-22c0-4482-86b3-eff04b2b7e38"}]."}

yashi4engg commented 8 months ago

On the hypervisor side we can see the error below in agent.log:

2024-02-05 19:44:07,772 INFO [kvm.resource.LibvirtConnection] (main:null) (logid:) No existing libvirtd connection found. Opening a new one
2024-02-05 19:44:07,886 WARN [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Ignoring libvirt error.
org.libvirt.LibvirtException: Network not found: no network with matching name 'default'
    at org.libvirt.ErrorHandler.processError(Unknown Source)
    at org.libvirt.ErrorHandler.processError(Unknown Source)
    at org.libvirt.Connect.networkLookupByName(Unknown Source)
    at com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.configure(LibvirtComputingResource.java:1081)
    at com.cloud.agent.Agent.(Agent.java:190)
    at com.cloud.agent.AgentShell.launchNewAgent(AgentShell.java:452)
    at com.cloud.agent.AgentShell.launchAgentFromClassInfo(AgentShell.java:431)
    at com.cloud.agent.AgentShell.launchAgent(AgentShell.java:415)
    at com.cloud.agent.AgentShell.start(AgentShell.java:511)
    at com.cloud.agent.AgentShell.main(AgentShell.java:541)

weizhouapache commented 8 months ago

@yashi4engg it would be good to share all the logs for the job.

yashi4engg commented 8 months ago

We are able to create VMs now, and the hosts have been added back to CloudStack. But we still have one question.

Is there a change between 4.18.0 and 4.18.1 that causes this issue? The same hypervisors were added to CloudStack in 4.18.0 without any change, but as soon as we upgraded to 4.18.1, even though the OS version remained the same and no OS files were updated, the hosts could not be added and needed a change to the host.OS property.

Expected: the hosts should be added back without any change, as they were added earlier with the same properties.

weizhouapache commented 8 months ago

I agree with you @yashi4engg

Any idea how to fix it, @DaanHoogland? This is related to #7570.

DaanHoogland commented 8 months ago

If I read this correctly, the file /etc/redhat-release was edited. This is not the correct procedure. Instead, the host details for the hosts in the cluster should be updated. I see this didn't make it into the release notes.

yashi4engg commented 8 months ago

@DaanHoogland, I agree with you, but we did that as a workaround. The host.OS property already showed Oracle in the DB, but the host was still unable to join the cluster, so we made this change and the host was able to join.

Do you suggest updating the host.os property to Red Hat rather than changing the release file?

DaanHoogland commented 8 months ago

I would suggest editing the host detail in the database for the hosts in the cluster to match the contents of the redhat-release file. That way, freshly installed hosts should be able to join the cluster without further manipulation in /etc.

Can you share the original contents of /etc/redhat-release and the value that you replaced it with?

yashi4engg commented 8 months ago

cat /etc/redhat-release
Red Hat Enterprise Linux release 9.2 (Plow)

cat /etc/oracle-release
Oracle Linux Server release 9.2

I have now fixed the issue by updating the Host.OS value in the DB, and reverted the redhat-release contents to those of the default installation, shown above.

yashi4engg commented 8 months ago

The issue is now resolved for us after updating the Host.OS value, but our concern is that this should not be necessary in the general case: a host should be added back by default, without any change, after an upgrade.

DaanHoogland commented 8 months ago

@yashi4engg this is an omission in the installation notes. Every EL host whose /etc/redhat-release file contains more than one word before "release" should have that detail updated in the DB. I remember we discussed this, but it slipped through the cracks somehow. cc @shwstppr @mlsorensen @rohityadavcloud I'll start a doc PR for this.
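To make the "more than one word before release" rule concrete, here is a minimal shell sketch. The function names `name_before_release` and `words_before_release` are hypothetical and are not part of CloudStack; the sketch only illustrates the rule using the release strings posted in this thread.

```shell
# Hypothetical helper: print everything before the word "release"
# in a release-file line (the distribution name).
name_before_release() {
    local line="$1"
    # Drop the first " release" and everything after it.
    printf '%s\n' "${line%% release*}"
}

# Hypothetical helper: count the words in that name.
words_before_release() {
    local name
    name="$(name_before_release "$1")"
    # Unquoted expansion splits the name on whitespace.
    set -- $name
    echo "$#"
}

name_before_release "Red Hat Enterprise Linux release 9.2 (Plow)"   # Red Hat Enterprise Linux
words_before_release "Red Hat Enterprise Linux release 9.2 (Plow)"  # 4
words_before_release "Oracle Linux Server release 9.2"              # 3
```

Both strings from this thread have more than one word before "release", so both fall under the rule described above.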

DaanHoogland commented 8 months ago

On second thought, I'll first give it some thought as to whether it can be, or should have been, automated.

weizhouapache commented 8 months ago

@DaanHoogland I suggest adding a list of compatible OSes

which includes

If we get version from /etc/oracle-release if it exists, we could add

DaanHoogland commented 8 months ago

Your PR would solve the issue completely as we can just add strings like "Red" and "Red Hat" in the list.
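As a rough illustration of the compatibility-list idea (not the actual change in any PR), a check could look like the sketch below. The list contents and the names `COMPATIBLE_OS_NAMES`, `is_compatible_os` and `can_join_cluster` are assumptions made up for this example; the OS strings are the ones mentioned in this thread.

```shell
# Hypothetical list of OS names treated as interchangeable within a cluster.
# The entries are examples from this thread, not CloudStack's real list.
COMPATIBLE_OS_NAMES="Red Hat Enterprise Linux
Oracle Linux Server
Red Hat
Red"

# Succeeds if "$1" exactly matches one line of the list.
is_compatible_os() {
    printf '%s\n' "$COMPATIBLE_OS_NAMES" | grep -Fxq -- "$1"
}

# $1 = OS name reported by the joining host, $2 = OS name of the cluster's hosts.
# Identical names always pass; otherwise both must be on the list.
can_join_cluster() {
    [ "$1" = "$2" ] || { is_compatible_os "$1" && is_compatible_os "$2"; }
}

if can_join_cluster "Red Hat Enterprise Linux" "Oracle Linux Server"; then
    echo "host may join"
fi
```

With a list like this, a host reporting "Red Hat Enterprise Linux" would be accepted into a cluster of "Oracle Linux Server" hosts, while an unrelated OS name would still be rejected.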

yashi4engg commented 8 months ago

I checked this in a bit more detail and found the file responsible for checking the hypervisor OS version, "/usr/share/cloudstack-common/scripts/vm/hypervisor/versions.sh". According to that file, it first looks for redhat-release and, if it exists, gets the details from there.

if [ -f /etc/redhat-release ] ; then
    get_from_redhat_release
    if [ -z "$REV" ] && [ -f /etc/os-release ]; then
        get_from_os_release
    fi
elif [ -f /etc/lsb-release ] ; then
    get_from_lsb_release
elif [ -f /etc/os-release ] ; then
    get_from_os_release
fi

weizhouapache commented 8 months ago

Yes, this can be improved.
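One possible direction, sketched here purely as an assumption, is to check /etc/oracle-release before /etc/redhat-release, as suggested earlier in the thread. `detect_release_file` is a hypothetical name, not a function in versions.sh, and its optional root-directory argument exists only so the sketch can be tried against a scratch directory. Unlike the original snippet, it only picks the file and does not fall back to os-release when the version is empty.

```shell
# Sketch only: print the first release file that exists, preferring
# oracle-release over redhat-release. "$1" is an optional root prefix
# for testing; with no argument the real /etc is checked.
detect_release_file() {
    local root f
    root="${1:-}"
    for f in oracle-release redhat-release lsb-release os-release; do
        if [ -f "$root/etc/$f" ]; then
            echo "/etc/$f"
            return 0
        fi
    done
    return 1
}

# Example: report which file would be used on this machine.
detect_release_file || echo "no known release file found"
```

On an Oracle Linux host that ships both files, this ordering would select /etc/oracle-release, so the reported name would be "Oracle Linux Server" rather than "Red Hat Enterprise Linux".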

rohityadavcloud commented 3 months ago

Fixed by https://github.com/apache/cloudstack/pull/8641