apache / cloudstack

Apache CloudStack is an opensource Infrastructure as a Service (IaaS) cloud computing platform
https://cloudstack.apache.org/
Apache License 2.0
2.05k stars 1.1k forks source link

The cloudstack-agent can not work when upgrade from 4.17.2.0 to 4.18.0.0 #7742

Closed xuanyuanaosheng closed 1 year ago

xuanyuanaosheng commented 1 year ago
ISSUE TYPE
COMPONENT NAME
CLOUDSTACK VERSION
  1. cloudstack-management
    
    # cat /etc/os-release 
    NAME="CentOS Linux"
    VERSION="7 (Core)"
    ID="centos"
    ID_LIKE="rhel fedora"
    VERSION_ID="7"
    PRETTY_NAME="CentOS Linux 7 (Core)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:centos:centos:7"
    HOME_URL="https://www.centos.org/"
    BUG_REPORT_URL="https://bugs.centos.org/"

CENTOS_MANTISBT_PROJECT="CentOS-7" CENTOS_MANTISBT_PROJECT_VERSION="7" REDHAT_SUPPORT_PRODUCT="centos" REDHAT_SUPPORT_PRODUCT_VERSION="7"

rpm -qa | grep cloudstack

cloudstack-management-4.18.0.0-1.el7.x86_64 cloudstack-common-4.18.0.0-1.el7.x86_64

2. cloudstack-agent

cat /etc/os-release

NAME="Oracle Linux Server" VERSION="8.8" ID="ol" ID_LIKE="fedora" VARIANT="Server" VARIANT_ID="server" VERSION_ID="8.8" PLATFORM_ID="platform:el8" PRETTY_NAME="Oracle Linux Server 8.8" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:oracle:linux:8:8:server" HOME_URL="https://linux.oracle.com/" BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8" ORACLE_BUGZILLA_PRODUCT_VERSION=8.8 ORACLE_SUPPORT_PRODUCT="Oracle Linux" ORACLE_SUPPORT_PRODUCT_VERSION=8.8

rpm -qa | grep cloudstack

cloudstack-common-4.18.0.0-1.x86_64 cloudstack-agent-4.18.0.0-1.x86_64


##### SUMMARY
when I upgrade the cloudstack-agent from 4.17.2.0 to 4.18.0.0, the cloudstack-agent can not work fine.

The cloudstack agent status 

systemctl status cloudstack-agent

● cloudstack-agent.service - CloudStack Agent Loaded: loaded (/usr/lib/systemd/system/cloudstack-agent.service; enabled; vendor preset: disabled) Active: activating (auto-restart) (Result: exit-code) since Fri 2023-06-30 12:16:48 CST; 3s ago Docs: http://www.cloudstack.org/ Process: 113054 ExecStart=/usr/bin/java $JAVA_OPTS $JAVA_DEBUG -cp $CLASSPATH $JAVA_CLASS (code=exited, status=1/FAILURE) Main PID: 113054 (code=exited, status=1/FAILURE)


The details error log is

2023-07-13 16:40:59,072 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/versions.sh 2023-07-13 16:40:59,074 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Executing while with timeout : 1800000 2023-07-13 16:40:59,091 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,102 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Executing: sudo grep InitiatorName= /etc/iscsi/initiatorname.iscsi 2023-07-13 16:40:59,103 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Executing while with timeout : 3600000 2023-07-13 16:40:59,119 DEBUG [kvm.resource.LibvirtComputingResource] (Agent-Handler-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,121 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system 2023-07-13 16:40:59,124 ERROR [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Connection with libvirtd is broken: invalid connection pointer in virConnectGetVersion 2023-07-13 16:40:59,124 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Opening a new libvirtd connection to: qemu:///system 2023-07-13 16:40:59,126 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Executing: qemu-img --help 2023-07-13 16:40:59,127 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Executing while with timeout : 3600000 2023-07-13 16:40:59,131 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,134 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Executing: cryptsetup --usage 2023-07-13 16:40:59,134 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Executing while with timeout : 3600000 2023-07-13 16:40:59,138 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,138 DEBUG [utils.script.Script] (Agent-Handler-1:null) (logid:) Usage: cryptsetup [-?Vvyrq] [-?|--help] [--usage] [-V|--version] [-v|--verbose] [--debug] [--debug-json] [-c|--cipher=STRING] [-h|--hash=STRING] [-y|--verify-passphrase] [-d|--key-file=STRING] [--master-key-file=STRING] [--dump-master-key] [-s|--key-size=BITS] [-l|--keyfile-size=bytes] [--keyfile-offset=bytes] [--new-keyfile-size=bytes] [--new-keyfile-offset=bytes] [-S|--key-slot=INT] [-b|--size=SECTORS] [--device-size=bytes] [-o|--offset=SECTORS] [-p|--skip=SECTORS] [-r|--readonly] [-q|--batch-mode] [-t|--timeout=secs] [--progress-frequency=secs] [-T|--tries=INT] [--align-payload=SECTORS] [--header-backup-file=STRING] [--use-random] [--use-urandom] [--shared] [--uuid=STRING] [--allow-discards] [--header=STRING] [--test-passphrase] [--tcrypt-hidden] [--tcrypt-system] [--tcrypt-backup] [--veracrypt] [--veracrypt-pim=INT] [--veracrypt-query-pim] [-M|--type=STRING] [--force-password] [--perf-same_cpu_crypt] [--perf-submit_from_crypt_cpus] [--perf-no_read_workqueue] [--perf-no_write_workqueue] [--deferred] [--serialize-memory-hard-pbkdf] [-i|--iter-time=msecs] [--pbkdf=STRING] [--pbkdf-memory=kilobytes] [--pbkdf-parallel=threads] [--pbkdf-force-iterations=LONG] [--priority=STRING] [--disable-locks] [--disable-keyring] [-I|--integrity=STRING] [--integrity-no-journal] [--integrity-no-wipe] [--integrity-legacy-padding] [--token-only] [--token-id=INT] [--key-description=STRING] [--sector-size=INT] [--iv-large-sectors] [--persistent] [--label=STRING] [--subsystem=STRING] [--unbound] [--json-file=STRING] [--luks2-metadata-size=bytes] [--luks2-keyslots-size=bytes] [--refresh] [--keyslot-key-size=BITS] [--keyslot-cipher=STRING] [--encrypt] [--decrypt] [--init-only] [--resume-only] [--reduce-device-size=bytes] [--hotzone-size=bytes] [--resilience=STRING] [--resilience-hash=STRING] [--active-name=STRING] [OPTION...]

2023-07-13 16:40:59,141 INFO [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:) Attempting to create storage pool 37ca62bd-24fd-4bd5-b569-0890fce8fe90 (Filesystem) in libvirt 2023-07-13 16:40:59,141 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system 2023-07-13 16:40:59,142 INFO [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:) Found existing defined storage pool 37ca62bd-24fd-4bd5-b569-0890fce8fe90, using it. 2023-07-13 16:40:59,143 INFO [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:) Trying to fetch storage pool 37ca62bd-24fd-4bd5-b569-0890fce8fe90 from libvirt 2023-07-13 16:40:59,143 DEBUG [kvm.resource.LibvirtConnection] (Agent-Handler-1:null) (logid:) Looking for libvirtd connection at: qemu:///system 2023-07-13 16:40:59,156 DEBUG [kvm.storage.LibvirtStorageAdaptor] (Agent-Handler-1:null) (logid:) Successfully refreshed pool 37ca62bd-24fd-4bd5-b569-0890fce8fe90 Capacity: (49.98 GB) 53660876800 Used: (11.11 GB) 11928494080 Available: (38.87 GB) 41732382720 2023-07-13 16:40:59,203 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Executing: hostname 2023-07-13 16:40:59,204 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Executing while with timeout : 500 2023-07-13 16:40:59,205 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,206 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Executing: hostname 2023-07-13 16:40:59,206 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Executing while with timeout : 500 2023-07-13 16:40:59,207 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,228 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Sending Startup: Seq 0-0: { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 1, [{"com.cloud.agent.api.StartupRoutingCommand":{"cpuSockets":"2","cpus":"56","speed":"2600","memory":"809151959040","dom0MinMemory":"1073741824","poolSync":"false","supportsClonedVolumes":"false","caps":"hvm,snapshot","pool":"/root","hypervisorType":"KVM","hostDetails":{"Host.OS.Kernel.Version":"5.4.17-2136.321.4.el8uek.x86_64","com.cloud.network.Networks.RouterPrivateIpStrategy":"HostLocal","Host.OS.Version":"8.8","host.volume.encryption":"true","secured":"true","Host.OS":"Oracle Linux Server"},"hostTags":[],"groupDetails":{},"type":"Routing","dataCenter":"10","pod":"9","cluster":"10","guid":"5d98d8a3-57ae-3c08-b2a3-6666a8fefc1e-LibvirtComputingResource","name":"ljzdckvm001.cn.prod","id":"0","version":"4.18.0.0","iqn":"iqn.1988-12.com.oracle:acefa7ff557d","publicIpAddress":"10.67.128.1","publicNetmask":"255.255.255.0","publicMacAddress":"dc:f4:01:dd:a9:8f","privateIpAddress":"10.67.128.1","privateMacAddress":"dc:f4:01:dd:a9:8f","privateNetmask":"255.255.255.0","storageIpAddress":"10.67.128.1","storageNetmask":"255.255.255.0","storageMacAddress":"dc:f4:01:dd:a9:8f","resourceName":"LibvirtComputingResource","gatewayIpAddress":"10.67.128.254","msHostList":"10.25.2.173@static","wait":"0","bypassHostMaintenance":"false"}},{"com.cloud.agent.api.StartupStorageCommand":{"totalSize":"(0 bytes) 0","poolInfo":{"uuid":"37ca62bd-24fd-4bd5-b569-0890fce8fe90","host":"10.67.128.1","localPath":"/var/lib/libvirt/images","hostPath":"/var/lib/libvirt/images","poolType":"Filesystem","capacityBytes":"(49.98 GB) 53660876800","availableBytes":"(38.87 GB) 41732382720"},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"Storage","dataCenter":"10","pod":"9","guid":"5d98d8a3-57ae-3c08-b2a3-6666a8fefc1e-LibvirtComputingResource","name":"ljzdckvm001.cn.prod","id":"0","version":"4.18.0.0","resourceName":"LibvirtComputingResource","msHostList":"10.25.2.173@static","wait":"0","bypassHostMaintenance":"false"}}] } 2023-07-13 16:40:59,228 DEBUG [cloud.agent.Agent] (Agent-Handler-1:null) (logid:) Startup task created 2023-07-13 16:40:59,311 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:) Request:Seq -1--1: { Cmd , MgmtId: -1, via: -1, Ver: v1, Flags: 111, [{"com.cloud.agent.api.ReadyCommand":{"_details":"java.lang.IllegalArgumentException: Can't add host: 10.67.128.1 with hostOS: Oracle Linux Server into a cluster,in which there are Red hosts added","wait":"0","bypassHostMaintenance":"false"}}] } 2023-07-13 16:40:59,311 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Received response: Seq 0-0: { Ans: , MgmtId: 345052215515, via: -1, Ver: v1, Flags: 100000, [{"com.cloud.agent.api.StartupAnswer":{"hostId":"0","pingInterval":"60","result":"true","wait":"0","bypassHostMaintenance":"false"}}] } 2023-07-13 16:40:59,311 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:) Processing command: com.cloud.agent.api.ReadyCommand 2023-07-13 16:40:59,312 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Startup task cancelled 2023-07-13 16:40:59,312 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Process agent startup answer, agent id = 0 2023-07-13 16:40:59,312 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Set agent id 0 2023-07-13 16:40:59,312 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:) Not ready to connect to mgt server: java.lang.IllegalArgumentException: Can't add host: 10.67.128.1 with hostOS: Oracle Linux Server into a cluster,in which there are Red hosts added 2023-07-13 16:40:59,314 INFO [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Stopping the agent: Reason = sig.kill 2023-07-13 16:40:59,315 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) (logid:) Sending shutdown to management server 2023-07-13 16:40:59,319 DEBUG [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Adding a watch list 2023-07-13 16:40:59,319 INFO [cloud.agent.Agent] (Agent-Handler-2:null) (logid:) Startup Response Received: agent id = 0 2023-07-13 16:40:59,323 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-1:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/network/security_group.py get_rule_logs_for_vms 2023-07-13 16:40:59,324 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-1:null) (logid:) Executing while with timeout : 1800000 2023-07-13 16:40:59,398 DEBUG [kvm.resource.LibvirtComputingResource] (UgentTask-1:null) (logid:) Execution is successful. 2023-07-13 16:40:59,401 DEBUG [kvm.resource.LibvirtConnection] (UgentTask-1:null) (logid:) Looking for libvirtd connection at: qemu:///system 2023-07-13 16:40:59,404 DEBUG [cloud.agent.Agent] (UgentTask-1:null) (logid:) Sending ping: Seq 0-1: { Cmd , MgmtId: -1, via: 0, Ver: v1, Flags: 11, [{"com.cloud.agent.api.PingRoutingWithNwGroupsCommand":{"newGroupStates":{},"_hostVmStateReport":{},"_gatewayAccessible":"true","_vnetAccessible":"true","hostType":"Routing","hostId":"0","wait":"0","bypassHostMaintenance":"false"}}] }


I checked the https://github.com/apache/cloudstack/issues/3509 , the agent env is 

rpm -qa | grep libvirt

libvirt-daemon-driver-nodedev-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-config-nwfilter-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-lock-sanlock-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-gluster-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-docs-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-libs-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-logical-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-core-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-config-network-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-scsi-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-devel-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-gobject-3.0.0-1.el8.x86_64 libvirt-daemon-driver-interface-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-iscsi-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-client-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-secret-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-mpath-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-nss-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 python3-libvirt-8.0.0-2.module+el8.8.0+20990+60c1530a.x86_64 libvirt-dbus-1.3.0-2.module+el8.8.0+20990+60c1530a.x86_64 libvirt-daemon-driver-network-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-disk-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-gconfig-3.0.0-1.el8.x86_64 libvirt-daemon-driver-storage-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-wireshark-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 pcp-pmda-libvirt-5.3.7-16.0.2.el8.x86_64 libvirt-daemon-driver-storage-iscsi-direct-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-storage-rbd-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-glib-3.0.0-1.el8.x86_64 libvirt-daemon-driver-nwfilter-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-driver-qemu-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64 libvirt-daemon-kvm-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64

rpm -qa | grep qemu

qemu-guest-agent-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-ui-opengl-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-ui-spice-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-hw-usbredir-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-block-ssh-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-docs-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-block-rbd-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-common-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-block-curl-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-core-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-img-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-block-iscsi-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 qemu-kvm-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 ipxe-roms-qemu-20181214-11.git133f4c47.el8.noarch qemu-kvm-block-gluster-6.2.0-32.module+el8.8.0+21044+01700444.x86_64 libvirt-daemon-driver-qemu-8.0.0-19.0.2.module+el8.8.0+21112+1cc1a24b.x86_64

uname -a

Linux test001 5.4.17-2136.321.4.el8uek.x86_64 #2 SMP Mon Jun 26 18:17:37 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux



I restart the agent node and system, It still can not work.

I think it is a  libvirtd and cloudstack connection problem,  cloud you give some advices? I config the libvirtd using :https://www.sbarjatiya.com/notes_wiki/index.php/CentOS_8.x_Cloudstack_4.15_Setup_KVM_host

The unupgrade agent node (4.17.2.0) is still work fine.
DaanHoogland commented 1 year ago

@xuanyuanaosheng , I see no error in the detail log you provided. Is this all the log you have? Can you to add a freshly installed host with a non-upgraded 4.18 agent?

xuanyuanaosheng commented 1 year ago

@DaanHoogland I was editing the issue, please see it again.

This is the agent log, please also see it.

agent.log

If need more info, I am glad to provide it.

xuanyuanaosheng commented 1 year ago

@DaanHoogland I am using below cmd to get the agent log, please also see it.

/usr/bin/java -Djava.io.tmpdir=/usr/share/cloudstack-agent/tmp -Xms256m -Xmx2048m -cp /usr/share/cloudstack-agent/lib/*:/usr/share/cloudstack-agent/plugins/*:/etc/cloudstack/agent:/usr/share/cloudstack-common/scripts com.cloud.agent.AgentShell &> cmd.log

cmd.log

I roll back cloudstack-agent from 4.18.0.0 to 4.17.2.0, The cloudstack-agent works fine.

# yum install ./cloudstack-common-4.17.2.0-1.el8.x86_64.rpm ./cloudstack-agent-4.17.2.0-1.el8.x86_64.rpm 
Last metadata expiration check: 3:12:52 ago on Thu 13 Jul 2023 01:55:17 PM CST.
Dependencies resolved.
=======================================================================================================================================================
 Package                                 Architecture                 Version                                 Repository                          Size
=======================================================================================================================================================
Downgrading:
 cloudstack-agent                        x86_64                       4.17.2.0-1.el8                          @commandline                        77 M
 cloudstack-common                       x86_64                       4.17.2.0-1.el8                          @commandline                        95 M

Transaction Summary
=======================================================================================================================================================
Downgrade  2 Packages

Total size: 172 M
Is this ok [y/N]: y
Downloading Packages:
Running transaction check
Transaction check succeeded.
Running transaction test
Transaction test succeeded.
Running transaction
  Preparing        :                                                                                                                               1/1 
  Running scriptlet: cloudstack-common-4.17.2.0-1.el8.x86_64                                                                                       1/1 
  Downgrading      : cloudstack-common-4.17.2.0-1.el8.x86_64                                                                                       1/4 
  Running scriptlet: cloudstack-agent-4.17.2.0-1.el8.x86_64                                                                                        2/4 
  Downgrading      : cloudstack-agent-4.17.2.0-1.el8.x86_64                                                                                        2/4 
warning: /etc/cloudstack/agent/agent.properties created as /etc/cloudstack/agent/agent.properties.rpmnew

  Running scriptlet: cloudstack-agent-4.17.2.0-1.el8.x86_64                                                                                        2/4 
Running /usr/bin/cloudstack-agent-upgrade to update bridge name for upgrade from CloudStack 4.0.x (and before) to CloudStack 4.1 (and later)
Upgrade succeed
Redirecting to /bin/systemctl restart libvirtd.service

  Running scriptlet: cloudstack-agent-4.18.0.0-1.x86_64                                                                                            3/4 
Redirecting to /bin/systemctl stop cloudstack-agent.service

  Cleanup          : cloudstack-agent-4.18.0.0-1.x86_64                                                                                            3/4 
  Cleanup          : cloudstack-common-4.18.0.0-1.x86_64                                                                                           4/4 
  Running scriptlet: cloudstack-agent-4.17.2.0-1.el8.x86_64                                                                                        4/4 

Thank you for installing Apache CloudStack Agent 4.17.2.0 !

 * Project website:     https://cloudstack.apache.org/
 * Documentation:       https://docs.cloudstack.apache.org/en/4.17.2.0/
 * Release notes:       https://docs.cloudstack.apache.org/en/4.17.2.0/releasenotes/
 * Join mailing lists:  https://cloudstack.apache.org/mailing-lists.html
 * Take the survey:     https://cloudstack.apache.org/survey.html
 * Report issues:       https://github.com/apache/cloudstack/issues/new

  Running scriptlet: cloudstack-common-4.18.0.0-1.x86_64                                                                                           4/4 
  Verifying        : cloudstack-common-4.17.2.0-1.el8.x86_64                                                                                       1/4 
  Verifying        : cloudstack-common-4.18.0.0-1.x86_64                                                                                           2/4 
  Verifying        : cloudstack-agent-4.17.2.0-1.el8.x86_64                                                                                        3/4 
  Verifying        : cloudstack-agent-4.18.0.0-1.x86_64                                                                                            4/4 

Downgraded:
  cloudstack-agent-4.17.2.0-1.el8.x86_64                                    cloudstack-common-4.17.2.0-1.el8.x86_64                                   

Complete!

# systemctl status cloudstack-agent.service 
● cloudstack-agent.service - CloudStack Agent
   Loaded: loaded (/usr/lib/systemd/system/cloudstack-agent.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2023-07-13 17:08:52 CST; 2min 31s ago
     Docs: http://www.cloudstack.org/
 Main PID: 71471 (java)
    Tasks: 69 (limit: 3355442)
   Memory: 209.4M
   CGroup: /system.slice/cloudstack-agent.service
           └─71471 /usr/bin/java -Djava.io.tmpdir=/usr/share/cloudstack-agent/tmp -Xms256m -Xmx2048m -cp /usr/share/cloudstack-agent/lib/*:/usr/share

the log is agent-30000.log

DaanHoogland commented 1 year ago

so to be clear cmd.log and agent.log are of agents that do not work, and agent-30000.log is of the version that works as expected?

xuanyuanaosheng commented 1 year ago

@DaanHoogland Yes. cmd.log and agent.log are of agents (4.18.0.0) that do not work, cmd.log is using cmd /usr/bin/java -Djava.io.tmpdir=/usr/share/cloudstack-agent/tmp -Xms256m -Xmx2048m -cp /usr/share/cloudstack-agent/lib/*:/usr/share/cloudstack-agent/plugins/*:/etc/cloudstack/agent:/usr/share/cloudstack-common/scripts com.cloud.agent.AgentShell &> cmd.log to start the cloudstack-agent, agent.log is using systemctl to start the cloudstack-agent (/var/log/cloudstack/agent/agent.log).

The agent-30000.log is the log (works fine) that roll back to 4.17.2.0.

xuanyuanaosheng commented 1 year ago

@DaanHoogland

I reinstall a new server using 4.18.0.0, But the cloudstack-agent can not started, it quit quickly.

The details logs is:

log4j:WARN No appenders could be found for logger (com.cloud.agent.AgentShell).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
log4j: reset attribute= "false".
log4j: Threshold ="null".
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [com.cloud] additivity to [true].
log4j: Level value for com.cloud is  [DEBUG].
log4j: com.cloud level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [com.cloud.agent.metrics] additivity to [true].
log4j: Level value for com.cloud.agent.metrics is  [DEBUG].
log4j: com.cloud.agent.metrics level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [com.cloud.agent.resource.computing.ComputingResource$StorageMonitorTask] additivity to [true].
log4j: Level value for com.cloud.agent.resource.computing.ComputingResource$StorageMonitorTask is  [DEBUG].
log4j: com.cloud.agent.resource.computing.ComputingResource$StorageMonitorTask level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [org.apache] additivity to [true].
log4j: Level value for org.apache is  [DEBUG].
log4j: org.apache level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [org] additivity to [true].
log4j: Level value for org is  [DEBUG].
log4j: org level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [net] additivity to [true].
log4j: Level value for net is  [DEBUG].
log4j: net level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [com.amazonaws] additivity to [true].
log4j: Level value for com.amazonaws is  [DEBUG].
log4j: com.amazonaws level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [httpclient.wire] additivity to [true].
log4j: Level value for httpclient.wire is  [DEBUG].
log4j: httpclient.wire level set to DEBUG
log4j: Retreiving an instance of org.apache.log4j.Logger.
log4j: Setting [org.apache.http.wire] additivity to [true].
log4j: Level value for org.apache.http.wire is  [DEBUG].
log4j: org.apache.http.wire level set to DEBUG
log4j: Level value for root is  [DEBUG].
log4j: root level set to DEBUG
log4j: Class name: [org.apache.log4j.ConsoleAppender]
log4j: Setting property [target] to [System.out].
log4j: Setting property [threshold] to [DEBUG].
log4j: Parsing layout of class: "org.apache.log4j.PatternLayout"
log4j: Setting property [conversionPattern] to [%-5p [%c{3}] (%t:%x) (logid:%X{logcontextid}) %m%n].
log4j: Adding appender named [CONSOLE] to category [root].
log4j: Class name: [org.apache.log4j.rolling.RollingFileAppender]
log4j: Setting property [append] to [true].
log4j: Setting property [threshold] to [DEBUG].
log4j: Setting property [fileNamePattern] to [/var/log/cloudstack/agent/agent.log.%d{yyyy-MM-dd}.gz].
log4j: Setting property [activeFileName] to [/var/log/cloudstack/agent/agent.log].
log4j: Parsing layout of class: "org.apache.log4j.EnhancedPatternLayout"
log4j: Setting property [conversionPattern] to [%d{ISO8601} %-5p [%c{3}] (%t:%x) (logid:%X{logcontextid}) %m%n].
log4j: setFile called: /var/log/cloudstack/agent/agent.log, true
log4j: setFile ended
log4j: Adding appender named [FILE] to category [root].
INFO  [cloud.agent.AgentShell] (main:) (logid:) Agent started
INFO  [cloud.agent.AgentShell] (main:) (logid:) Implementation Version is 4.18.0.0
INFO  [cloud.agent.AgentShell] (main:) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
DEBUG [agent.properties.AgentPropertiesFileHandler] (main:) (logid:) Property [host] was altered. Now using the value [localhost].
SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
DEBUG [agent.properties.AgentPropertiesFileHandler] (main:) (logid:) Property [guid] has empty or null value. Using default value [null].
DEBUG [agent.properties.AgentPropertiesFileHandler] (main:) (logid:) Property [developer] has empty or null value. Using default value [false].
Unable to find the guid
DaanHoogland commented 1 year ago

I reinstall a new server using 4.18.0.0, But the cloudstack-agent can not started, it quit quickly.

This sounds very serious but I have not heard of this before. We run fresh environments on a daily basis in tests.So let's investigate further;

First check this,

# MANDATORY: The GUID to identify the agent with.
# Generated with "uuidgen"
guid=

Do the upgraded or old hosts have this filled?

The extensive log4j output on start is also new to me. Can you compare that to the upgraded host?

the line

DEBUG [agent.properties.AgentPropertiesFileHandler] (main:) (logid:) Property [host] was altered. Now using the value [localhost].

also seems strange. can you also compare and check if the host has a valid address?

harikrishna-patnala commented 1 year ago

The issue seems to be with cryptsetup version on the host or it could be combination of "Oracle Linux Server" and cryptsetup.

@xuanyuanaosheng can you please check the cryptsetup version on your host.

On our hosts these are the versions. [root@pr7727-t7011-kvm-centos7-kvm1 ~]# cryptsetup --version cryptsetup 2.0.3 [root@pr7511-t6776-kvm-rocky8-kvm1 ~]# cryptsetup --version cryptsetup 2.3.7

and the reason it is working in 4.17 is seems to be because cryptsetup usage is introduced in 4.18 release. I know its usage in PowerFlex storage for encryption, not sure about other usecases.

xuanyuanaosheng commented 1 year ago

@DaanHoogland

The cloudstack can started when the host do not add to cloudstack cluster in 4.17.2.0.

The cloudstack can not started when the host do not add to cloudstack cluster in 4.18.0.0.

The start log is

2023-07-14 18:37:02,081 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:02,085 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:02,086 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:02,094 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:02,118 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:02,118 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
2023-07-14 18:37:12,543 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:12,546 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:12,547 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:12,555 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:12,580 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:12,580 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
2023-07-14 18:37:23,041 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:23,044 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:23,045 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:23,053 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:23,078 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:23,078 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
2023-07-14 18:37:24,450 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:24,453 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:24,454 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:24,462 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:24,486 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:24,487 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
2023-07-14 18:37:35,041 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:35,044 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:35,045 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:35,053 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:35,077 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:35,078 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
2023-07-14 18:37:45,543 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:45,547 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:45,548 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:45,556 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:45,580 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:45,581 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
2023-07-14 18:37:56,042 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-14 18:37:56,046 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.18.0.0
2023-07-14 18:37:56,047 INFO  [cloud.agent.AgentShell] (main:null) (logid:) agent.properties found at /etc/cloudstack/agent/agent.properties
2023-07-14 18:37:56,055 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [host] was altered. Now using the value [localhost].
2023-07-14 18:37:56,079 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [guid] has empty or null value. Using default value [null].
2023-07-14 18:37:56,079 DEBUG [agent.properties.AgentPropertiesFileHandler] (main:null) (logid:) Property [developer] has empty or null value. Using default value [false].
xuanyuanaosheng commented 1 year ago

The issue seems to be with cryptsetup version on the host or it could be combination of "Oracle Linux Server" and cryptsetup.

@xuanyuanaosheng can you please check the cryptsetup version on your host.

On our hosts these are the versions. [root@pr7727-t7011-kvm-centos7-kvm1 ~]# cryptsetup --version cryptsetup 2.0.3 [root@pr7511-t6776-kvm-rocky8-kvm1 ~]# cryptsetup --version cryptsetup 2.3.7

and the reason it is working in 4.17 is seems to be because cryptsetup usage is introduced in 4.18 release. I know its usage in PowerFlex storage for encryption, not sure about other usecases.

@harikrishna-patnala I do not know why this happen. wish some one can do some test and give more advices for this issue.

# cryptsetup --version
cryptsetup 2.3.7

# cat /etc/os-release 
NAME="Oracle Linux Server"
VERSION="8.8"
ID="ol"
ID_LIKE="fedora"
VARIANT="Server"
VARIANT_ID="server"
VERSION_ID="8.8"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Oracle Linux Server 8.8"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:oracle:linux:8:8:server"
HOME_URL="https://linux.oracle.com/"
BUG_REPORT_URL="https://github.com/oracle/oracle-linux"

ORACLE_BUGZILLA_PRODUCT="Oracle Linux 8"
ORACLE_BUGZILLA_PRODUCT_VERSION=8.8
ORACLE_SUPPORT_PRODUCT="Oracle Linux"
ORACLE_SUPPORT_PRODUCT_VERSION=8.8

# uname -a
Linux test001 5.4.17-2136.321.4.el8uek.x86_64 #2 SMP Mon Jun 26 18:17:37 PDT 2023 x86_64 x86_64 x86_64 GNU/Linux
mlsorensen commented 1 year ago

I don't see an issue with cryptsetup in the log, it does print the test help output but that's normal. In their provided log, this is the issue:

2023-07-13 16:49:03,766 DEBUG [cloud.agent.Agent] (agentRequest-Handler-2:null) (logid:) Not ready to connect to mgt server: java.lang.IllegalArgumentException: Can't add host: 10.67.128.1 with hostOS: Oracle Linux Server into a cluster,in which there are Red hosts added

The target cluster is "Red" OS and this host is "Oracle" - mixing hypervisor OS in same cluster is not allowed.

mlsorensen commented 1 year ago

Find the cluster you are targeting, and check the OS details of other hosts that may be in this cluster. If they don't match you'll get this error.

Run this and see what its output is:

/usr/share/cloudstack-common/scripts/vm/hypervisor/versions.sh

With the agent 4.17 install you have, we see it actually advertises itself as "Red" Host.OS and joins. It complains about not being able to parse the output of versions.sh

2023-07-13 17:01:17,019 DEBUG [kvm.resource.LibvirtConnection] (AgentShutdownThread:null) (logid:) Looking for libvirtd connection at: qemu:///system
2023-07-13 17:08:53,321 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Agent started
2023-07-13 17:08:53,325 INFO  [cloud.agent.AgentShell] (main:null) (logid:) Implementation Version is 4.17.2.0
...
2023-07-13 17:08:53,854 DEBUG [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Executing: /usr/share/cloudstack-common/scripts/vm/hypervisor/versions.sh 
2023-07-13 17:08:53,855 DEBUG [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Executing while with timeout : 1800000
2023-07-13 17:08:53,867 DEBUG [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Execution is successful.
2023-07-13 17:08:53,867 WARN  [kvm.resource.LibvirtComputingResource] (main:null) (logid:) Failed to parse Script output: Host.OS.Version=

In contrast, when you start with agent 4.18, it seems to properly detect the Oracle OS and reports that, which causes a failure to join the cluster.

I think this is a fix in 4.18 that may have exposed that 4.17 is not working correctly in determining the host OS.

If you are wanting to upgrade this cluster to 4.18, I think you'll need to go update all of the host details to match @xuanyuanaosheng "Host.OS":"Oracle Linux Server" and get them back on track. When this host joins, the code finds another host in the cluster and checks its OS to compare, and throws the error we see above if they don't match.

xuanyuanaosheng commented 1 year ago

@DaanHoogland @harikrishna-patnala as @mlsorensen said, It is because the checked the OS don't match.

I upgrade all the node in cluster to 4.18, But is is still failed to upgrade.

I replace the /usr/share/cloudstack-common/scripts/vm/hypervisor/versions.sh in 4.18 host to the file on 4.17, It has worked.

I think may be the info about the host was stored in the databases.

Thanks all of you.