fabriziosalmi / proxmox-vm-autoscale

Automatically scale virtual machine resources on Proxmox hosts
MIT License

No cpu usage returned so scaling cpu down to min. #8

Open ErrorCode67 opened 1 week ago

ErrorCode67 commented 1 week ago

Proxmox: Virtual Environment 8.2.7
proxmox-vm-autoscale: fresh install on Ubuntu 24.04

Not pulling CPU usage and getting the following:

2024-10-03 10:29:54,234 [WARNING] vm_resource_manager: Could not parse CPU usage information from output: balloon: 4294967296

From the log file:

2024-10-03 10:28:44,196 [INFO] ssh_utils: Command executed successfully on 192.168.2.46: qm status 67060 --verbose
2024-10-03 10:28:44,198 [WARNING] vm_resource_manager: Could not parse CPU usage information from output:
balloon: 4294967296
ballooninfo: actual: 4294967296 free_mem: 280334336 last_update: 1727969322 major_page_faults: 4233 max_mem: 4294967296 mem_swapped_in: 0 mem_swapped_out: 0 minor_page_faults: 3377041 total_mem: 4162101248
blockstat: scsi0: account_failed: 1 account_invalid: 1 failed_flush_operations: 0 failed_rd_operations: 0 failed_unmap_operations: 0 failed_wr_operations: 0 failed_zone_append_operations: 0 flush_operations: 782 flush_total_time_ns: 10078838501 idle_time_ns: 2600370805 invalid_flush_operations: 0 invalid_rd_operations: 0 invalid_unmap_operations: 0 invalid_wr_operations: 0 invalid_zone_append_operations: 0 rd_bytes: 1475074048 rd_merged: 0 rd_operations: 37171 rd_total_time_ns: 360181298620 timed_stats: unmap_bytes: 0 unmap_merged: 0 unmap_operations: 0 unmap_total_time_ns: 0 wr_bytes: 75742208 wr_highest_offset: 19666206720 wr_merged: 0 wr_operations: 7974 wr_total_time_ns: 114147578043 zone_append_bytes: 0 zone_append_merged: 0 zone_append_operations: 0 zone_append_total_time_ns: 0
cpus: 1
disk: 0
diskread: 1475074048
diskwrite: 75742208
freemem: 280334336
maxdisk: 21474836480
maxmem: 4294967296
mem: 3881766912
name: DC01-Media-HandBrake
netin: 1890299297
netout: 635162795
nics: tap67060i0: netin: 4136467 netout: 3604300 tap67060i1: netin: 1886162830 netout: 631558495
pid: 2405571
proxmox-support: backup-fleecing: 1 backup-max-workers: 1 pbs-dirty-bitmap: 1 pbs-dirty-bitmap-migration: 1 pbs-dirty-bitmap-savevm: 1 pbs-library-version: 1.4.1 (UNKNOWN) pbs-masterkey: 1 query-bitmap-info: 1
qmpstatus: running
running-machine: pc-i440fx-9.0+pve0
running-qemu: 9.0.2
status: running
uptime: 1527
vmid: 67060
2024-10-03 10:28:44,198 [INFO] vm_autoscale: VM 67060 - CPU Usage: 0.0%, RAM Usage: 6.527042388916016%
2024-10-03 10:28:44,198 [DEBUG] paramiko.transport: [chan 3] Max packet in: 32768 bytes
2024-10-03 10:28:44,238 [DEBUG] paramiko.transport: [chan 3] Max packet out: 32768 bytes
2024-10-03 10:28:44,238 [DEBUG] paramiko.transport: Secsh channel 3 opened.
2024-10-03 10:28:44,240 [DEBUG] paramiko.transport: [chan 3] Sesch channel 3 request ok
2024-10-03 10:28:45,894 [DEBUG] paramiko.transport: [chan 3] EOF received (3)
2024-10-03 10:28:45,895 [DEBUG] paramiko.transport: [chan 3] EOF sent (3)
2024-10-03 10:28:45,895 [INFO] ssh_utils: Command executed successfully on 192.168.2.46: pvesh get /nodes/$(hostname)/status --output-format json
2024-10-03 10:28:45,896 [INFO] host_resource_checker: Host CPU Usage: 0.00%, Host RAM Usage: 71.18%
2024-10-03 10:28:45,896 [INFO] vm_autoscale: Scaling down CPU for VM 67060
2024-10-03 10:28:45,896 [DEBUG] paramiko.transport: [chan 4] Max packet in: 32768 bytes
2024-10-03 10:28:45,936 [DEBUG] paramiko.transport: [chan 4] Max packet out: 32768 bytes
2024-10-03 10:28:45,936 [DEBUG] paramiko.transport: Secsh channel 4 opened.
2024-10-03 10:28:45,938 [DEBUG] paramiko.transport: [chan 4] Sesch channel 4 request ok
2024-10-03 10:28:47,311 [DEBUG] paramiko.transport: [chan 4] EOF received (4)
2024-10-03 10:28:47,312 [DEBUG] paramiko.transport: [chan 4] EOF sent (4)
2024-10-03 10:28:47,312 [INFO] ssh_utils: Command executed successfully on 192.168.2.46: qm config 67060
2024-10-03 10:28:47,313 [DEBUG] paramiko.transport: [chan 5] Max packet in: 32768 bytes
2024-10-03 10:28:47,353 [DEBUG] paramiko.transport: [chan 5] Max packet out: 32768 bytes
2024-10-03 10:28:47,353 [DEBUG] paramiko.transport: Secsh channel 5 opened.
2024-10-03 10:28:47,355 [DEBUG] paramiko.transport: [chan 5] Sesch channel 5 request ok
2024-10-03 10:28:48,680 [DEBUG] paramiko.transport: [chan 5] EOF received (5)
2024-10-03 10:28:48,680 [DEBUG] paramiko.transport: [chan 5] EOF sent (5)
2024-10-03 10:28:48,681 [INFO] ssh_utils: Command executed successfully on 192.168.2.46: qm config 67060
2024-10-03 10:28:48,681 [INFO] vm_resource_manager: Current Cores: 1, Current vCPUs: 1, Max Cores: 16, Min Cores: 1
2024-10-03 10:28:48,681 [INFO] vm_resource_manager: No scaling action required for CPU direction 'down'.
2024-10-03 10:28:48,681 [WARNING] vm_autoscale: No notification method is enabled or configured correctly. Message: Scaled down CPU for VM 67060 due to low usage.
2024-10-03 10:28:48,681 [INFO] vm_autoscale: Scaling down RAM for VM 67060
2024-10-03 10:28:48,682 [INFO] vm_resource_manager: Scaling operations are on cooldown. Next scaling allowed after 299 seconds.
2024-10-03 10:28:48,682 [WARNING] vm_autoscale: No notification method is enabled or configured correctly. Message: Scaled down RAM for VM 67060 due to low usage.
2024-10-03 10:28:48,682 [INFO] vm_autoscale: Scaling not enabled for VM 2011
2024-10-03 10:28:48,682 [INFO] ssh_utils: SSH connection closed for 192.168.2.46
2024-10-03 10:28:48,683 [DEBUG] paramiko.transport: Dropping user packet because connection is dead.
2024-10-03 10:28:48,946 [DEBUG] paramiko.transport: starting thread (client mode): 0x73141790
2024-10-03 10:28:48,946 [DEBUG] paramiko.transport: Local version/idstring: SSH-2.0-paramiko_2.12.0
2024-10-03 10:28:48,965 [DEBUG] paramiko.transport: Remote version/idstring: SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u3
2024-10-03 10:28:48,965 [INFO] paramiko.transport: Connected (version 2.0, client OpenSSH_9.2p1)
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: === Key exchange possibilities ===
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: kex algos: sntrup761x25519-sha512@openssh.com, curve25519-sha256, curve25519-sha256@libssh.org, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, diffie-hellman-group-exchange-sha256, diffie-hellman-group16-sha512, diffie-hellman-group18-sha512, diffie-hellman-group14-sha256, kex-strict-s-v00@openssh.com
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: server key: rsa-sha2-512, rsa-sha2-256, ecdsa-sha2-nistp256, ssh-ed25519
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: client encrypt: chacha20-poly1305@openssh.com, aes128-ctr, aes192-ctr, aes256-ctr, aes128-gcm@openssh.com, aes256-gcm@openssh.com
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: server encrypt: chacha20-poly1305@openssh.com, aes128-ctr, aes192-ctr, aes256-ctr, aes128-gcm@openssh.com, aes256-gcm@openssh.com
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: client mac: umac-64-etm@openssh.com, umac-128-etm@openssh.com, hmac-sha2-256-etm@openssh.com, hmac-sha2-512-etm@openssh.com, hmac-sha1-etm@openssh.com, umac-64@openssh.com, umac-128@openssh.com, hmac-sha2-256, hmac-sha2-512, hmac-sha1
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: server mac: umac-64-etm@openssh.com, umac-128-etm@openssh.com, hmac-sha2-256-etm@openssh.com, hmac-sha2-512-etm@openssh.com, hmac-sha1-etm@openssh.com, umac-64@openssh.com, umac-128@openssh.com, hmac-sha2-256, hmac-sha2-512, hmac-sha1
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: client compress: none, zlib@openssh.com
2024-10-03 10:28:48,967 [DEBUG] paramiko.transport: server compress: none, zlib@openssh.com
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: client lang:
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: server lang:
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: kex follows: False
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: === Key exchange agreements ===
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: Strict kex mode: True
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: Kex: curve25519-sha256@libssh.org
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: HostKey: ssh-ed25519
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: Cipher: aes128-ctr
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: MAC: hmac-sha2-256
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: Compression: none
2024-10-03 10:28:48,968 [DEBUG] paramiko.transport: === End of kex handshake ===
2024-10-03 10:28:48,976 [DEBUG] paramiko.transport: Resetting outbound seqno after NEWKEYS due to strict mode
2024-10-03 10:28:48,976 [DEBUG] paramiko.transport: kex engine KexCurve25519 specified hash_algo
2024-10-03 10:28:48,976 [DEBUG] paramiko.transport: Switch to new keys ...
2024-10-03 10:28:48,977 [DEBUG] paramiko.transport: Resetting inbound seqno after NEWKEYS due to strict mode
2024-10-03 10:28:48,977 [DEBUG] paramiko.transport: Adding ssh-ed25519 host key for 192.168.2.47: b'6db19c67a49a087e9f60b2becc04f169'
2024-10-03 10:28:48,977 [DEBUG] paramiko.transport: Got EXT_INFO: {'server-sig-algs': b'ssh-ed25519,sk-ssh-ed25519@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,webauthn-sk-ecdsa-sha2-nistp256@openssh.com,ssh-dss,ssh-rsa,rsa-sha2-256,rsa-sha2-512', 'publickey-hostbound@openssh.com': b'0'}
2024-10-03 10:28:48,978 [DEBUG] paramiko.transport: Trying SSH key b'd28ad05c05c4f68c22d46a8ac640e82b'
2024-10-03 10:28:49,018 [DEBUG] paramiko.transport: userauth is OK
2024-10-03 10:28:49,018 [DEBUG] paramiko.transport: Finalizing pubkey algorithm for key of type 'ssh-rsa'
2024-10-03 10:28:49,018 [DEBUG] paramiko.transport: Our pubkey algorithm list: ['rsa-sha2-512', 'rsa-sha2-256', 'ssh-rsa']
2024-10-03 10:28:49,018 [DEBUG] paramiko.transport: Server-side algorithm list: ['ssh-ed25519', 'sk-ssh-ed25519@openssh.com', 'ecdsa-sha2-nistp256', 'ecdsa-sha2-nistp384', 'ecdsa-sha2-nistp521', 'sk-ecdsa-sha2-nistp256@openssh.com', 'webauthn-sk-ecdsa-sha2-nistp256@openssh.com', 'ssh-dss', 'ssh-rsa', 'rsa-sha2-256', 'rsa-sha2-512']
2024-10-03 10:28:49,018 [DEBUG] paramiko.transport: Agreed upon 'rsa-sha2-512' pubkey algorithm
2024-10-03 10:28:49,038 [INFO] paramiko.transport: Authentication (publickey) successful!
2024-10-03 10:28:49,039 [INFO] ssh_utils: Successfully connected to 192.168.2.47
2024-10-03 10:28:49,039 [INFO] ssh_utils: SSH connection closed for 192.168.2.47
2024-10-03 10:28:49,079 [DEBUG] paramiko.transport: EOF in transport thread
2024-10-03 10:28:49,324 [DEBUG] paramiko.transport: starting thread (client mode): 0x73142600
2024-10-03 10:28:49,324 [DEBUG] paramiko.transport: Local version/idstring: SSH-2.0-paramiko_2.12.0
2024-10-03 10:28:49,342 [DEBUG] paramiko.transport: Remote version/idstring: SSH-2.0-OpenSSH_9.2p1 Debian-2+deb12u3
2024-10-03 10:28:49,342 [INFO] paramiko.transport: Connected (version 2.0, client OpenSSH_9.2p1)
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: === Key exchange possibilities ===
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: kex algos: sntrup761x25519-sha512@openssh.com, curve25519-sha256, curve25519-sha256@libssh.org, ecdh-sha2-nistp256, ecdh-sha2-nistp384, ecdh-sha2-nistp521, diffie-hellman-group-exchange-sha256, diffie-hellman-group16-sha512, diffie-hellman-group18-sha512, diffie-hellman-group14-sha256, kex-strict-s-v00@openssh.com
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: server key: rsa-sha2-512, rsa-sha2-256, ecdsa-sha2-nistp256, ssh-ed25519
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: client encrypt: chacha20-poly1305@openssh.com, aes128-ctr, aes192-ctr, aes256-ctr, aes128-gcm@openssh.com, aes256-gcm@openssh.com
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: server encrypt: chacha20-poly1305@openssh.com, aes128-ctr, aes192-ctr, aes256-ctr, aes128-gcm@openssh.com, aes256-gcm@openssh.com
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: client mac: umac-64-etm@openssh.com, umac-128-etm@openssh.com, hmac-sha2-256-etm@openssh.com, hmac-sha2-512-etm@openssh.com, hmac-sha1-etm@openssh.com, umac-64@openssh.com, umac-128@openssh.com, hmac-sha2-256, hmac-sha2-512, hmac-sha1
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: server mac: umac-64-etm@openssh.com, umac-128-etm@openssh.com, hmac-sha2-256-etm@openssh.com, hmac-sha2-512-etm@openssh.com, hmac-sha1-etm@openssh.com, umac-64@openssh.com, umac-128@openssh.com, hmac-sha2-256, hmac-sha2-512, hmac-sha1
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: client compress: none, zlib@openssh.com
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: server compress: none, zlib@openssh.com
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: client lang:
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: server lang:
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: kex follows: False
2024-10-03 10:28:49,344 [DEBUG] paramiko.transport: === Key exchange agreements ===
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: Strict kex mode: True
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: Kex: curve25519-sha256@libssh.org
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: HostKey: ssh-ed25519
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: Cipher: aes128-ctr
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: MAC: hmac-sha2-256
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: Compression: none
2024-10-03 10:28:49,345 [DEBUG] paramiko.transport: === End of kex handshake ===
2024-10-03 10:28:49,353 [DEBUG] paramiko.transport: Resetting outbound seqno after NEWKEYS due to strict mode
2024-10-03 10:28:49,353 [DEBUG] paramiko.transport: kex engine KexCurve25519 specified hash_algo
2024-10-03 10:28:49,353 [DEBUG] paramiko.transport: Switch to new keys ...
2024-10-03 10:28:49,354 [DEBUG] paramiko.transport: Resetting inbound seqno after NEWKEYS due to strict mode
2024-10-03 10:28:49,354 [DEBUG] paramiko.transport: Got EXT_INFO: {'server-sig-algs': b'ssh-ed25519,sk-ssh-ed25519@openssh.com,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521,sk-ecdsa-sha2-nistp256@openssh.com,webauthn-sk-ecdsa-sha2-nistp256@openssh.com,ssh-dss,ssh-rsa,rsa-sha2-256,rsa-sha2-512', 'publickey-hostbound@openssh.com': b'0'}
2024-10-03 10:28:49,354 [DEBUG] paramiko.transport: Adding ssh-ed25519 host key for 192.168.2.48: b'7192918905926e9d095dd73a9382f31c'
2024-10-03 10:28:49,355 [DEBUG] paramiko.transport: Trying SSH key b'd28ad05c05c4f68c22d46a8ac640e82b'
2024-10-03 10:28:49,395 [DEBUG] paramiko.transport: userauth is OK
2024-10-03 10:28:49,395 [DEBUG] paramiko.transport: Finalizing pubkey algorithm for key of type 'ssh-rsa'
2024-10-03 10:28:49,395 [DEBUG] paramiko.transport: Our pubkey algorithm list: ['rsa-sha2-512', 'rsa-sha2-256', 'ssh-rsa']
2024-10-03 10:28:49,395 [DEBUG] paramiko.transport: Server-side algorithm list: ['ssh-ed25519', 'sk-ssh-ed25519@openssh.com', 'ecdsa-sha2-nistp256', 'ecdsa-sha2-nistp384', 'ecdsa-sha2-nistp521', 'sk-ecdsa-sha2-nistp256@openssh.com', 'webauthn-sk-ecdsa-sha2-nistp256@openssh.com', 'ssh-dss', 'ssh-rsa', 'rsa-sha2-256', 'rsa-sha2-512']
2024-10-03 10:28:49,395 [DEBUG] paramiko.transport: Agreed upon 'rsa-sha2-512' pubkey algorithm
2024-10-03 10:28:49,413 [INFO] paramiko.transport: Authentication (publickey) successful!
2024-10-03 10:28:49,413 [INFO] ssh_utils: Successfully connected to 192.168.2.48
2024-10-03 10:28:49,414 [INFO] ssh_utils: SSH connection closed for 192.168.2.48
2024-10-03 10:28:49,454 [DEBUG] paramiko.transport: EOF in transport thread

fabriziosalmi commented 1 week ago

Should be fixed here, can you confirm?

ErrorCode67 commented 1 week ago

Now getting:

2024-10-03 12:31:08,328 [INFO] ssh_utils: Command executed successfully on 192.168.2.46: qm status 2011 --verbose
2024-10-03 12:31:08,329 [WARNING] vm_resource_manager: All parsing methods failed for output: balloon: 2147483648

Here is the output of qm status 2011 --verbose running directly on the host, if that helps:

root@proxmox46:~# qm status 2011 --verbose
balloon: 2147483648
ballooninfo: actual: 2147483648 free_mem: 271781888 last_update: 1727977063 major_page_faults: 4681 max_mem: 2147483648 mem_swapped_in: 618496 mem_swapped_out: 67739648 minor_page_faults: 11769259 total_mem: 2014617600
blockstat: scsi0: account_failed: 1 account_invalid: 1 failed_flush_operations: 0 failed_rd_operations: 0 failed_unmap_operations: 0 failed_wr_operations: 0 failed_zone_append_operations: 0 flush_operations: 938 flush_total_time_ns: 9957766936 idle_time_ns: 11299603622 invalid_flush_operations: 0 invalid_rd_operations: 0 invalid_unmap_operations: 0 invalid_wr_operations: 0 invalid_zone_append_operations: 0 rd_bytes: 1227949568 rd_merged: 0 rd_operations: 26551 rd_total_time_ns: 512402882573 timed_stats: unmap_bytes: 0 unmap_merged: 0 unmap_operations: 0 unmap_total_time_ns: 0 wr_bytes: 627861504 wr_highest_offset: 16984354816 wr_merged: 0 wr_operations: 9377 wr_total_time_ns: 493262597222 zone_append_bytes: 0 zone_append_merged: 0 zone_append_operations: 0 zone_append_total_time_ns: 0
cpus: 1
disk: 0
diskread: 1227949568
diskwrite: 627861504
freemem: 271781888
maxdisk: 17179869184
maxmem: 2147483648
mem: 1742835712
name: DC01-Infra-Jump
netin: 60653761
netout: 439118
nics: tap2011i0: netin: 60653761 netout: 439118
pid: 2459387
proxmox-support: backup-fleecing: 1 backup-max-workers: 1 pbs-dirty-bitmap: 1 pbs-dirty-bitmap-migration: 1 pbs-dirty-bitmap-savevm: 1 pbs-library-version: 1.4.1 (UNKNOWN) pbs-masterkey: 1 query-bitmap-info: 1
qmpstatus: running
running-machine: pc-i440fx-9.0+pve0
running-qemu: 9.0.2
status: running
uptime: 6716
vmid: 2011

ErrorCode67 commented 1 week ago

I may be missing something, but I am not seeing anything from qm status 2011 --verbose that shows CPU load, etc., other than the number of CPUs. Wondering if "qm status" on Proxmox 8.2.7 is different from what it may have been in prior release(s).
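For what it's worth, the gap can be demonstrated offline: the verbose output has a cpus line (the allocated core count) but no per-VM CPU utilisation field at all, so there is nothing for the parser to find. A minimal check, with a few keys abridged from the output posted above:

```python
import re

# A few keys abridged from the `qm status 2011 --verbose` output above.
sample = """balloon: 2147483648
cpus: 1
maxmem: 2147483648
mem: 1742835712
status: running"""

# No standalone "cpu:" utilisation line exists, only the allocated core count.
print(re.search(r"^cpu:", sample, re.MULTILINE))                    # → None
print(re.search(r"^cpus:\s*(\d+)", sample, re.MULTILINE).group(1))  # → 1
```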

ErrorCode67 commented 1 week ago

Think I may have found a solution: use only one call to Proxmox to get the data for ALL VMs, which we can then parse through using jq.

ErrorCode67 commented 1 week ago

blank@vm-autoscale:~$ ssh root@192.168.2.48 "pvesh get /cluster/resources --output-format json" | jq '.[]|select(.vmid==67084)'
{
  "cpu": 1.00684804343619,
  "disk": 0,
  "diskread": 22782740992,
  "diskwrite": 146019705856,
  "id": "qemu/67084",
  "maxcpu": 1,
  "maxdisk": 17179869184,
  "maxmem": 8589934592,
  "mem": 8172527616,
  "name": "DC01-DMZ-Wazuh",
  "netin": 6514637333,
  "netout": 77607329,
  "node": "proxmox48",
  "status": "running",
  "template": 0,
  "type": "qemu",
  "uptime": 39076,
  "vmid": 67084
}

ErrorCode67 commented 1 week ago

output=$(ssh root@192.168.2.48 "pvesh get /cluster/resources --output-format json")

gives VM stats for all the VMs across the cluster, including which node each one currently lives on. This also simplifies the config: you only need one Proxmox host defined (though you could define more for redundancy in case one is down), and you no longer have to define proxmox_host for each virtual_machine.

virtual_machines:
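A rough sketch of how this single-call approach could look in Python (the helper names parse_cluster_resources and get_cluster_vm_stats are hypothetical, not part of the current codebase); the parsing half is split out so it can be exercised without SSH:

```python
import json
import subprocess

def parse_cluster_resources(raw_json):
    """Index the qemu entries from /cluster/resources by vmid.

    Note: 'cpu' is reported as a fraction of the VM's allocated cores
    (0.5 == 50%), and 'node' says which host the VM currently lives on.
    """
    return {r["vmid"]: r for r in json.loads(raw_json) if r.get("type") == "qemu"}

def get_cluster_vm_stats(host):
    """One SSH round trip fetches stats for every VM in the cluster."""
    raw = subprocess.check_output(
        ["ssh", f"root@{host}", "pvesh get /cluster/resources --output-format json"],
        text=True,
    )
    return parse_cluster_resources(raw)

# Exercising the parser with the record posted above:
sample = ('[{"cpu": 1.00684804343619, "maxcpu": 1, "mem": 8172527616, '
          '"maxmem": 8589934592, "node": "proxmox48", "status": "running", '
          '"type": "qemu", "vmid": 67084}]')
stats = parse_cluster_resources(sample)
print(round(stats[67084]["cpu"] * 100, 1))  # → 100.7
```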

ErrorCode67 commented 1 week ago

I like it because it uses one call to Proxmox for all the stats on all the VMs. With the JSON that is returned, we can iterate through the virtual_machines defined in the config, jq the stats from the output for each VM, check whether we need to scale, and if we do, we know which host to do it on.
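The iterate-and-decide loop described above might look something like this sketch (the config shape and the cpu_high/cpu_low thresholds are invented for illustration):

```python
def plan_scaling(cluster_stats, configured_vms, cpu_high=80.0, cpu_low=20.0):
    """Walk the VMs defined in the config, look each one up in the single
    /cluster/resources snapshot, and return (vmid, node, direction) actions."""
    actions = []
    for vm in configured_vms:
        stats = cluster_stats.get(vm["vmid"])
        if stats is None or stats.get("status") != "running":
            continue  # not in the cluster snapshot, or not running
        cpu_pct = stats["cpu"] * 100  # fraction of allocated cores -> percent
        if cpu_pct >= cpu_high:
            actions.append((vm["vmid"], stats["node"], "up"))
        elif cpu_pct <= cpu_low:
            actions.append((vm["vmid"], stats["node"], "down"))
    return actions

# Example with two VMs from this thread:
cluster = {
    67084: {"cpu": 1.006, "node": "proxmox48", "status": "running"},
    2011: {"cpu": 0.01, "node": "proxmox46", "status": "running"},
}
vms = [{"vmid": 67084}, {"vmid": 2011}]
print(plan_scaling(cluster, vms))  # → [(67084, 'proxmox48', 'up'), (2011, 'proxmox46', 'down')]
```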

ErrorCode67 commented 1 week ago

Let me know if you would like me to contribute to the code. I am pretty busy but may be able to help if you want/need.

fabriziosalmi commented 1 week ago

Let me know if you would like me to contribute to the code. I am pretty busy but may be able to help if you want/need.

Yes, of course I need help on that :) At the moment I cannot find free time, maybe next weekend. If you have any time, just send PRs like there's no tomorrow!

ErrorCode67 commented 1 week ago

Thinking about program flow for the different way of pulling data. As I was going through the flow, I also thought of some config additions that seem useful to me for more fine-tuned control. (Nothing quite like scope creep.) program flow.txt config.yaml..txt

In addition, I ran across pressure-stall information tracking: https://lwn.net/Articles/759781/. We have access to that data as well. I am not sure whether it is useful or not; I will have to really load up one of my cluster nodes and see what data I get.
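If PSI does turn out to be useful, the kernel exposes it as plain text under /proc/pressure/{cpu,memory,io} on each host; here is a small speculative parser for that format (field layout per the LWN article, not something the project currently does):

```python
def parse_psi(text):
    """Parse /proc/pressure/{cpu,memory,io} content into nested dicts.

    Each line looks like:
        some avg10=0.00 avg60=0.00 avg300=0.00 total=12345
    """
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result

# Example with made-up values in the kernel's documented format:
sample = ("some avg10=1.50 avg60=0.75 avg300=0.20 total=123456\n"
          "full avg10=0.10 avg60=0.05 avg300=0.01 total=9876")
print(parse_psi(sample)["some"]["avg10"])  # → 1.5
```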

Still need to find some time to start implementing, if you like what I have proposed.

fabriziosalmi commented 1 week ago
import json
import logging
import re
import time

class VMResourceManager:
    def __init__(self, ssh_client, vm_id, config):
        self.ssh_client = ssh_client
        self.vm_id = vm_id
        self.config = config
        self.logger = logging.getLogger("vm_resource_manager")
        self.last_scale_time = 0  # Initialize last scale time for cooldown
        self.scale_cooldown = self.config.get('scale_cooldown', 300)  # Default cooldown 5 minutes

    def is_vm_running(self, retries=3, delay=5):
        """
        Check if the VM is running.
        :return: True if the VM is running, False otherwise.
        """
        for attempt in range(retries):
            try:
                command = f"qm status {self.vm_id} --verbose"
                output = self.ssh_client.execute_command(command)
                if "status: running" in output:
                    return True
                else:
                    self.logger.info(f"VM {self.vm_id} is not running. Skipping scaling operations.")
                    return False
            except Exception as e:
                self.logger.error(f"Failed to get status for VM {self.vm_id} (attempt {attempt + 1}): {str(e)}")
                time.sleep(delay)
        return False

    def get_resource_usage(self):
        """
        Retrieves the CPU and RAM usage of the VM using Proxmox host metrics.
        :return: Tuple of (cpu_usage, ram_usage) as percentages.
        """
        try:
            if not self.is_vm_running():
                return 0.0, 0.0  # Return zero usage if VM is not running

            # Get the current status of the VM
            command = f"qm status {self.vm_id} --verbose"
            output = self.ssh_client.execute_command(command)

            # Log the output for debugging purposes
            self.logger.debug(f"Raw output from 'qm status {self.vm_id} --verbose':\n{output}")

            # Parse RAM and CPU usage from the output
            ram_usage = self._parse_ram_usage(output)
            cpu_usage = self._parse_cpu_usage(output)

            return cpu_usage, ram_usage

        except Exception as e:
            self.logger.error(f"Failed to get resource usage for VM {self.vm_id}: {str(e)}")
            raise

    def can_scale(self):
        """
        Determines if scaling operations can be performed based on cooldown and host resource availability.
        :return: True if scaling is allowed, False otherwise.
        """
        current_time = time.time()
        if (current_time - self.last_scale_time) < self.scale_cooldown:
            self.logger.info(f"Scaling operations are on cooldown. Next scaling allowed after {int(self.scale_cooldown - (current_time - self.last_scale_time))} seconds.")
            return False

        # Additional host resource checks can be integrated here if necessary
        return True

    def scale_cpu(self, direction):
        """
        Scales the virtual CPUs (vcpus) for the VM up or down based on the given direction.
        :param direction: 'up' to increase vcpus, 'down' to decrease vcpus
        """
        if not self.can_scale():
            return

        try:
            max_cores = self._get_max_cores()
            min_cores = self._get_min_cores()
            current_cores = int(self._get_current_cores())
            current_vcpus = self._get_current_vcpus()

            # Log the scaling decision
            self.logger.info(f"Current Cores: {current_cores}, Current vCPUs: {current_vcpus}, Max Cores: {max_cores}, Min Cores: {min_cores}")

            # Scaling up
            if direction == 'up' and current_cores < max_cores:
                new_cores = current_cores + 1
                self.logger.info(f"Scaling up cores from {current_cores} to {new_cores}")
                self._set_max_cores(new_cores)

                new_vcpus = min(current_vcpus + 1, new_cores)
                if new_vcpus > current_vcpus:
                    self.logger.info(f"Scaling up vCPUs from {current_vcpus} to {new_vcpus}")
                    self._set_vcpus(new_vcpus)

            # Scaling down
            elif direction == 'down' and current_cores > min_cores:
                new_vcpus = max(current_vcpus - 1, 1)
                if new_vcpus < current_vcpus:
                    self.logger.info(f"Scaling down vCPUs from {current_vcpus} to {new_vcpus}")
                    self._set_vcpus(new_vcpus)

                new_cores = current_cores - 1
                self.logger.info(f"Scaling down cores from {current_cores} to {new_cores}")
                self._set_max_cores(new_cores)

            else:
                self.logger.info(f"No scaling action required for CPU direction '{direction}'.")

            # Update last scale time after successful scaling
            self.last_scale_time = time.time()

        except Exception as e:
            self.logger.error(f"Failed to scale CPU for VM {self.vm_id}: {str(e)}")
            # Optionally implement rollback or alerting here
            raise

    def scale_ram(self, direction):
        """
        Scales the RAM for the VM up or down based on the given direction.
        :param direction: 'up' to increase RAM, 'down' to decrease RAM
        """
        if not self.can_scale():
            return

        try:
            current_ram = int(self._get_current_ram())

            if not self._is_memory_hotplug_enabled():
                self.logger.error(f"Memory hotplug is not enabled for VM {self.vm_id}. Skipping RAM scaling.")
                return

            if direction == 'up':
                new_ram = min(current_ram + 512, self._get_max_ram())
                if new_ram > current_ram:
                    self.logger.info(f"Scaling up RAM from {current_ram} MB to {new_ram} MB")
                    if self._try_set_ram(new_ram):
                        self.logger.info(f"VM {self.vm_id} RAM scaled up to {new_ram} MB")
                    else:
                        self.logger.error(f"Failed to scale up RAM for VM {self.vm_id}")
                        return

            elif direction == 'down':
                new_ram = max(current_ram - 512, self._get_min_ram())
                if new_ram < current_ram:
                    self.logger.info(f"Scaling down RAM from {current_ram} MB to {new_ram} MB")
                    if self._try_set_ram(new_ram):
                        self.logger.info(f"VM {self.vm_id} RAM scaled down to {new_ram} MB")
                    else:
                        self.logger.error(f"Failed to scale down RAM for VM {self.vm_id}")
                        return

            else:
                self.logger.warning(f"Unknown scaling direction '{direction}' for RAM.")
                return

            # Update last scale time after successful scaling
            self.last_scale_time = time.time()

        except Exception as e:
            self.logger.error(f"Failed to scale RAM for VM {self.vm_id}: {str(e)}")
            # Optionally implement rollback or alerting here
            raise

    def _try_set_ram(self, ram):
        """
        Tries to set the RAM for the VM and handles hotplug issues with retries.
        :param ram: RAM value in MB to set
        :return: True if successful, False otherwise
        """
        retries = 3
        delay = 10  # seconds
        for attempt in range(1, retries + 1):
            try:
                command = f"qm set {self.vm_id} -memory {ram}"
                self.logger.debug(f"Executing command to set RAM: {command}")
                self.ssh_client.execute_command(command)
                self.logger.info(f"Successfully set RAM to {ram} MB for VM {self.vm_id}")
                return True
            except Exception as e:
                self.logger.error(f"Attempt {attempt}: Failed to set RAM for VM {self.vm_id}: {str(e)}")
                if attempt < retries:
                    self.logger.info(f"Retrying in {delay} seconds...")
                    time.sleep(delay)
                else:
                    self.logger.error(f"All attempts to set RAM for VM {self.vm_id} have failed.")
        return False

    def _is_memory_hotplug_enabled(self):
        """
        Checks if the memory hotplug feature is enabled for the VM.
        :return: True if memory hotplug is enabled, False otherwise
        """
        try:
            command = f"qm config {self.vm_id}"
            output = self.ssh_client.execute_command(command)
            # Proxmox reports enabled hotplug features as a comma-separated list,
            # e.g. "hotplug: disk,network,usb,memory". The shorthand "hotplug: 1"
            # enables only the default set (network,disk,usb), which does NOT
            # include memory, so memory must be listed explicitly.
            hotplug_match = re.search(r"^hotplug:\s*(\S+)", output, re.MULTILINE)
            is_enabled = bool(hotplug_match) and 'memory' in hotplug_match.group(1).split(',')
            self.logger.debug(f"Memory hotplug enabled for VM {self.vm_id}: {is_enabled}")
            return is_enabled
        except Exception as e:
            self.logger.error(f"Failed to check hotplug status for VM {self.vm_id}: {str(e)}")
            return False

    def _parse_cpu_usage(self, output):
        """
        Retrieves and parses CPU usage information for a given VM from the Proxmox host.
        Attempts each parsing strategy in sequence until a valid result is obtained.

        :param output: Output from the `qm status` command.
        :return: CPU usage as a percentage or 0.0 if parsing fails.
        """
        import json

        # Attempt to parse JSON if the output is in JSON format
        try:
            data = json.loads(output)
            if "cpu" in data:
                cpu_usage = float(data["cpu"]) * 100  # Assuming the CPU usage is given as a fraction
                if 0 <= cpu_usage <= 100:
                    self.logger.debug(f"Parsed CPU usage from JSON: {cpu_usage}%")
                    return cpu_usage
                else:
                    self.logger.warning(f"Invalid CPU usage detected in JSON: {cpu_usage}%.")
        except json.JSONDecodeError:
            self.logger.debug("Output is not in JSON format, attempting regex parsing.")
        except (KeyError, ValueError) as e:
            self.logger.error(f"Error parsing JSON output: {str(e)}")

        # List of possible regex patterns for parsing CPU usage
        parsing_patterns = [
            r"cpu:\s*(\d+\.\d+|\d+)%",  # Match CPU usage with percentage
            r"CPU usage:\s*(\d+\.\d+|\d+)%",  # Match another CPU usage format with percentage
            r"CPU:\s*(\d+\.\d+|\d+)",  # Match CPU usage without percentage
            r"cpu:\s*(\d+\.\d+|\d+)",  # Match another format without percentage
            r"\"cpu\":\s*(\d+\.\d+|\d+)"  # Match JSON-like cpu field
        ]

        # Attempt to parse using each pattern
        for pattern in parsing_patterns:
            try:
                cpu_match = re.search(pattern, output)
                if cpu_match:
                    cpu_usage = float(cpu_match.group(1))
                    if 0 <= cpu_usage <= 100:
                        self.logger.debug(f"Parsed CPU usage using pattern '{pattern}': {cpu_usage}%")
                        return cpu_usage
                    else:
                        self.logger.warning(f"Invalid CPU usage detected: {cpu_usage}%.")
            except Exception as e:
                self.logger.error(f"Error parsing CPU usage with pattern '{pattern}': {str(e)}")

        # Fallback: accept any percentage-looking value as a last resort.
        # Requiring the trailing "%" avoids latching onto unrelated numbers
        # such as byte counters (e.g. "balloon: 4294967296") in the output.
        try:
            generic_match = re.search(r"(\d+\.\d+|\d+)%", output)
            if generic_match:
                cpu_usage = float(generic_match.group(1))
                if 0 <= cpu_usage <= 100:
                    self.logger.debug(f"Parsed CPU usage using fallback pattern: {cpu_usage}%")
                    return cpu_usage
                else:
                    self.logger.warning(f"Invalid CPU usage detected in fallback parsing: {cpu_usage}%.")
        except Exception as e:
            self.logger.error(f"Error during fallback parsing: {str(e)}")

        # Log failure and return 0.0 if all methods fail
        self.logger.warning(f"All parsing methods failed for output: {output}")
        return 0.0  # Indicate failure with 0.0 as a default value
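# The warning in the issue can be reproduced standalone: the verbose output
# contains a "cpus:" (core-count) line but no "cpu:" usage line, so pattern
# matching finds nothing and the method falls through to 0.0, which is why the
# VM gets scaled down to the minimum. A minimal sketch, using a trimmed sample
# based on the output pasted in the issue and line-anchored variants of the
# patterns above:

```python
import re

# Trimmed sample of `qm status --verbose` output from the issue report:
# note there is a "cpus:" line (core count) but no "cpu:" (usage) line.
sample = """balloon: 4294967296
cpus: 1
maxmem: 4294967296
mem: 3881766912
status: running"""

# Line-anchored CPU-usage patterns, similar to those used in _parse_cpu_usage.
patterns = [
    r"^cpu:\s*(\d+\.\d+|\d+)%",
    r"^CPU usage:\s*(\d+\.\d+|\d+)%",
    r"^cpu:\s*(\d+\.\d+|\d+)",
]

# No pattern matches, so parsing yields nothing and the caller sees 0.0.
print(all(re.search(p, sample, re.MULTILINE) is None for p in patterns))  # True
```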

    def _parse_ram_usage(self, output):
        """
        Parses RAM usage information from the command output.
        :param output: Output from the `qm status --verbose` command.
        :return: RAM usage as a percentage of max memory.
        """
        try:
            # Anchor to the start of a line: an unanchored "mem: " also matches
            # inside "free_mem:" (and "maxmem:"), silently reporting the free
            # memory counter instead of the memory actually in use.
            maxmem_match = re.search(r"^maxmem:\s*(\d+)", output, re.MULTILINE)
            mem_match = re.search(r"^mem:\s*(\d+)", output, re.MULTILINE)
            if maxmem_match and mem_match:
                maxmem = int(maxmem_match.group(1))
                mem = int(mem_match.group(1))
                ram_usage = (mem / maxmem) * 100 if maxmem else 0.0
                self.logger.debug(f"Parsed RAM usage for VM {self.vm_id}: {ram_usage:.2f}%")
                return ram_usage
            else:
                self.logger.error(f"Could not parse RAM usage information from output: {output}")
                return 0.0
        except Exception as e:
            self.logger.error(f"Error parsing RAM usage: {str(e)}")
            raise
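# The 6.5% RAM figure in the issue's log comes from a substring pitfall worth
# illustrating: an unanchored search for "mem: " matches inside "free_mem:"
# first, so the free-memory counter is divided by maxmem instead of the memory
# in use. A standalone sketch using the figures pasted in the issue:

```python
import re

# Fields copied from the issue's `qm status 67060 --verbose` output.
output = """balloon: 4294967296
free_mem: 280334336
maxmem: 4294967296
mem: 3881766912"""

# Unanchored search grabs the digits inside "free_mem:" first.
loose = re.search(r"mem: (\d+)", output)
print(loose.group(1))   # 280334336 -> 280334336/4294967296 = the ~6.5% logged

# Anchoring to the line start finds the real "mem:" field.
strict = re.search(r"^mem:\s*(\d+)", output, re.MULTILINE)
print(strict.group(1))  # 3881766912 -> ~90.4% of maxmem actually in use
```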

    def _get_current_vcpus(self):
        """
        Retrieves the current number of vCPUs allocated to the VM.
        :return: Current vCPU count
        """
        try:
            command = f"qm config {self.vm_id}"
            output = self.ssh_client.execute_command(command)

            # Log output for debugging
            self.logger.debug(f"Raw output from 'qm config {self.vm_id}':\n{output}")

            # Try to find the vcpus setting first
            data = re.search(r"^vcpus:\s*(\d+)", output, re.MULTILINE)
            if data:
                vcpus = int(data.group(1))
                self.logger.debug(f"Current vCPUs for VM {self.vm_id}: {vcpus}")
                return vcpus

            # Fallback to using cores if vcpus is not explicitly defined
            self.logger.warning(f"'vcpus' not found for VM {self.vm_id}. Falling back to 'cores' value.")
            return int(self._get_current_cores())

        except Exception as e:
            self.logger.error(f"Failed to get current vCPUs for VM {self.vm_id}: {str(e)}")
            raise

    def _get_current_cores(self):
        """
        Retrieves the current number of CPU cores allocated to the VM.
        :return: Current core count
        """
        try:
            command = f"qm config {self.vm_id}"
            output = self.ssh_client.execute_command(command)

            # Log output for debugging
            self.logger.debug(f"Raw output from 'qm config {self.vm_id}':\n{output}")

            data = re.search(r"^cores:\s*(\d+)", output, re.MULTILINE)
            if data:
                cores = int(data.group(1))
                self.logger.debug(f"Current CPU cores for VM {self.vm_id}: {cores}")
                return cores
            else:
                raise ValueError(f"Could not determine CPU cores for VM {self.vm_id}")
        except Exception as e:
            self.logger.error(f"Failed to get current cores for VM {self.vm_id}: {str(e)}")
            raise

    def _set_vcpus(self, vcpus):
        """
        Sets the number of vCPUs for the VM.
        :param vcpus: Number of vCPUs to set
        """
        try:
            # Note: Proxmox only accepts a vcpus value up to the configured
            # core count, and applying it live requires CPU hotplug support.
            command = f"qm set {self.vm_id} -vcpus {vcpus}"
            self.logger.debug(f"Executing command to set vCPUs: {command}")
            self.ssh_client.execute_command(command)
            self.logger.info(f"Successfully set vCPUs to {vcpus} for VM {self.vm_id}")
        except Exception as e:
            self.logger.error(f"Failed to set vCPUs for VM {self.vm_id}: {str(e)}")
            raise

    def _set_max_cores(self, cores):
        """
        Sets the maximum number of CPU cores for the VM.
        :param cores: Maximum number of CPU cores
        """
        try:
            command = f"qm set {self.vm_id} -cores {cores}"
            self.logger.debug(f"Executing command to set cores: {command}")
            self.ssh_client.execute_command(command)
            self.logger.info(f"Successfully set cores to {cores} for VM {self.vm_id}")
        except Exception as e:
            self.logger.error(f"Failed to set cores for VM {self.vm_id}: {str(e)}")
            raise

    def _get_max_cores(self):
        """
        Retrieves the maximum number of CPU cores allowed for scaling.
        :return: Max core count from the configuration.
        """
        try:
            max_cores = self.config['scaling_limits']['max_cores']
            self.logger.debug(f"Max cores from config for VM {self.vm_id}: {max_cores}")
            return max_cores
        except KeyError:
            self.logger.error("Missing 'max_cores' in scaling_limits configuration.")
            raise

    def _get_min_cores(self):
        """
        Retrieves the minimum number of CPU cores allowed for scaling.
        :return: Min core count from the configuration.
        """
        try:
            min_cores = self.config['scaling_limits']['min_cores']
            self.logger.debug(f"Min cores from config for VM {self.vm_id}: {min_cores}")
            return min_cores
        except KeyError:
            self.logger.error("Missing 'min_cores' in scaling_limits configuration.")
            raise

    def _get_current_ram(self):
        """
        Retrieves the current RAM allocated to the VM in MB.
        :return: Current RAM allocation in MB
        """
        try:
            command = f"qm config {self.vm_id}"
            output = self.ssh_client.execute_command(command)
            data = re.search(r"^memory:\s*(\d+)", output, re.MULTILINE)
            if data:
                current_ram = int(data.group(1))
                self.logger.debug(f"Current RAM for VM {self.vm_id}: {current_ram} MB")
                return current_ram
            else:
                raise ValueError(f"Could not determine RAM for VM {self.vm_id}")
        except Exception as e:
            self.logger.error(f"Failed to get current RAM for VM {self.vm_id}: {str(e)}")
            raise

    def _get_max_ram(self):
        """
        Retrieves the maximum RAM allowed for scaling (in MB).
        :return: Max RAM in MB from the configuration.
        """
        try:
            max_ram = self.config['scaling_limits']['max_ram_mb']
            self.logger.debug(f"Max RAM from config for VM {self.vm_id}: {max_ram} MB")
            return max_ram
        except KeyError:
            self.logger.error("Missing 'max_ram_mb' in scaling_limits configuration.")
            raise

    def _get_min_ram(self):
        """
        Retrieves the minimum RAM allowed for scaling (in MB).
        :return: Min RAM in MB from the configuration.
        """
        try:
            min_ram = self.config['scaling_limits']['min_ram_mb']
            self.logger.debug(f"Min RAM from config for VM {self.vm_id}: {min_ram} MB")
            return min_ram
        except KeyError:
            self.logger.error("Missing 'min_ram_mb' in scaling_limits configuration.")
            raise
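# The four getters above read min/max bounds from a `scaling_limits` mapping in
# the loaded config. A minimal illustrative fragment (assuming the repo's YAML
# config shape; the key names are taken from the code, the values are examples
# only):

```yaml
scaling_limits:
  min_cores: 1
  max_cores: 4
  min_ram_mb: 1024
  max_ram_mb: 8192
```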

    def _try_set_ram(self, ram):
        """
        Tries to set the RAM for the VM and handles hotplug issues with retries.
        :param ram: RAM value in MB to set
        :return: True if successful, False otherwise
        """
        retries = 3
        delay = 10  # seconds
        for attempt in range(1, retries + 1):
            try:
                command = f"qm set {self.vm_id} -memory {ram}"
                self.logger.debug(f"Executing command to set RAM: {command}")
                self.ssh_client.execute_command(command)
                self.logger.info(f"Successfully set RAM to {ram} MB for VM {self.vm_id}")
                return True
            except Exception as e:
                self.logger.error(f"Attempt {attempt}: Failed to set RAM for VM {self.vm_id}: {str(e)}")
                if attempt < retries:
                    self.logger.info(f"Retrying in {delay} seconds...")
                    time.sleep(delay)
                else:
                    self.logger.error(f"All attempts to set RAM for VM {self.vm_id} have failed.")
        return False