cvk98 / Proxmox-load-balancer

Designed to constantly maintain the Proxmox cluster in balance
GNU General Public License v3.0

Logger shows "Need to balance: True" but nothing happens #1

Closed: mattv8 closed this issue 2 years ago

mattv8 commented 2 years ago

Is this expected behavior?

INFO | START ***Load-balancer!***
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update
INFO | Need to balance: True
INFO | Number of options = 1
INFO | Waiting 10 seconds for cluster information update

I have two nodes, which are already nearly balanced, so that could be the reason. See my screenshot below: [screenshot of the nearly balanced nodes]

cvk98 commented 2 years ago

It depends on many factors. It may turn out that you have one option for migration, but the VM has a CD-ROM connected or its HDD is located on the node's local storage. In that case the balancer finds one option to improve the situation but cannot carry it out at the stage of checking whether migration is possible. The output in DEBUG mode can tell you more about what is happening.
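A minimal sketch of that kind of precondition check, assuming a plain requests client against the standard Proxmox API (the host, credentials, node, and VMID below are placeholders; this is not the balancer's own code):

import requests

HOST = "https://pve1.example.com:8006"  # hypothetical host
NODE, VMID = "PVE2", 202

# Authenticate: POST /access/ticket returns a ticket cookie (and a CSRF
# token, which only write requests need).
auth = requests.post(f"{HOST}/api2/json/access/ticket",
                     data={"username": "root@pam", "password": "secret"},
                     verify=False).json()["data"]
cookies = {"PVEAuthCookie": auth["ticket"]}

# GET .../qemu/{vmid}/migrate reports migration preconditions, including
# any local disks (e.g. a mounted ISO) or other node-local resources.
pre = requests.get(f"{HOST}/api2/json/nodes/{NODE}/qemu/{VMID}/migrate",
                   cookies=cookies, verify=False).json()["data"]

if pre.get("local_disks") or pre.get("local_resources"):
    print("Migration blocked by local resources:",
          pre.get("local_disks"), pre.get("local_resources"))
else:
    print("No blocking local resources found.")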

cvk98 commented 2 years ago

I have added the requirement of common storage for all nodes to the readme.

mattv8 commented 2 years ago

Sorry for the delay. I do have common storage between all nodes; in fact, the nodes are identical: the same number of CPUs, the same RAM, and the same storage. However, something strange is still happening. The algorithm sees that it needs to balance and finds an option, but the migration never happens and the algorithm gets stuck in an infinite loop:

root@PVE1:~# python3 ~/Proxmox-load-balancer/plb.py
INFO | START Load-balancer!
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 1
DEBUG | Starting vm_migration
DEBUG | VM:202 migration from PVE2 to "recipient"
DEBUG | The VM:202 has [{'is_tpmstate': 0, 'replicate': 1, 'cdrom': 0, 'volid': 'shared-zfs:vm-202-disk-1', 'drivename': 'efidisk0', 'is_unused': 0, 'is_vmstate': 0, 'size': 1048576, 'referenced_in_config': 1, 'shared': 0}, {'shared': 0, 'referenced_in_config': 1, 'size': 4194304, 'is_unused': 0, 'drivename': 'tpmstate0', 'is_vmstate': 0, 'volid': 'shared-zfs:vm-202-disk-2', 'cdrom': 0, 'is_tpmstate': 1, 'replicate': 1}]
INFO | Waiting 10 seconds for cluster information update
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True
DEBUG | Running temporary_dict
DEBUG | Starting calculating
INFO | Number of options = 0
DEBUG | Authorization attempt...
DEBUG | Successful authentication. Response code: 200
DEBUG | init when creating a Cluster object
DEBUG | Starting Cluster.cluster_name
DEBUG | Information about the cluster name has been received. Response code: 200
DEBUG | Launching Cluster.cluster_items
DEBUG | Attempt to get information about the cluster...
DEBUG | Information about the cluster has been received. Response code: 200
DEBUG | Launching Cluster.cluster_hosts
DEBUG | Launching Cluster.cluster_vms
DEBUG | Launching Cluster.cluster_membership
DEBUG | Launching Cluster.cluster_cpu
DEBUG | Starting cluster_load_verification
DEBUG | Starting need_to_balance_checking
INFO | Need to balance: True

What do you think is holding it up? This is Proxmox Virtual Environment 7.2-3 with the latest pull from this repo.

cvk98 commented 2 years ago

In theory:

  1. The script decides that the cluster is unbalanced.
  2. It goes through all the migration options and finds one that would improve the situation.
  3. It tries to migrate the selected VM but cannot, due to local VM resources: "The VM:202 has..."
  4. It decides again that the cluster is not balanced (for some reason it no longer selects VM:202).
  5. BUT: any remaining migration would increase sum_of_deviations, so sorted_variants ends up empty.

At this point another algorithm would have to kick in and choose a bad (but not critical) option; after that, the balancer would resume working in its normal mode. [screenshot] Such a cluster cannot be balanced with improvements alone: we have to make things temporarily worse so that new options open up. It's not difficult to implement, but I have nowhere to test it. Maybe I'll add this as an option.
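A hypothetical sketch of that fallback (the tuple layout, helper name, and threshold below are assumptions for illustration; only sum_of_deviations and sorted_variants are names from the actual script):

# Hypothetical fallback: when no migration improves the balance, pick the
# least-bad move, as long as the extra imbalance stays under a threshold.
# `variants` is assumed to be (vm, source, target, new_deviation) tuples;
# `current_deviation` is the cluster's present sum_of_deviations.

def choose_variant(variants, current_deviation, max_worsening=0.05):
    # Preferred path: options that actually reduce the total deviation
    # (roughly what sorted_variants holds in the script).
    improving = [v for v in variants if v[3] < current_deviation]
    if improving:
        return min(improving, key=lambda v: v[3])

    # Fallback: a "bad but not critical" move that may unlock better
    # options on the next pass, e.g. by freeing room on a node.
    tolerable = [v for v in variants
                 if v[3] - current_deviation <= max_worsening]
    if tolerable:
        return min(tolerable, key=lambda v: v[3])

    return None  # nothing acceptable; leave the cluster as it is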

mattv8 commented 2 years ago

Aha! Interesting, thanks for the explanation. I am sure this is somewhat difficult to test and implement, since you must iteratively migrate and check, and migration takes time and compute resources.

I will look more into the algorithm when I have time to see if I can contribute. For now, I need to figure out why the API isn't starting the migration when the script hits the vm_migration() function. It's as if the API call isn't responding properly.

cvk98 commented 2 years ago

pvesh get /nodes/PVE2/qemu/202/migrate - shows the local resources that prevent migration.
pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1 - the CLI analog of the HTTP request that the script makes.
If this command does not start the migration, then the script will not be able to do it either.
Using this link, you can view the migration options and change them in the script to suit your needs: https://pve.proxmox.com/pve-docs/api-viewer/#/nodes/{node}/qemu/{vmid}/migrate
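For reference, a sketch of that same request over HTTP in Python, assuming the standard Proxmox ticket-based authentication (host and credentials are placeholders, not values from this thread):

import requests

HOST = "https://pve1.example.com:8006"  # hypothetical host

# Authenticate: the ticket becomes a cookie, and write requests must also
# send the CSRF prevention token as a header.
auth = requests.post(f"{HOST}/api2/json/access/ticket",
                     data={"username": "root@pam", "password": "secret"},
                     verify=False).json()["data"]

# Equivalent of: pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1
resp = requests.post(f"{HOST}/api2/json/nodes/PVE2/qemu/200/migrate",
                     data={"target": "PVE1", "online": 1},
                     cookies={"PVEAuthCookie": auth["ticket"]},
                     headers={"CSRFPreventionToken": auth["CSRFPreventionToken"]},
                     verify=False)

# On success the API returns a task ID (UPID) that can be polled for status.
print(resp.status_code, resp.json().get("data"))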

cvk98 commented 2 years ago

Changes will need to be made in this block: [screenshot]

cvk98 commented 2 years ago

I hope I was able to help you.

mattv8 commented 2 years ago

Thank you, yes, very helpful! Fine to close this as it is not an issue. I'm still testing in my environment; I'll report back if I have any more issues.