Closed (mattv8 closed this issue 2 years ago)
It depends on many factors. The balancer may find one migration option, but the VM may have a CD-ROM attached, or its disk may be located on the node's local storage. In that case the balancer finds one option that would improve the situation but cannot carry it out, because the option fails the check of whether migration is possible. The output in "DEBUG" mode can tell more about what is happening.
In the readme, I added the requirement of a common storage for all nodes.
Sorry for the delay. I do have common storage between all nodes. In fact, they are all identical: same number of CPUs, same RAM and same storage. However, something strange is still happening. The algorithm sees that it needs to balance and finds an option, but the migration doesn't end up happening and the algorithm gets stuck in an infinite loop:
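A minimal sketch of such a pre-migration check, assuming the response of `GET /nodes/{node}/qemu/{vmid}/migrate` has already been parsed into a dict with `local_disks` and `local_resources` lists. The function name `migration_blockers` and the sample data are hypothetical and are not taken from plb.py.

```python
from typing import List


def migration_blockers(precheck: dict) -> List[str]:
    """Return human-readable reasons why a VM cannot be migrated.

    `precheck` is assumed to be the parsed "data" object of the migrate
    precondition call; only `local_disks` and `local_resources` are read.
    """
    reasons = []
    for disk in precheck.get("local_disks", []):
        # A disk entry flagged as cdrom is an attached ISO, not a hard disk.
        kind = "CD-ROM" if disk.get("cdrom") else "disk"
        reasons.append(f"local {kind}: {disk.get('volid', '?')}")
    for resource in precheck.get("local_resources", []):
        reasons.append(f"local resource: {resource}")
    return reasons


# Hypothetical sample: one attached ISO and one disk on local storage.
sample = {
    "local_disks": [
        {"volid": "local:iso/installer.iso", "cdrom": 1},
        {"volid": "local-lvm:vm-100-disk-0", "cdrom": 0},
    ],
    "local_resources": [],
}
print(migration_blockers(sample))
```

An empty list would mean that nothing in the response blocks the migration; any entry is a reason for the balancer to skip that option.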
    root@PVE1:~# python3 ~/Proxmox-load-balancer/plb.py
    INFO | START Load-balancer!
    DEBUG | Authorization attempt...
    DEBUG | Successful authentication. Response code: 200
    DEBUG | init when creating a Cluster object
    DEBUG | Starting Cluster.cluster_name
    DEBUG | Information about the cluster name has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_items
    DEBUG | Attempt to get information about the cluster...
    DEBUG | Information about the cluster has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_hosts
    DEBUG | Launching Cluster.cluster_vms
    DEBUG | Launching Cluster.cluster_membership
    DEBUG | Launching Cluster.cluster_cpu
    DEBUG | Starting cluster_load_verification
    DEBUG | Starting need_to_balance_checking
    INFO | Need to balance: True
    DEBUG | Running temporary_dict
    DEBUG | Starting calculating
    INFO | Number of options = 1
    DEBUG | Starting vm_migration
    DEBUG | VM:202 migration from PVE2 to "recipient"
    DEBUG | The VM:202 has [{'is_tpmstate': 0, 'replicate': 1, 'cdrom': 0, 'volid': 'shared-zfs:vm-202-disk-1', 'drivename': 'efidisk0', 'is_unused': 0, 'is_vmstate': 0, 'size': 1048576, 'referenced_in_config': 1, 'shared': 0}, {'shared': 0, 'referenced_in_config': 1, 'size': 4194304, 'is_unused': 0, 'drivename': 'tpmstate0', 'is_vmstate': 0, 'volid': 'shared-zfs:vm-202-disk-2', 'cdrom': 0, 'is_tpmstate': 1, 'replicate': 1}]
    INFO | Waiting 10 seconds for cluster information update
    DEBUG | Authorization attempt...
    DEBUG | Successful authentication. Response code: 200
    DEBUG | init when creating a Cluster object
    DEBUG | Starting Cluster.cluster_name
    DEBUG | Information about the cluster name has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_items
    DEBUG | Attempt to get information about the cluster...
    DEBUG | Information about the cluster has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_hosts
    DEBUG | Launching Cluster.cluster_vms
    DEBUG | Launching Cluster.cluster_membership
    DEBUG | Launching Cluster.cluster_cpu
    DEBUG | Starting cluster_load_verification
    DEBUG | Starting need_to_balance_checking
    INFO | Need to balance: True
    DEBUG | Running temporary_dict
    DEBUG | Starting calculating
    INFO | Number of options = 0
    DEBUG | Authorization attempt...
    DEBUG | Successful authentication. Response code: 200
    DEBUG | init when creating a Cluster object
    DEBUG | Starting Cluster.cluster_name
    DEBUG | Information about the cluster name has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_items
    DEBUG | Attempt to get information about the cluster...
    DEBUG | Information about the cluster has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_hosts
    DEBUG | Launching Cluster.cluster_vms
    DEBUG | Launching Cluster.cluster_membership
    DEBUG | Launching Cluster.cluster_cpu
    DEBUG | Starting cluster_load_verification
    DEBUG | Starting need_to_balance_checking
    INFO | Need to balance: True
    DEBUG | Running temporary_dict
    DEBUG | Starting calculating
    INFO | Number of options = 0
    DEBUG | Authorization attempt...
    DEBUG | Successful authentication. Response code: 200
    DEBUG | init when creating a Cluster object
    DEBUG | Starting Cluster.cluster_name
    DEBUG | Information about the cluster name has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_items
    DEBUG | Attempt to get information about the cluster...
    DEBUG | Information about the cluster has been received. Response code: 200
    DEBUG | Launching Cluster.cluster_hosts
    DEBUG | Launching Cluster.cluster_vms
    DEBUG | Launching Cluster.cluster_membership
    DEBUG | Launching Cluster.cluster_cpu
    DEBUG | Starting cluster_load_verification
    DEBUG | Starting need_to_balance_checking
    INFO | Need to balance: True
What do you think is holding it up? This is Proxmox Virtual Environment 7.2-3 with the latest pull from this repo.
In theory:
Such a cluster cannot be balanced through improvements alone, because no single migration makes the balance better. To get out of this state, another algorithm has to be included that chooses a bad (but not critical) option, making the balance temporarily worse so that new options open up. After that, the balancer will start working in its usual mode again.
It's not difficult to implement, but I have nowhere to test it. Maybe I'll add this as an option.
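The idea could be sketched roughly as follows. This is only an illustration under stated assumptions, not the algorithm used by plb.py: node load is a single percentage, balance is measured as the gap between the most and least loaded node, and the names `spread`, `pick_migration` and the `critical` threshold are all hypothetical.

```python
def spread(loads):
    """Gap between the most and least loaded node, in percent."""
    return max(loads.values()) - min(loads.values())


def pick_migration(loads, vms, critical=90):
    """Pick one migration, preferring improvements but allowing a fallback.

    loads: dict of node name -> load in percent
    vms:   list of (vmid, source node, load in percent)
    Returns (vmid, source, target, improves) or None if every move would
    push the target node to the critical threshold.
    """
    candidates = []
    for vmid, src, vm_load in vms:
        for dst in loads:
            if dst == src:
                continue
            after = dict(loads)
            after[src] -= vm_load
            after[dst] += vm_load
            if after[dst] >= critical:
                continue  # a "bad" option is acceptable, a critical one is not
            candidates.append((spread(after), vmid, src, dst))
    if not candidates:
        return None
    # The smallest resulting spread wins; it may still be worse than now.
    new_spread, vmid, src, dst = min(candidates)
    return vmid, src, dst, new_spread < spread(loads)


# Two nearly balanced nodes where every move makes the balance worse.
loads = {"PVE1": 30, "PVE2": 50}
vms = [(202, "PVE2", 30), (203, "PVE1", 30)]
print(pick_migration(loads, vms))
```

In this example no move improves the balance, so the least harmful one is returned with `improves` set to `False`. A real implementation would also need to avoid immediately undoing that move on the next pass.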
Ah ha! Interesting, thanks for the explanation. I am sure this is somewhat difficult to test and implement since you must iteratively migrate and check, and migration takes time and compute resources.
I will look more into the algorithm when I have time to see if I can contribute. For now, I need to see why the API isn't starting the migration when it reaches the `vm_migration()` function. It looks like the API call isn't responding properly.
`pvesh get /nodes/PVE2/qemu/202/migrate` shows the local resources that prevent migration.
`pvesh create /nodes/PVE2/qemu/200/migrate --target PVE1 --online 1` is the CLI analog of the HTTP request that the script makes.
If this command does not start the migration, then the script will not be able to do it either.
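For comparison, here is a rough sketch of what the equivalent HTTP request might look like when built with Python's standard library. It only constructs the request and does not send it. The host name, ticket and token values are placeholders, `build_migrate_request` is a hypothetical helper, and how plb.py actually builds its request is not shown in this thread.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_migrate_request(base_url, node, vmid, target, ticket, csrf_token, online=True):
    """Build (but do not send) a POST to the qemu migrate endpoint."""
    url = f"{base_url}/api2/json/nodes/{node}/qemu/{vmid}/migrate"
    # Same parameters as the pvesh command: --target PVE1 --online 1
    body = urlencode({"target": target, "online": int(online)}).encode()
    headers = {
        "Cookie": f"PVEAuthCookie={ticket}",  # ticket-based authentication
        "CSRFPreventionToken": csrf_token,    # needed for write requests with a ticket
    }
    return Request(url, data=body, headers=headers, method="POST")


# Placeholder host and credentials, for illustration only.
req = build_migrate_request(
    "https://pve1.example:8006", "PVE2", 200, "PVE1", "TICKET", "TOKEN"
)
print(req.get_method(), req.full_url, req.data.decode())
```

Comparing the printed URL and parameters with the `pvesh create` command makes it easier to see which options would need to change in the script.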
Using this link, you can view the migration options and change them in the script to suit your needs: https://pve.proxmox.com/pve-docs/api-viewer/#/nodes/{node}/qemu/{vmid}/migrate
Changes will need to be made in this block
I hope I was able to help you.
Thank you, yes, very helpful! Fine to close this as it is not an issue. I'm still testing in my environment; I'll report back if I have any more issues.
Is this expected behavior?
I have two nodes that are already nearly balanced, so this could be the reason why. See my screenshot below:
![image](https://user-images.githubusercontent.com/9312603/166742131-536a3b32-8729-4731-81bb-ae21a9e15927.png)