gainsley opened 1 year ago
How we get the total number of GPUs on the cloudlet might depend on the type of IaaS underneath. For example, on OpenStack we can use the `openstack hypervisor list` and `openstack hypervisor show <name>` commands to count devices that match the `pci_passthrough:alias='t4gpu:1'` name. This would be different for other types of IaaS.
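One way to keep those platform differences contained is behind a small per-IaaS interface. The sketch below is illustrative only; `GpuCounter` and `fakePlatform` are hypothetical names, not existing edge-cloud APIs:

```go
package main

import "fmt"

// GpuCounter is a hypothetical per-IaaS abstraction: each platform
// (OpenStack, VMware, etc.) implements its own way of counting GPUs.
type GpuCounter interface {
	// TotalGpus returns the total number of GPU devices on the cloudlet.
	TotalGpus() (int, error)
}

// fakePlatform stands in for a real platform implementation; an
// OpenStack version would shell out to the hypervisor commands above.
type fakePlatform struct {
	gpus int
}

func (f *fakePlatform) TotalGpus() (int, error) {
	return f.gpus, nil
}

func main() {
	var p GpuCounter = &fakePlatform{gpus: 4}
	n, err := p.TotalGpus()
	if err != nil {
		panic(err)
	}
	fmt.Println(n)
}
```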
Another approach to getting the total number of GPUs available on a given cloudlet is explicit configuration during cloudlet creation. This field can later be updated if more GPUs are added, but some degree of manual work is required to keep it current. In addition, for some OpenStack instances that we share across different setups for testing, this would not accurately represent the totals; on the other hand, it might allow us to actually partition those GPUs if we need to.
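A minimal sketch of what the explicit-configuration approach could look like; the struct and field names are illustrative, not the actual edge-cloud schema:

```go
package main

import "fmt"

// CloudletResources is an illustrative record of operator-configured
// totals; in practice this would sit alongside the existing
// vcpu/mem/disk fields in the cloudlet object.
type CloudletResources struct {
	VcpusTotal int
	MemMbTotal int
	GpusTotal  int // set by the operator at cloudlet creation
}

// UpdateGpus lets the operator bump the total when GPUs are added.
func (c *CloudletResources) UpdateGpus(n int) {
	c.GpusTotal = n
}

func main() {
	c := CloudletResources{VcpusTotal: 64, MemMbTotal: 131072, GpusTotal: 2}
	c.UpdateGpus(4) // hardware upgrade: two more GPUs installed
	fmt.Println(c.GpusTotal)
}
```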
With respect to current usage, there are also a couple of ways:
Script to count current GPU usage:

```bash
#!/bin/bash
# Read all flavor IDs into an array
readarray -t flavor_ids < <(openstack flavor list -f value -c ID)

# Iterate over the array and check each flavor
for flavor_id in "${flavor_ids[@]}"; do
    # Match flavors that request a T4 GPU via PCI passthrough
    if openstack flavor show "$flavor_id" | grep -q "pci_passthrough:alias='t4gpu:1'"; then
        # List all instances using this flavor
        openstack server list --flavor "$flavor_id" -f value -c ID -c Name
    fi
done
```
We should have (and already do have) both. We already track vcpu/mem/disk usage based on our database objects, and we also query the underlying infra for what it thinks is there. In controller-data.go, there is vmResourceActionEnd(), which is called after every change to the infra and collects the resources in use as reported by the infra API. We just need to add GPUs to this.
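The shape of that change might look like the following sketch. The function and field names here are hypothetical, modeled loosely on the vcpu/mem/disk accounting described above, not the actual controller-data.go code:

```go
package main

import "fmt"

// InfraResources is an illustrative snapshot of what the infra API
// reports for a single VM (or, when summed, for the whole cloudlet).
type InfraResources struct {
	VcpusUsed int
	MemMbUsed int
	GpusUsed  int // the new field to collect alongside vcpu/mem/disk
}

// collectUsage tallies per-VM usage the way a vmResourceActionEnd-style
// collector might; vms is hypothetical input data from the infra API.
func collectUsage(vms []InfraResources) InfraResources {
	var total InfraResources
	for _, vm := range vms {
		total.VcpusUsed += vm.VcpusUsed
		total.MemMbUsed += vm.MemMbUsed
		total.GpusUsed += vm.GpusUsed
	}
	return total
}

func main() {
	vms := []InfraResources{
		{VcpusUsed: 4, MemMbUsed: 8192, GpusUsed: 1},
		{VcpusUsed: 8, MemMbUsed: 16384, GpusUsed: 0},
	}
	fmt.Println(collectUsage(vms).GpusUsed)
}
```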
Outside of that, we probably want a command like "analyze cloudlet resources" that would report platform-specific resources (in the case of OpenStack: servers, ports, security groups, etc.), their limits (if any), their creation time, and any consumed basic resources (vcpu/mem/disk/gpu) that are found in use but not accounted for by our database records. This will let us determine whether they are valid (belonging to another CRM sharing the infra) or dangling. We may also need to provide APIs to delete such resources when they are dangling. This is different from what controller-data is currently doing, because controller-data tracks infra-independent resources, whereas this would report infra-specific resources.
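The core of such an analyze command is a set difference between what the infra reports and what our database records account for. A sketch of that comparison, with illustrative string IDs standing in for OpenStack resource UUIDs:

```go
package main

import "fmt"

// findDangling returns infra-reported resource IDs that have no matching
// database record; these are the candidates to classify as either valid
// (owned by another CRM sharing the infra) or dangling.
func findDangling(infraIDs, dbIDs []string) []string {
	known := make(map[string]bool, len(dbIDs))
	for _, id := range dbIDs {
		known[id] = true
	}
	var dangling []string
	for _, id := range infraIDs {
		if !known[id] {
			dangling = append(dangling, id)
		}
	}
	return dangling
}

func main() {
	// Hypothetical IDs: the infra reports three resources, but our
	// database only accounts for one of them.
	infra := []string{"server-a", "server-b", "port-x"}
	db := []string{"server-a"}
	fmt.Println(findDangling(infra, db))
}
```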
Currently there is no way on any of the infras (but particularly OpenStack) to see how many GPU resources are available and how many are being used.
This makes it hard to determine whether "no valid host found" errors are due to insufficient GPU resources or to some other issue.