gainsley opened 1 year ago
How we get the total number of GPUs on the cloudlet might depend on the type of IaaS underneath. For example, on OpenStack we can use the `openstack hypervisor list` and `openstack hypervisor show <name>` commands to count devices that match the `pci_passthrough:alias='t4gpu:1'` name. This would be different for other types of IaaS.
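One way to keep those platform differences contained is behind a small per-IaaS interface. The sketch below is illustrative only; `GpuCounter` and `fakePlatform` are hypothetical names, not existing edge-cloud APIs:

```go
package main

import "fmt"

// GpuCounter is a hypothetical per-IaaS abstraction: each platform
// (OpenStack, VMware, etc.) implements its own way of counting GPUs.
type GpuCounter interface {
	// TotalGpus returns the total number of GPU devices on the cloudlet.
	TotalGpus() (int, error)
}

// fakePlatform stands in for a real platform implementation; an
// OpenStack version would shell out to the hypervisor commands above.
type fakePlatform struct {
	gpus int
}

func (f *fakePlatform) TotalGpus() (int, error) {
	return f.gpus, nil
}

func main() {
	var p GpuCounter = &fakePlatform{gpus: 4}
	n, err := p.TotalGpus()
	if err != nil {
		panic(err)
	}
	fmt.Println(n)
}
```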
Another approach to getting the total number of GPUs available on a given cloudlet is explicit configuration during cloudlet creation. This field can later be updated if more GPUs are added, but some degree of manual work is required to keep it current. In addition, for some OpenStack instances that we share across different setups for testing, this would not accurately represent the totals; on the other hand, it might allow us to actually partition those GPUs if we need to.
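A minimal sketch of what the explicit-configuration approach could look like; the struct and field names are illustrative, not the actual edge-cloud schema:

```go
package main

import "fmt"

// CloudletResources is an illustrative record of operator-configured
// totals; in practice this would sit alongside the existing
// vcpu/mem/disk fields in the cloudlet object.
type CloudletResources struct {
	VcpusTotal int
	MemMbTotal int
	GpusTotal  int // set by the operator at cloudlet creation
}

// UpdateGpus lets the operator bump the total when GPUs are added.
func (c *CloudletResources) UpdateGpus(n int) {
	c.GpusTotal = n
}

func main() {
	c := CloudletResources{VcpusTotal: 64, MemMbTotal: 131072, GpusTotal: 2}
	c.UpdateGpus(4) // hardware upgrade: two more GPUs installed
	fmt.Println(c.GpusTotal)
}
```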
With respect to current usage, there are also a couple of ways:
Script to count current GPU usage:

```bash
#!/bin/bash
# Read all flavor IDs into an array
readarray -t flavor_ids < <(openstack flavor list -f value -c ID)

# Iterate over the array and check each flavor
for flavor_id in "${flavor_ids[@]}"; do
    # Match flavors that request a T4 GPU via PCI passthrough
    if openstack flavor show "$flavor_id" | grep -q "pci_passthrough:alias='t4gpu:1'"; then
        # List all instances using this flavor
        openstack server list --flavor "$flavor_id" -f value -c ID -c Name
    fi
done
```
We should have (and already do have) both. We already track vcpu/mem/disk usage based on our database objects, and we also query the underlying infra for what it thinks is there. In controller-data.go, there is vmResourceActionEnd(), which is called after every change to the infra and collects the resources in use as reported by the infra API. We just need to add GPUs to this.
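The shape of that change might look like the following sketch. The function and field names here are hypothetical, modeled loosely on the vcpu/mem/disk accounting described above, not the actual controller-data.go code:

```go
package main

import "fmt"

// InfraResources is an illustrative snapshot of what the infra API
// reports for a single VM (or, when summed, for the whole cloudlet).
type InfraResources struct {
	VcpusUsed int
	MemMbUsed int
	GpusUsed  int // the new field to collect alongside vcpu/mem/disk
}

// collectUsage tallies per-VM usage the way a vmResourceActionEnd-style
// collector might; vms is hypothetical input data from the infra API.
func collectUsage(vms []InfraResources) InfraResources {
	var total InfraResources
	for _, vm := range vms {
		total.VcpusUsed += vm.VcpusUsed
		total.MemMbUsed += vm.MemMbUsed
		total.GpusUsed += vm.GpusUsed
	}
	return total
}

func main() {
	vms := []InfraResources{
		{VcpusUsed: 4, MemMbUsed: 8192, GpusUsed: 1},
		{VcpusUsed: 8, MemMbUsed: 16384, GpusUsed: 0},
	}
	fmt.Println(collectUsage(vms).GpusUsed)
}
```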
Outside of that, we probably want a command like "analyze cloudlet resources" that would report platform-specific resources (in the case of OpenStack: servers, ports, security groups, etc.), their limits (if any), their creation time, and any consumed basic resources (vcpu/mem/disk/gpu) that are found in use but not accounted for by our database records. This will let us determine whether they are valid (belonging to another CRM sharing the infra) or dangling. We may also need to provide APIs to delete such resources when they are dangling. This is different from what controller-data is currently doing, because controller-data tracks infra-independent resources, whereas this would report infra-specific resources.
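The core of such an analyze command is a set difference between what the infra reports and what our database records account for. A sketch of that comparison, with illustrative string IDs standing in for OpenStack resource UUIDs:

```go
package main

import "fmt"

// findDangling returns infra-reported resource IDs that have no matching
// database record; these are the candidates to classify as either valid
// (owned by another CRM sharing the infra) or dangling.
func findDangling(infraIDs, dbIDs []string) []string {
	known := make(map[string]bool, len(dbIDs))
	for _, id := range dbIDs {
		known[id] = true
	}
	var dangling []string
	for _, id := range infraIDs {
		if !known[id] {
			dangling = append(dangling, id)
		}
	}
	return dangling
}

func main() {
	// Hypothetical IDs: the infra reports three resources, but our
	// database only accounts for one of them.
	infra := []string{"server-a", "server-b", "port-x"}
	db := []string{"server-a"}
	fmt.Println(findDangling(infra, db))
}
```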
Currently there is no way on any of the infras (but particularly OpenStack) to see how many GPU resources are available and how many are being used.
This makes it hard to determine whether "no valid host found" errors are due to insufficient GPU resources or to some other issue.