CCI-MOC / openshift-usage-scripts

Create a way to differentiate between Lenovo A100 GPU usage and other A100 usage in OpenShift #40

Closed joachimweyl closed 6 months ago

joachimweyl commented 7 months ago

Motivation

We need to provide invoice data to Lenovo for only their A100s so that we pay them for the time their GPUs are used. Closely related to this issue.

Completion Criteria

Invoicing data has a way to distinguish Lenovo GPUs from non-Lenovo GPUs, or we generate a separate invoice for Lenovo that shows only their data.

Description

Completion dates

Desired - 2024-02-27

Required - 2024-04-05

joachimweyl commented 7 months ago

@naved001 you mentioned "prometheus returns the name of the hypervisor so that’s good." It sounds like this would require having a list of all Lenovo hypervisors to confirm whether a given hypervisor is Lenovo. Is that correct?

joachimweyl commented 7 months ago

If the estimate of 5 is too high given how easy it is to get the hypervisor name, please feel free to lower it.

naved001 commented 7 months ago

"it sounds like this would require having a list of all Lenovo hypervisors to confirm if they are Lenovo. Is that correct?"

Exactly. Or, if the name of the hypervisor has "lenovo" (or some other identifying info) in it, then we don't have to maintain a list.
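The two options above (a maintained list, or an identifying string in the hypervisor name) could be sketched roughly as follows. This is only an illustration under assumptions, not code from this repository: `is_lenovo_node`, `LENOVO_NODES`, the `"lenovo"` marker, and all node names are hypothetical.

```python
# Hypothetical allowlist of Lenovo-owned nodes (lowercase).
# The real node names are not known at this point in the thread.
LENOVO_NODES = frozenset({"node-a", "node-b"})


def is_lenovo_node(node_name, known_nodes=LENOVO_NODES, marker="lenovo"):
    """Return True if the node is in the allowlist or its name contains the marker."""
    name = node_name.lower()
    return name in known_nodes or marker in name


print(is_lenovo_node("node-a"))         # matched through the allowlist
print(is_lenovo_node("wrk-lenovo-07"))  # matched through the name marker
print(is_lenovo_node("wrk-12"))         # matched by neither
```

If the node names turn out not to carry any identifying string, the marker check never matches and only the list matters, which is the maintenance cost raised above.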

msdisme commented 7 months ago

Do we need/have a second issue for capturing OpenShift multi-instance GPU (MIG) costs? https://www.redhat.com/en/blog/multi-instance-gpu-support-with-the-gpu-operator-v1.7.0?extIdCarryOver=true&sc_cid=701f2000001OH6fAAG

joachimweyl commented 7 months ago

@msdisme https://github.com/CCI-MOC/ops-issues/issues/1039 is for testing MIG.

joachimweyl commented 7 months ago

As we currently have no other A100s in OpenShift, this issue is a lower priority.

joachimweyl commented 6 months ago

Awaiting OpenShift testing to find out the exact compute node name; then we can differentiate.

joachimweyl commented 6 months ago

It sounds like the way to differentiate is to create a list of compute nodes that are in this 1st batch of Lenovo loans and track it that way. Work to put this into use will be done in this issue.
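As a rough sketch of how a node list could feed the invoicing split asked for in the completion criteria, the snippet below totals GPU hours separately for listed and unlisted nodes. Everything in it is an assumption for illustration: `split_gpu_hours`, `LENOVO_LOAN_NODES`, the node names, and the `(node, gpu_hours)` record shape are hypothetical and do not reflect the actual data model of the usage scripts.

```python
from collections import defaultdict

# Hypothetical node list for the first batch of Lenovo loans.
LENOVO_LOAN_NODES = frozenset({"gpu-node-1", "gpu-node-2"})


def split_gpu_hours(records, lenovo_nodes=LENOVO_LOAN_NODES):
    """Sum GPU hours per owner ("lenovo" or "other") from (node, gpu_hours) pairs."""
    totals = defaultdict(float)
    for node, gpu_hours in records:
        owner = "lenovo" if node in lenovo_nodes else "other"
        totals[owner] += gpu_hours
    return dict(totals)


# Made-up usage records: (node name, GPU hours).
sample = [("gpu-node-1", 10.0), ("gpu-node-3", 4.5), ("gpu-node-2", 2.0)]
totals = split_gpu_hours(sample)
print(totals)
```

The `"lenovo"` total would back a Lenovo-only invoice, and keeping both totals in one result covers the alternative of a single invoice that tracks the difference.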