xdmod-openshift-scripts

xdmod-openshift-scripts contains a script that pulls metric data from an OpenShift Prometheus endpoint and formats it into a log file suitable for shredding by XDMoD.

Usage

Before running the script, log in to the OpenShift cluster with oc login.

When running the script, there are two methods of specifying the OpenShift Prometheus endpoint. The first is through an environment variable:

    $ export OPENSHIFT_PROMETHEUS_URL=<prometheus url>
    $ python openshift_metrics/openshift_prometheus_metrics.py

The second is directly on the command line:

    $ python openshift_metrics/openshift_prometheus_metrics.py --openshift-url <prometheus url>
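
The two ways of supplying the endpoint can be sketched in Python as below. This is only an illustration, not the script's actual code: the function name resolve_prometheus_url is hypothetical, and the sketch assumes that the command-line flag takes precedence over the environment variable, which this README does not state.

```python
import argparse


def resolve_prometheus_url(argv, environ):
    """Return the Prometheus URL from --openshift-url or the environment.

    Assumption: the command-line flag wins when both are set.
    """
    parser = argparse.ArgumentParser()
    # The environment variable acts as the default for the flag.
    parser.add_argument(
        "--openshift-url",
        default=environ.get("OPENSHIFT_PROMETHEUS_URL"),
    )
    args = parser.parse_args(argv)
    if not args.openshift_url:
        raise ValueError(
            "set OPENSHIFT_PROMETHEUS_URL or pass --openshift-url"
        )
    return args.openshift_url
```

In a real script, argv would typically be sys.argv[1:] and environ would be os.environ.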

By default the script will pull data from the previous day. You can also specify a different date:

    $ python openshift_metrics/openshift_prometheus_metrics.py --report-date 2022-03-14

The script will generate a log file in the current directory: 2022-03-14.log. That log can then be shredded into XDMoD as follows:

    $ xdmod-shredder -f slurm -i 2022-03-14.log -r <xdmod resource>
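
The report date default and the log file name described above can be sketched as follows. The helper names are hypothetical, and the sketch assumes "previous day" is computed from the local calendar date; the script may use UTC instead.

```python
import datetime


def resolve_report_date(report_date=None, today=None):
    """Return the date to report on.

    report_date: optional YYYY-MM-DD string, as given to --report-date.
    today: injectable for testing; defaults to the local date (assumption).
    """
    if report_date:
        # fromisoformat also validates the YYYY-MM-DD format.
        return datetime.date.fromisoformat(report_date)
    today = today or datetime.date.today()
    return today - datetime.timedelta(days=1)


def log_file_name(date):
    """Name the log file after the report date, e.g. 2022-03-14.log."""
    return f"{date.isoformat()}.log"
```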

How It Works

The openshift_prometheus_metrics.py script retrieves metrics at the pod level. It does so by sending the following range query to the Prometheus HTTP API:

    <prometheus_url>/api/v1/query_range?query=<metric>&start=<report_date>T00:00:00Z&end=<report_date>T23:59:59Z&step=60s
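
As a rough illustration, the request URL above could be assembled like this. The function name build_query_range_url and the example host are hypothetical. Note that urlencode percent-encodes characters such as braces, quotes, and colons, so the resulting string looks different from the template but decodes to the same parameters.

```python
from urllib.parse import urlencode


def build_query_range_url(prometheus_url, metric, report_date, step="60s"):
    """Build a query_range URL covering one full UTC day."""
    params = {
        "query": metric,
        "start": f"{report_date}T00:00:00Z",
        "end": f"{report_date}T23:59:59Z",
        "step": step,
    }
    base = prometheus_url.rstrip("/")
    return f"{base}/api/v1/query_range?{urlencode(params)}"


url = build_query_range_url(
    "https://prometheus.example.com",  # placeholder host
    'kube_pod_resource_request{unit="cores"}',
    "2022-03-14",
)
```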

This query returns one sample per minute. The script then merges consecutive samples whose metric values are the same.
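
The merging step can be pictured with the minimal sketch below. It is not the script's implementation: merge_samples is a hypothetical name, samples are assumed to be (unix_timestamp, value) pairs sorted by time, and a gap longer than one step is assumed to start a new interval.

```python
def merge_samples(samples, step=60):
    """Collapse runs of consecutive samples that share the same value.

    Returns a list of dicts with the first and last timestamp of each run.
    """
    merged = []
    for ts, value in samples:
        last = merged[-1] if merged else None
        if last and last["value"] == value and ts - last["end"] <= step:
            # Same value and no gap: extend the current interval.
            last["end"] = ts
        else:
            # Value changed or samples are missing: start a new interval.
            merged.append({"start": ts, "end": ts, "value": value})
    return merged
```

For example, samples at 0 s and 60 s with the same value collapse into a single interval, while a sample with a different value, or one that arrives after a gap, opens a new interval.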

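The script queries the following metrics (those referenced in the column mapping below):

    kube_pod_resource_request{unit="cores"}
    kube_pod_resource_limit{unit="cores"}
    kube_pod_resource_limit{unit="bytes"}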

The script also retrieves further information, such as the cluster name, cf_pi, and cf_project_id, from annotations.

Each hourly sample corresponds to a single entry in the Slurm job table, so a pod that runs for three hours will generate three or four entries. As a result, queries that depend on a job's exact start or end time will be inaccurate.
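
One way to see why a three-hour pod yields three or four entries is to split its run time on hour boundaries. The sketch below assumes that entries are cut at the top of each hour, which is an interpretation rather than something this README confirms; split_on_hour_boundaries is a hypothetical helper.

```python
from datetime import datetime, timedelta


def split_on_hour_boundaries(start, end):
    """Split [start, end) into pieces that never cross the top of an hour."""
    entries = []
    cursor = start
    while cursor < end:
        # The next top-of-hour strictly after the cursor.
        next_hour = cursor.replace(minute=0, second=0, microsecond=0)
        next_hour += timedelta(hours=1)
        piece_end = min(next_hour, end)
        entries.append((cursor, piece_end))
        cursor = piece_end
    return entries


aligned = split_on_hour_boundaries(
    datetime(2022, 3, 14, 10, 0), datetime(2022, 3, 14, 13, 0)
)
unaligned = split_on_hour_boundaries(
    datetime(2022, 3, 14, 10, 30), datetime(2022, 3, 14, 13, 30)
)
```

Under that assumption, a pod that starts exactly on the hour produces three pieces, while one that starts at half past produces four.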

The correspondence between Slurm job columns and OpenShift information is as follows:

| Slurm column | OpenShift equivalent |
| --- | --- |
| job_id | autogenerated by script |
| job_id_raw | autogenerated by script |
| cluster_name | openshift cluster annotation |
| partition_name | blank |
| qos_name | blank |
| account_name | openshift namespace |
| group_name | cf_pi annotation |
| gid_number | cf_project_id annotation |
| user_name | cf_pi annotation |
| uid_number | cf_project_id annotation |
| start_time | beginning of sample time |
| end_time | end of sample time |
| submit_time | set to start_time |
| eligible_time | set to start_time |
| elapsed | end_time - start_time |
| timelimit | set to elapsed |
| exit_code | blank |
| state | RUNNING |
| nnodes | 1 |
| ncpus | kube_pod_resource_request{unit="cores"} |
| req_cpus | kube_pod_resource_limit{unit="cores"} |
| req_mem | kube_pod_resource_limit{unit="bytes"} |
| req_tres | cpu=,mem= |
| alloc_tres | set to req_tres |
| node_list | blank |
| job_name | openshift pod name |
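
The mapping above can be expressed as a small function that turns one merged sample into a record keyed by Slurm column name. This is a sketch under several assumptions: the keys of the pod dictionary are hypothetical, timestamps are plain Unix seconds, job_id_raw is taken to equal job_id, and req_tres is passed in by the caller because its exact format is not spelled out here. It does not reproduce the log line format that the script writes.

```python
def to_slurm_record(job_id, pod, start_time, end_time, req_tres=""):
    """Map one pod sample onto Slurm job columns, following the table above."""
    elapsed = end_time - start_time
    return {
        "job_id": job_id,
        "job_id_raw": job_id,  # assumption: same autogenerated value
        "cluster_name": pod["cluster"],
        "partition_name": "",
        "qos_name": "",
        "account_name": pod["namespace"],
        "group_name": pod["cf_pi"],
        "gid_number": pod["cf_project_id"],
        "user_name": pod["cf_pi"],
        "uid_number": pod["cf_project_id"],
        "start_time": start_time,
        "end_time": end_time,
        "submit_time": start_time,
        "eligible_time": start_time,
        "elapsed": elapsed,
        "timelimit": elapsed,
        "exit_code": "",
        "state": "RUNNING",
        "nnodes": 1,
        "ncpus": pod["cpu_request"],
        "req_cpus": pod["cpu_limit"],
        "req_mem": pod["memory_limit"],
        "req_tres": req_tres,
        "alloc_tres": req_tres,
        "node_list": "",
        "job_name": pod["pod_name"],
    }
```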