sebastian-luna-valero opened this issue 2 years ago
One solution would be to use Prometheus and Grafana on the k8s cluster where openEO platform is deployed.
I have a couple of questions for @jdries
cc: @Jaapel @backeb @zbenta @maricaantonacci
On our cluster, we have a few tools to monitor usage. The main tool is Prometheus (deployed via https://github.com/prometheus-operator/kube-prometheus). Together with Prometheus, a Grafana and Alertmanager instance is provisioned.
When we want to look more closely into the specific cost of a job, namespace, etc., we consult kubecost. It allows us to define custom prices for CPU, memory and storage. Under the hood, kubecost uses Prometheus for its metrics.
Thanks @tcassaert
So there is nothing preventing us from gathering CPU and RAM usage for openEO jobs, right?
Any job you run is just a regular collection of pods (driver and executor(s)). The difficult part is correctly filtering the usage per job.
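For illustration, a minimal sketch of that per-job filtering, assuming Prometheus (e.g. the kube-prometheus stack mentioned above) scrapes the kubelet/cAdvisor metrics and that the driver and executor pods share the job id as a name prefix (as in the kubecost example further down); the Prometheus URL is a placeholder:

```python
import requests

# Placeholder: adjust to the Prometheus instance of your cluster.
PROMETHEUS_URL = "http://prometheus.example.org"


def job_cpu_usage(job_id, namespace="spark-jobs"):
    # Per-second CPU usage over the last 5 minutes, summed over every
    # container of every pod whose name starts with the job id
    # (i.e. the Spark driver and its executors).
    promql = (
        f'sum(rate(container_cpu_usage_seconds_total'
        f'{{namespace="{namespace}", pod=~"{job_id}.*"}}[5m]))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    print(job_cpu_usage("job-06d808a4"))
```

The same pod regex can be reused for memory, e.g. with container_memory_working_set_bytes.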
Not that I am an expert, but have you explored this? https://github.com/LucaCanali/sparkMeasure
Thanks to @enolfc for pointing in the right direction!
In our notebooks deployment, we use Prometheus and pod annotations to distinguish pods from one another, via the metricAnnotationsAllowList option (see the kube-state-metrics values.yaml).
If you can annotate all the pods related to a job, then this could help, as sketched below.
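A minimal sketch of how that annotation route could be queried, assuming kube-state-metrics is configured with metricAnnotationsAllowList for a hypothetical openeo.org/job_id pod annotation, which it would then expose as a label on the kube_pod_annotations metric; the Prometheus URL is a placeholder:

```python
import requests

# Placeholder: adjust to the Prometheus instance of your cluster.
PROMETHEUS_URL = "http://prometheus.example.org"


def pods_for_job(job_id, namespace="spark-jobs"):
    # kube-state-metrics (with the annotation in metricAnnotationsAllowList)
    # exposes the hypothetical openeo.org/job_id pod annotation as the label
    # annotation_openeo_org_job_id on the kube_pod_annotations metric.
    promql = (
        f'kube_pod_annotations{{namespace="{namespace}", '
        f'annotation_openeo_org_job_id="{job_id}"}}'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return [series["metric"]["pod"] for series in resp.json()["data"]["result"]]
```

The returned pod names could then be fed into per-pod CPU/memory queries, or into kubecost's filterPods parameter.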
Hi @tcassaert
Any feedback from your side?
I personally don't have any experience with the sparkMeasure tool you linked above. From reading the readme, I'm not sure if it's targeted at Kubernetes deployments.
We're currently leveraging kubecost with a script that talks to its API to retrieve any cost metrics for a certain job:
```python
import requests


def get_total_cost(url, namespace, pod, window):
    # Query the kubecost allocation API, aggregated per namespace and
    # filtered down to the pods of a single job.
    params = (
        ('aggregate', 'namespace'),
        ('filterNamespaces', namespace),
        ('filterPods', pod),
        ('window', window),
        ('accumulate', 'true'),
    )
    total_cost = requests.get(url, params=params).json()
    print(total_cost['data'][0][namespace]['totalCost'])


def main():
    url = "http://kubecost.kube-dev.vgt.vito.be/model/allocation"
    namespace = "spark-jobs"
    pod = "job-06d808a4-jenkins-driver"
    window = "30d"
    get_total_cost(url, namespace, pod, window)


if __name__ == "__main__":
    main()
```
> From reading the readme, I'm not sure if it's targeted at Kubernetes deployments.
Apologies, I didn't dig deep enough then.
> We're currently leveraging kubecost with a script that talks to its API to retrieve any cost metrics for a certain job
Great!
In a previous meeting [1] @jdries wasn't sure whether CPU accounting was in place for openEO jobs. A bit of context: this is to profile the notebook of this repository, and compare performance across different deployments.
I believe that this issue can now be closed?
[1] https://confluence.egi.eu/display/CSCALE/2022-10-04+Aquamonitor+monthly+progress+meeting
@sebastian-luna-valero I think we now have the method written down, but it does depend on having this kubecost service available, and I don't think this is the case by default. @zbenta @maricaantonacci is it ok for you to set up this additional service? (Or to include it in the TOSCA template?)
Thanks, FYI: https://www.kubecost.com/pricing
Good morning everyone,
We are still waiting for someone to validate that our endpoint is working correctly. Since no one has validated it, we thought the best way was to try the Jupyter notebook that Jaap wrote, but without success. It looks like openEO needs ZooKeeper to work properly; we have deployed ZooKeeper and have found an issue with Cinder that we are trying to overcome. Only after solving that issue will we look into kubecost.
Best Regards, Zacarias Benta
@tcassaert can we check if using https://github.com/kubernetes/kube-state-metrics would work as an alternative to kubecost?
Hi,
@tcassaert could you please let us know your thoughts about @jdries' question above?
xref: https://github.com/c-scale-community/use-case-aquamonitor/issues/26#issuecomment-1471562477
@sebastian-luna-valero I will check it out and let you know if it can support our usecase.
@Jaapel could you describe what exactly you want to have as metrics?
I am looking to gather CPU and memory usage, as well as the total time spent on a job, to create a story like: "The algorithm was applied to area X, both at the INCD backend and at the VITO backend. The algorithm took X:XX hours to run, consuming X CPU hours and X GB of memory over that time, divided over X worker nodes." If you have a better suggestion on how to reflect job performance and report on it, I am open to suggestions of course!
Ideally we would also say something about how many GB of data was processed in the job, but I feel that is not critical for reporting.
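To make this concrete, a rough sketch of how such a report could be pulled from Prometheus, assuming the kubelet/cAdvisor metrics are scraped and the job's pods share the job id as a name prefix; the URL, prefix and window are placeholders:

```python
import requests

# Placeholders: adjust URL, namespace and pod prefix to your deployment.
PROMETHEUS_URL = "http://prometheus.example.org"


def query(promql):
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]


def job_report(job_id, namespace="spark-jobs", window="30d"):
    selector = f'namespace="{namespace}", pod=~"{job_id}.*"'
    # Total CPU seconds consumed by all pods of the job over the window,
    # converted to CPU hours.
    cpu = query(f"sum(increase(container_cpu_usage_seconds_total{{{selector}}}[{window}]))")
    # Highest working-set memory observed for any single pod of the job.
    mem = query(f"max(max_over_time(container_memory_working_set_bytes{{{selector}}}[{window}]))")
    return {
        "cpu_hours": float(cpu[0]["value"][1]) / 3600 if cpu else 0.0,
        "peak_memory_gb": float(mem[0]["value"][1]) / 1e9 if mem else 0.0,
    }


if __name__ == "__main__":
    print(job_report("job-06d808a4"))
```

The total wall-clock time of the job would still have to come from the openEO job metadata (or from the pods' start/finish timestamps).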
@sebastian-luna-valero kube-state-metrics is not a tool to get specific metrics about running pods/jobs/... As the name already hints, it's more used to report the state of objects in Kubernetes, such as how many pods are currently ready to serve requests.
From my point of view, we have the most chance of getting accurate metrics with https://github.com/prometheus-operator/kube-prometheus and https://github.com/opencost/opencost, or a combination of both.
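For reference, a rough sketch of combining the two: opencost exports allocation and node price metrics to Prometheus, so an approximate hourly CPU cost for a job's pods can be computed from the same Prometheus instance. The metric names container_cpu_allocation and node_cpu_hourly_cost are my assumption about the opencost exporter and should be verified against the deployed version; the URL and pod prefix are placeholders:

```python
import requests

# Placeholder: adjust to the Prometheus instance that scrapes opencost.
PROMETHEUS_URL = "http://prometheus.example.org"


def approx_hourly_cpu_cost(job_id, namespace="spark-jobs"):
    # Allocated CPU cores of the job's pods (assumed metric:
    # container_cpu_allocation) priced at the average per-core-hour cost
    # reported by opencost (assumed metric: node_cpu_hourly_cost).
    promql = (
        f'sum(container_cpu_allocation{{namespace="{namespace}", pod=~"{job_id}.*"}})'
        f' * avg(node_cpu_hourly_cost)'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```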
thanks @tcassaert
@enolfc mentioned above a combination of kube-state-metrics and Prometheus, in order to annotate/account for usage. I am not familiar with this myself, sorry.
cc: @zbenta
Would something like this do the trick?
We've launched job 'j-83bb00fec3f748a795d19afd8a0babec' and got the CPU usage of each pod:
As well as the consumed memory:
That's really useful to see the current live usage of a pod, but I don't think you can get reporting out of it, as in: this job used so many cores and so much memory in total.
After more investigation, opencost itself is quite limited in usage. We'd need the kubecost distribution of opencost, but the free plan is a little limited regarding how long metrics are stored (15 days).
Looks like we can find most useful metrics in Prometheus, but I still have to create a really useful dashboard.
We set up opencost but are unable to access the UI. We created the ingress, but the URL returns a blank page. We also tried using the ClusterIP and curl to "see" something, but we get nothing interesting.
[centos@k8s-cscale-k8s-master-nf-1 opencost]$ curl 10.233.42.13:9090
<!DOCTYPE html><html><head><meta content="text/html;charset=utf-8" http-equiv="Content-Type"><meta content="utf-8" http-equiv="encoding"><link rel="icon" href="/favicon.7eff484d.ico"><link rel="stylesheet" href="/index.89ba2f7f.css"></head><body> <div id="app" class="page-container"></div> <script src="/index.2a66e5fc.js" type="module"></script><script src="/index.ad60b673.js" nomodule="" defer></script> </body></html>[centos@k8s-cscale-k8s-master-nf-1 opencost]$
We then proceeded to use opencost as another Prometheus metric exporter and downloaded this dashboard: https://grafana.com/grafana/dashboards/8670-cluster-cost-utilization-metrics/
Some parts of the dashboard seem to work, but we don't have a clue which information is relevant. Actually, we have no clue what information you need from our deployment.
Thanks @zbenta For context, see https://github.com/c-scale-community/use-case-aquamonitor/issues/30#issuecomment-1471679665
I believe that the information requested can be seen in the dashboard we presented at: https://github.com/c-scale-community/use-case-aquamonitor/issues/30#issuecomment-1473483523
You can see the total amount of CPU used and the memory a specific job consumed.
In order to compare the results of performance tests in Aquamonitor we need to measure CPU usage with openEO.
Let's use this issue to discuss how this can be done.