sebastian-luna-valero opened this issue 2 years ago
One solution would be to use Prometheus and Grafana on the k8s cluster where openEO platform is deployed.
I have a couple of questions for @jdries
cc: @Jaapel @backeb @zbenta @maricaantonacci
On our cluster, we have a few tools to monitor usage. The main tool is Prometheus (deployed via https://github.com/prometheus-operator/kube-prometheus). Together with Prometheus, a Grafana and Alertmanager instance is provisioned.
When we want to look more closely into the specific cost of a job, namespace, etc., we consult kubecost. It allows us to define custom prices for CPU, memory and storage. Under the hood, kubecost uses Prometheus for its metrics.
Thanks @tcassaert
So there is nothing preventing us from gathering CPU and RAM usage for openEO jobs, right?
Any job you run is just a regular collection of pods (driver and executor(s)). The difficult part is correctly filtering the usage per job.
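For illustration, a minimal sketch of that per-job filtering, assuming Prometheus (e.g. the kube-prometheus stack mentioned above) scrapes the kubelet/cAdvisor metrics and that the driver and executor pods share the job id as a name prefix (as in the kubecost example further down); the Prometheus URL is a placeholder:

```python
import requests

# Placeholder: adjust to the Prometheus instance of your cluster.
PROMETHEUS_URL = "http://prometheus.example.org"


def job_cpu_usage(job_id, namespace="spark-jobs"):
    # Per-second CPU usage over the last 5 minutes, summed over every
    # container of every pod whose name starts with the job id
    # (i.e. the Spark driver and its executors).
    promql = (
        f'sum(rate(container_cpu_usage_seconds_total'
        f'{{namespace="{namespace}", pod=~"{job_id}.*"}}[5m]))'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


if __name__ == "__main__":
    print(job_cpu_usage("job-06d808a4"))
```

The same pod regex can be reused for memory, e.g. with container_memory_working_set_bytes.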
Not that I am an expert, but have you explored this? https://github.com/LucaCanali/sparkMeasure
Thanks to @enolfc for pointing in the right direction!
In our notebooks deployment, we use Prometheus and pod annotations to distinguish pods from one another, via the metricAnnotationsAllowList option (see the kube-state-metrics values.yaml).
If you can annotate all the pods related to a job, then this could help, as sketched below.
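A minimal sketch of how that annotation route could be queried, assuming kube-state-metrics is configured with metricAnnotationsAllowList for a hypothetical openeo.org/job_id pod annotation, which it would then expose as a label on the kube_pod_annotations metric; the Prometheus URL is a placeholder:

```python
import requests

# Placeholder: adjust to the Prometheus instance of your cluster.
PROMETHEUS_URL = "http://prometheus.example.org"


def pods_for_job(job_id, namespace="spark-jobs"):
    # kube-state-metrics (with the annotation in metricAnnotationsAllowList)
    # exposes the hypothetical openeo.org/job_id pod annotation as the label
    # annotation_openeo_org_job_id on the kube_pod_annotations metric.
    promql = (
        f'kube_pod_annotations{{namespace="{namespace}", '
        f'annotation_openeo_org_job_id="{job_id}"}}'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return [series["metric"]["pod"] for series in resp.json()["data"]["result"]]
```

The returned pod names could then be fed into per-pod CPU/memory queries, or into kubecost's filterPods parameter.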
Hi @tcassaert
Any feedback from your side?
I personally don't have any experience with the sparkMeasure tool you linked above. From reading the readme, I'm not sure if it's targeted at Kubernetes deployments.
We're currently leveraging kubecost with a script that talks to its API to retrieve any cost metrics for a certain job:
```python
import requests


def get_total_cost(url, namespace, pod, window):
    # Query the kubecost allocation API, aggregated per namespace and
    # filtered down to the pods of a single job.
    params = (
        ('aggregate', 'namespace'),
        ('filterNamespaces', namespace),
        ('filterPods', pod),
        ('window', window),
        ('accumulate', 'true'),
    )
    total_cost = requests.get(url, params=params).json()
    print(total_cost['data'][0][namespace]['totalCost'])


def main():
    url = "http://kubecost.kube-dev.vgt.vito.be/model/allocation"
    namespace = "spark-jobs"
    pod = "job-06d808a4-jenkins-driver"
    window = "30d"
    get_total_cost(url, namespace, pod, window)


if __name__ == "__main__":
    main()
```
> From reading the readme, I'm not sure if it's targeted at Kubernetes deployments.
Apologies, I didn't dig deep enough then.
> We're currently leveraging kubecost with a script that talks to its API to retrieve any cost metrics for a certain job
Great!
In a previous meeting [1] @jdries wasn't sure whether CPU accounting was in place for openEO jobs. A bit of context: this is to profile the notebook of this repository, and compare performance across different deployments.
I believe that this issue can now be closed?
[1] https://confluence.egi.eu/display/CSCALE/2022-10-04+Aquamonitor+monthly+progress+meeting
@sebastian-luna-valero I think we now have the method written down, but it does depend on having this kubecost service available, and I don't think this is the case by default. @zbenta @maricaantonacci is it ok for you to set up this additional service? (Or to include it in the TOSCA template?)
Thanks, FYI: https://www.kubecost.com/pricing
Good morning everyone,
We are still waiting for someone to validate that our endpoint is working correctly. Since no one has validated it, we thought the best way was to try the Jupyter notebook that Jaap wrote, but without success. It looks like openEO needs ZooKeeper to work properly; we have deployed ZooKeeper and have found an issue with Cinder that we are trying to overcome. Only after solving that issue will we look into kubecost.
Best Regards, Zacarias Benta
@tcassaert can we check if using https://github.com/kubernetes/kube-state-metrics would work as an alternative to kubecost?
Hi,
@tcassaert could you please let us know your thoughts about @jdries' question above?
xref: https://github.com/c-scale-community/use-case-aquamonitor/issues/26#issuecomment-1471562477
@sebastian-luna-valero I will check it out and let you know if it can support our usecase.
@Jaapel could you describe what exactly you want to have as metrics?
I am looking to gather CPU and memory usage, as well as the total time spent on a job, to create a story like: "The algorithm was applied to area X, both at the INCD backend and at the VITO backend. The algorithm took X:XX hours to run, consuming X CPU hours and X GB of memory over that time, divided over X worker nodes." If you have a better suggestion on how to reflect job performance and report on it, I am open to suggestions of course!
Ideally we would also say something about how many GB of data was processed in the job, but I feel that is not critical for reporting.
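To make this concrete, a rough sketch of how such a report could be pulled from Prometheus, assuming the kubelet/cAdvisor metrics are scraped and the job's pods share the job id as a name prefix; the URL, prefix and window are placeholders:

```python
import requests

# Placeholders: adjust URL, namespace and pod prefix to your deployment.
PROMETHEUS_URL = "http://prometheus.example.org"


def query(promql):
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return resp.json()["data"]["result"]


def job_report(job_id, namespace="spark-jobs", window="30d"):
    selector = f'namespace="{namespace}", pod=~"{job_id}.*"'
    # Total CPU seconds consumed by all pods of the job over the window,
    # converted to CPU hours.
    cpu = query(f"sum(increase(container_cpu_usage_seconds_total{{{selector}}}[{window}]))")
    # Highest working-set memory observed for any single pod of the job.
    mem = query(f"max(max_over_time(container_memory_working_set_bytes{{{selector}}}[{window}]))")
    return {
        "cpu_hours": float(cpu[0]["value"][1]) / 3600 if cpu else 0.0,
        "peak_memory_gb": float(mem[0]["value"][1]) / 1e9 if mem else 0.0,
    }


if __name__ == "__main__":
    print(job_report("job-06d808a4"))
```

The total wall-clock time of the job would still have to come from the openEO job metadata (or from the pods' start/finish timestamps).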
@sebastian-luna-valero kube-state-metrics is not a tool to get specific metrics about running pods/jobs/... As the name already hints, it's more used to report the state of objects in Kubernetes, such as how many pods are currently ready to serve requests.
From my point of view, we have the most chance of getting accurate metrics with https://github.com/prometheus-operator/kube-prometheus and https://github.com/opencost/opencost, or a combination of both.
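For reference, a rough sketch of combining the two: opencost exports allocation and node price metrics to Prometheus, so an approximate hourly CPU cost for a job's pods can be computed from the same Prometheus instance. The metric names container_cpu_allocation and node_cpu_hourly_cost are my assumption about the opencost exporter and should be verified against the deployed version; the URL and pod prefix are placeholders:

```python
import requests

# Placeholder: adjust to the Prometheus instance that scrapes opencost.
PROMETHEUS_URL = "http://prometheus.example.org"


def approx_hourly_cpu_cost(job_id, namespace="spark-jobs"):
    # Allocated CPU cores of the job's pods (assumed metric:
    # container_cpu_allocation) priced at the average per-core-hour cost
    # reported by opencost (assumed metric: node_cpu_hourly_cost).
    promql = (
        f'sum(container_cpu_allocation{{namespace="{namespace}", pod=~"{job_id}.*"}})'
        f' * avg(node_cpu_hourly_cost)'
    )
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0
```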
thanks @tcassaert
@enolfc mentioned above a combination of kube-state-metrics and Prometheus, in order to annotate/account for usage. I am not familiar with this myself, sorry.
cc: @zbenta
Would something like this do the trick?
We've launched job 'j-83bb00fec3f748a795d19afd8a0babec' and got the CPU usage of each pod:
As well as the consumed memory:
That's really useful to see the current live usage of a pod, but I don't think you can get reporting out of it, as in: this job used so many cores and so much memory in total.
After more investigation, opencost itself is quite limited in usage. We'd need the kubecost distribution of opencost, but the free plan is a little limited regarding how long metrics are stored (15 days).
Looks like we can find most useful metrics in Prometheus, but I still have to create a really useful dashboard.
We set up opencost but are unable to access the UI. We created the ingress, but the URL returns a blank page. We also tried using the ClusterIP and curl to "see" something, but we get nothing interesting.
[centos@k8s-cscale-k8s-master-nf-1 opencost]$ curl 10.233.42.13:9090
<!DOCTYPE html><html><head><meta content="text/html;charset=utf-8" http-equiv="Content-Type"><meta content="utf-8" http-equiv="encoding"><link rel="icon" href="/favicon.7eff484d.ico"><link rel="stylesheet" href="/index.89ba2f7f.css"></head><body> <div id="app" class="page-container"></div> <script src="/index.2a66e5fc.js" type="module"></script><script src="/index.ad60b673.js" nomodule="" defer></script> </body></html>[centos@k8s-cscale-k8s-master-nf-1 opencost]$
We then proceeded to use opencost as another Prometheus metric exporter and downloaded this dashboard: https://grafana.com/grafana/dashboards/8670-cluster-cost-utilization-metrics/
Some parts of the dashboard seem to work, but we don't have a clue which information is relevant. Actually, we have no clue what information you need from our deployment.
Thanks @zbenta For context, see https://github.com/c-scale-community/use-case-aquamonitor/issues/30#issuecomment-1471679665
I believe that the information requested can be seen in the dashboard we presented at: https://github.com/c-scale-community/use-case-aquamonitor/issues/30#issuecomment-1473483523
You can see the total amount of CPU used and the memory a specific job consumed.
In order to compare the results of performance tests in Aquamonitor we need to measure CPU usage with openEO.
Let's use this issue to discuss how this can be done.