Mem_used of job-level - Githubissues

ClusterCockpit / cc-backend

Web frontend and API backend server for ClusterCockpit Monitoring Framework

https://www.clustercockpit.org

MIT License


Mem_used of job-level #66

Closed Autumn-Roy closed 1 year ago

Autumn-Roy commented 1 year ago

When I look up information about a job, for example mem_used, the plot shows node-level mem_used data, not job-level data. I then checked the demo: all plots of job 1334982 have the "node" label at the top left. So does cc-backend only display resource usage at node level, or am I using it the wrong way? In our HPC clusters, many jobs can run on one node, so more than one job may be using the memory and CPU. If I can only collect resource usage at node level, the results will be confusing. Is there any tool or method to collect resource usage at job level?

moebiusband73 commented 1 year ago

Hi, unfortunately the demo currently contains only node-exclusive jobs. But ClusterCockpit supports node sharing; in that case metrics are shown at core granularity. Of course, some metrics are simply not available in any scope other than node. But if a metric is available in core scope, it will also be shown in that scope. To prevent confusion: there must be a job scheduler in place that isolates the jobs from each other, e.g. Slurm using cpusets. Otherwise any measurement is meaningless.
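To illustrate the core-scope idea described above, here is a minimal Python sketch. It assumes a metric is sampled per core and that the scheduler has pinned each job to a disjoint set of cores. The function name, sample values, and core assignments are all hypothetical; this is not how cc-backend is implemented, only a picture of why core scope can be attributed to a job while node scope cannot.

```python
from statistics import mean


def job_metric(core_samples, job_cores, reduce=mean):
    """Aggregate per-core samples over the cores assigned to one job.

    core_samples: dict mapping core id -> sampled value (hypothetical data)
    job_cores:    set of core ids the scheduler assigned to the job
    reduce:       aggregation function, e.g. mean or sum
    """
    values = [core_samples[c] for c in sorted(job_cores) if c in core_samples]
    if not values:
        raise ValueError("no samples for the job's cores")
    return reduce(values)


# Hypothetical per-core load samples on one shared 8-core node.
core_load = {0: 1.0, 1: 1.0, 2: 0.5, 3: 0.5, 4: 0.0, 5: 0.0, 6: 0.25, 7: 0.25}

# Hypothetical cpuset assignments for two jobs sharing the node.
job_a_cores = {0, 1, 2, 3}
job_b_cores = {6, 7}

load_a = job_metric(core_load, job_a_cores)  # job A only
load_b = job_metric(core_load, job_b_cores)  # job B only
node_load = mean(core_load.values())         # whole node, mixes both jobs

print(load_a, load_b, node_load)
```

The node-level value mixes both jobs and the idle cores, which is why it cannot be attributed to either job. The per-job values are only meaningful if the jobs really are confined to their assigned cores.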

See the examples of jobs sharing a node with other jobs, in this case also using GPUs:

Autumn-Roy commented 1 year ago

@moebiusband73 Thank you for your reply. In the example, I saw that the cpu_load plots of the two jobs are the same. Is that because they share one node? By the way, when I set up cc-metric-collector in my HPC cluster, it can't send metrics like cpu_load, ipc, or flops. The only metrics cc-metric-store receives are the basic ones listed in each collector's README. Should I modify the code of each collector to get the metrics mentioned in your reply?

TomTheBear commented 1 year ago

As one of the authors of cc-metric-collector, I can address your questions.

The cpu_load for the node is provided by the loadavg component. For the load of a hardware thread, use the schedstat component. This component should be used for shared nodes. We already collect it in our center on the shared-usage systems, but it is not shown in the plots yet. You can probably rename cpu_load_core to cpu_load and get the load per node and per hardware thread under the same name (with different type tags), but we haven't tried that yet.

The ipc and flops metrics are measured using the likwid component. It is probably the component with the biggest configuration space but there are helper scripts.

If you want to manipulate a metric (rename, drop, add tags, del tags, ...), you can use the MetricRouter.
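As a rough picture of what such a rename does, here is a minimal Python sketch of the cpu_load_core to cpu_load idea mentioned above. It models a metric as a plain dict; the dict layout, the rename_metric helper, and the tag values are assumptions for illustration and are not the MetricRouter's actual configuration format or API.

```python
def rename_metric(metric, rename_map):
    """Return a copy of the metric, renamed if its name is in rename_map."""
    renamed = dict(metric)
    renamed["tags"] = dict(metric["tags"])  # copy so the input stays untouched
    renamed["name"] = rename_map.get(metric["name"], metric["name"])
    return renamed


# Hypothetical metrics: one node-level load and one per-hardware-thread load.
metrics = [
    {"name": "cpu_load", "tags": {"type": "node"}, "value": 3.5},
    {"name": "cpu_load_core", "tags": {"type": "hwthread", "type-id": "0"}, "value": 1.0},
]

# Rename the per-thread metric so both share one name.
routed = [rename_metric(m, {"cpu_load_core": "cpu_load"}) for m in metrics]

for m in routed:
    print(m["name"], m["tags"]["type"], m["value"])
```

After the rename both metrics are called cpu_load, and only the type tag tells the node value from the hardware-thread value, so a consumer has to filter on that tag.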

Here are the router.json and collectors.json for the Alex cluster at NHR@FAU shown in the screenshot.

Autumn-Roy commented 1 year ago

@TomTheBear Thanks! I didn't read the unit of cpu_load in the plot carefully until your reply. Now I see the unit is (load 1m).