ClusterCockpit / cc-backend

Web frontend and API backend server for ClusterCockpit Monitoring Framework
https://www.clustercockpit.org
MIT License

Core Metrics Footprint is taking the wrong values into account #246

Closed · oscarminus closed this issue 6 months ago

oscarminus commented 6 months ago

The tab "Core Metrics Footprint" at the job detail page isn't working for multi node jobs. For example a job using 1000 Cores over 8 nodes results in cpu_load "120.94 / 1000". So either the load of all nodes must be accumulated or the dividor must be the maximum cpu count of one node.

I would suggest the first option, to deal with uneven distribution of the tasks over several nodes.
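
For illustration, here is a minimal Go sketch contrasting the two options. This is not cc-backend's actual code; the per-node load samples are invented to match the numbers reported above (8 nodes, 1000 cores total):

```go
package main

import "fmt"

func main() {
	// Hypothetical per-node cpu_load samples for the 8-node example job.
	perNodeLoad := []float64{120.9, 121.2, 120.5, 121.0, 120.8, 121.1, 120.7, 121.3}
	totalCores := 1000.0
	coresPerNode := totalCores / float64(len(perNodeLoad)) // 125 cores per node

	sum := 0.0
	for _, l := range perNodeLoad {
		sum += l
	}
	avg := sum / float64(len(perNodeLoad))

	// Option 1 (suggested): accumulate the load of all nodes and
	// compare it against the job's total core count.
	fmt.Printf("sum: %6.2f / %.0f\n", sum, totalCores) // ~967.50 / 1000

	// Option 2: keep the per-node average, but divide by the core
	// count of a single node instead of the whole job.
	fmt.Printf("avg: %6.2f / %.0f\n", avg, coresPerNode) // ~120.94 / 125
}
```

Both options make the two sides of the footprint comparable; the first additionally exposes uneven task distribution, since the accumulated load drops when some nodes sit idle.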

spacehamster87 commented 6 months ago

I have implemented a new option, job_view_showFootprint, in config.json that allows disabling the footprint display in the job views of all users. It is available on the hotfix branch and will be merged later.
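
For reference, this would presumably be a single flag in config.json, e.g. `"job_view_showFootprint": false` (assuming a plain boolean option; the exact shape may differ on the hotfix branch).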

However, the underlying issue is not caused by the footprint component, but by the configuration of your ClusterCockpit instance: your thresholds appear to be defined as the sum over all allocated nodes, which yields the 1000 on the right-hand side of the footprint.

The displayed value, by contrast, is by default the average over all allocated nodes, and therefore maxes out at a single node's maximum of 120.

To get the requested result, change the aggregation for cpu_load from the default avg to sum in the cluster's cluster.json metricConfig section:

   "metricConfig": [
        {
            "name": "cpu_load",
            "unit": {
                "base": ""
            },
            "scope": "node",
            "aggregation": "sum",
            "timestep": 60,
            "peak": "<Your Threshold Peak>",
            "normal": "<Your Threshold Normal>",
            "caution": "<Your Threshold Caution>",
            "alert": "<Your Threshold Alert>"
        },
       [...]
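
With sum aggregation, the footprint of the example job above would then show the accumulated load of all eight nodes, roughly 967 / 1000 instead of 120.94 / 1000 (assuming the load is spread evenly across the nodes).
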
spacehamster87 commented 6 months ago

I changed the logic of the footprint data routine as follows: