ClusterCockpit / cc-backend

Web frontend and API backend server for ClusterCockpit Monitoring Framework
https://www.clustercockpit.org
MIT License

Core Metrics Footprint is taking the wrong values into account #246

Closed · oscarminus closed this issue 6 months ago

oscarminus commented 6 months ago

The tab "Core Metrics Footprint" at the job detail page isn't working for multi node jobs. For example a job using 1000 Cores over 8 nodes results in cpu_load "120.94 / 1000". So either the load of all nodes must be accumulated or the dividor must be the maximum cpu count of one node.

I would suggest the first option, to deal with uneven distribution of the tasks over several nodes.
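
For illustration, here is a minimal Go sketch contrasting the two options. This is not cc-backend's actual code; the per-node load samples are invented to match the numbers reported above (8 nodes, 1000 cores total):

```go
package main

import "fmt"

func main() {
	// Hypothetical per-node cpu_load samples for the 8-node example job.
	perNodeLoad := []float64{120.9, 121.2, 120.5, 121.0, 120.8, 121.1, 120.7, 121.3}
	totalCores := 1000.0
	coresPerNode := totalCores / float64(len(perNodeLoad)) // 125 cores per node

	sum := 0.0
	for _, l := range perNodeLoad {
		sum += l
	}
	avg := sum / float64(len(perNodeLoad))

	// Option 1 (suggested): accumulate the load of all nodes and
	// compare it against the job's total core count.
	fmt.Printf("sum: %6.2f / %.0f\n", sum, totalCores) // ~967.50 / 1000

	// Option 2: keep the per-node average, but divide by the core
	// count of a single node instead of the whole job.
	fmt.Printf("avg: %6.2f / %.0f\n", avg, coresPerNode) // ~120.94 / 125
}
```

Both options make the two sides of the footprint comparable; the first additionally exposes uneven task distribution, since the accumulated load drops when some nodes sit idle.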

spacehamster87 commented 6 months ago

I have implemented a new option, job_view_showFootprint, in config.json that allows disabling the footprint display in the job views of all users. It is available on the hotfix branch and will be merged later.
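
For reference, this would presumably be a single flag in config.json, e.g. `"job_view_showFootprint": false` (assuming a plain boolean option; the exact shape may differ on the hotfix branch).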

However, the underlying issue is not caused by the footprint component, but by the configuration of your ClusterCockpit instance: your thresholds appear to be defined as the sum over all allocated nodes, which yields the 1000 on the right-hand side of the footprint.

The displayed value, by contrast, is by default the average over all allocated nodes, and therefore maxes out at a single node's maximum of 120.

To get the requested result, change the aggregation for cpu_load from the default avg to sum in the cluster's cluster.json metricConfig section:

   "metricConfig": [
        {
            "name": "cpu_load",
            "unit": {
                "base": ""
            },
            "scope": "node",
            "aggregation": "sum",
            "timestep": 60,
            "peak": "<Your Threshold Peak>",
            "normal": "<Your Threshold Normal>",
            "caution": "<Your Threshold Caution>",
            "alert": "<Your Threshold Alert>"
        },
       [...]
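
With sum aggregation, the footprint of the example job above would then show the accumulated load of all eight nodes, roughly 967 / 1000 instead of 120.94 / 1000 (assuming the load is spread evenly across the nodes).
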
spacehamster87 commented 6 months ago

I changed the logic of the footprint data routine as follows: