Closed: oscarminus closed this issue 8 months ago.
I have implemented a new option `job_view_showFootprint` in the `config.json` to disable display of the footprint in the job views of all users. It is available on the hotfix branch and will be merged later.
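A minimal sketch of how that could look in `config.json` (the exact placement of the option within the file is an assumption):

```json
{
  "job_view_showFootprint": false
}
```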
However, the underlying issue is not caused by the footprint component, but rather by the use case / configuration of CC: it seems you have your thresholds defined as the sum over all allocated nodes, hence the 1000 on the right side of the footprint. The displayed value, by contrast, is by default the `avg` over all allocated nodes and thus maxes out at a single node's maximum of 120. To achieve the requested result, you need to change the aggregation for `cpu_load` from the default `avg` to `sum` in the cluster's `cluster.json`, in the `metricConfig` section:
"metricConfig": [
{
"name": "cpu_load",
"unit": {
"base": ""
},
"scope": "node",
"aggregation": "sum",
"timestep": 60,
"peak": "<Your Threshold Peak>",
"normal": "<Your Threshold Normal>",
"caution": "<Your Threshold Caution>",
"alert": "<Your Threshold Alert>"
},
[...]
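With `sum` aggregation, both the displayed value and the thresholds then refer to the job as a whole: e.g. a per-node peak of 125 across the 8 allocated nodes would yield the 1000 shown on the right side of the footprint.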
I changed the logic of the footprint data routine as follows:

- `aggregation == avg`: the configured per-node thresholds are used unchanged, as before.
- `aggregation == sum`: the configured per-node thresholds are scaled by the factor `job.hwthreads / hwthreads-by-conf`, i.e. the job's allocated hardware threads divided by one node's configured hardware threads.
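A minimal Go sketch of that distinction (function name and signature are illustrative assumptions, not the actual cc-backend routine):

```go
package main

import "fmt"

// scaleThreshold returns the threshold to show in the footprint.
// Illustrative only: names and placement are assumptions, not the
// real cc-backend footprint data routine.
func scaleThreshold(perNodeThreshold float64, aggregation string,
	jobHwthreads, hwthreadsByConf int) float64 {
	switch aggregation {
	case "avg":
		// Displayed value is the average over nodes, so the
		// per-node threshold applies unchanged.
		return perNodeThreshold
	case "sum":
		// Displayed value is the sum over nodes, so scale the
		// per-node threshold by the job's effective node count.
		return perNodeThreshold * float64(jobHwthreads) / float64(hwthreadsByConf)
	default:
		return perNodeThreshold
	}
}

func main() {
	// Example from the issue: per-node peak 125, a job with 1000
	// hwthreads on nodes with 125 hwthreads each -> displayed peak 1000.
	fmt.Println(scaleThreshold(125, "sum", 1000, 125))
}
```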
The tab "Core Metrics Footprint" on the job detail page isn't working for multi-node jobs. For example, a job using 1000 cores over 8 nodes results in a cpu_load of "120.94 / 1000". So either the load of all nodes must be accumulated, or the divisor must be the maximum CPU count of one node.
I would suggest the first option, to deal with uneven distribution of the tasks over several nodes.
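With the first option, the example above would read roughly 8 × 120.94 ≈ 967.5 / 1000 (assuming a roughly even load across the 8 nodes), which reflects the job's actual use of its 1000 allocated cores.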