ClusterCockpit / cc-backend

Web frontend and API backend server for ClusterCockpit Monitoring Framework
https://www.clustercockpit.org
MIT License

Job metric query forgets `mem_bw` #69

Open fodinabor opened 1 year ago

fodinabor commented 1 year ago

I have a weird issue where cc-backend neither provides nor archives mem_bw.

Consider the following query (used on the job list view and the single job view, afaict): /api/jobs/metrics/1685?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_power&metric=disk_free&metric=net_bytes_in&metric=net_bytes_out&metric=nfs4_total&metric=nfs3_total&scope=node&scope=core

This returns the dump linked here, from which mem_bw is obviously missing.

On the systems view, mem_bw is indeed shown, though. /query is called with

{
  "query": "query ($cluster: String!, $nodes: [String!], $from: Time!, $to: Time!) {\n  nodeMetrics(cluster: $cluster, nodes: $nodes, from: $from, to: $to) {\n    host\n    subCluster\n    metrics {\n      name\n      metric {\n        timestep\n        scope\n        series {\n          statistics {\n            min\n            avg\n            max\n          }\n          data\n        }\n      }\n    }\n  }\n}\n",
  "variables": {
    "cluster": "test",
    "nodes": [
      "thera"
    ],
    "from": "2022-11-24T07:59:17.196Z",
    "to": "2022-11-24T08:29:17.196Z"
  }
}
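
For readability, the escaped query string above corresponds to this GraphQL document:

query ($cluster: String!, $nodes: [String!], $from: Time!, $to: Time!) {
  nodeMetrics(cluster: $cluster, nodes: $nodes, from: $from, to: $to) {
    host
    subCluster
    metrics {
      name
      metric {
        timestep
        scope
        series {
          statistics {
            min
            avg
            max
          }
          data
        }
      }
    }
  }
}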

The returned JSON can be seen here.

Note that in an archived job from that machine, mem_bw is also missing; see here.

My cc-metric-store config contains:

"mem_bw":           { "frequency": 60, "aggregation": "sum" },

cluster.json:

        {
            "name": "mem_bw",
            "scope": "socket",
            "unit": "GB/s",
            "timestep": 60,
            "aggregation": "sum",
            "peak": 350,
            "normal": 100,
            "caution": 50,
            "alert": 10
        },
moebiusband73 commented 1 year ago

Which metric data backend do you use? The only idea I have is that mem_bw is already missing there, but you said it is shown in the systems view. I can only speculate: maybe the socket scope is missing? Is there anything in the log?

fodinabor commented 1 year ago

I use cc-metric-store. The collector is configured, for example, with the following, which I'd say should provide the socket scope:

"metrics": [
                    {
                        "calc": "1.0E-06*(MBOX0C0+MBOX1C0+MBOX2C0+MBOX3C0+MBOX4C0+MBOX5C0+MBOX6C0+MBOX7C0+MBOX0C1+MBOX1C1+MBOX2C1+MBOX3C1+MBOX4C1+MBOX5C1+MBOX6C1+MBOX7C1)*64.0/time",
                        "name": "mem_bw",
                        "publish": true,
                        "unit": "MB/s",
                        "scope": "socket"
                    }
                ]

Last time I checked, I didn't get any related errors; now I'm getting the following. For /api/jobs/metrics/1636 it complains about the missing data, while further down the systems-view query is again happy (except for the missing cpu_power, which I currently indeed do not collect on that node).

Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /monitoring/job/1636 (200, 1.25kb, 1ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /global.css (200, 0.48kb, 0ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /uPlot.min.css (200, 0.76kb, 0ms)
Nov 25 12:08:32 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/job.css (200, 0.32kb, 0ms)
Nov 25 12:08:33 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/job.js (200, 95.86kb, 156ms)
Nov 25 12:08:34 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 2.64kb, 1ms)
Nov 25 12:08:34 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [ERROR]   partial error: cc-metric-store: failed to fetch 'mem_bw' from host 'thera': metric or host not found, failed to fetch 'mem_bw' from host 'thera': metric or host not found, failed to fetch 'm>
Nov 25 12:08:35 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [INFO]    GET /api/jobs/metrics/1636?metric=flops_any&metric=mem_bw&metric=cpu_load&metric=cpu_user&metric=mem_used&metric=clock&metric=cpu_power&metric=acc_utilization&metric=acc_mem_used&metric=acc_>
Nov 25 12:08:50 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /img/logo.png (200, 15.67kb, 1ms)
Nov 25 12:08:51 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /favicon.png (200, 10.46kb, 0ms)
Nov 25 12:08:57 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [INFO]    map[analysis_view_histogramMetrics:[flops_any mem_bw acc_utilization] analysis_view_scatterPlotMetrics:[[flops_any mem_bw] [flops_any cpu_load] [cpu_load mem_bw]] job_view_nodestats_selected>
Nov 25 12:08:57 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /monitoring/node/test/thera (200, 1.26kb, 1ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /global.css (200, 0.48kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /uPlot.min.css (200, 0.76kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/node.css (200, 0.11kb, 0ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /build/node.js (200, 70.80kb, 163ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   GET /img/logo.png (200, 15.67kb, 1ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 2.39kb, 4ms)
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [ERROR]   partial error: cc-metric-store: fetching cpu_power for node thera failed: metric or host not found
Nov 25 12:08:58 hpc-monitoring.cs.uni-saarland.de cc-backend[40627]: [DEBUG]   POST /query (200, 1.11kb, 5ms)
spacehamster87 commented 1 year ago

Hi again @fodinabor ! I am investigating this issue at the moment, and I think I am onto something.

Could you please run your job-query as /api/jobs/metrics/1685?, i.e. without any query parameters, and then check for the mem_bw field?

The good news so far is that the issue does not seem to be connected to your cluster/metric configuration, as that would prevent the systems view from successfully requesting and displaying the data (at least that's my current insight).

fodinabor commented 1 year ago

Hi @spacehamster87, thanks for investigating! If I run that query without any parameters, I don't get mem_bw either, see the gist.

spacehamster87 commented 1 year ago

Thanks for the feedback! I should've found your gist post myself, actually ... It was worth a shot.

Afaict, the underlying query of the systems/status view and the direct jobs/metrics/{id} API both use GraphQL, but with slightly different methods in the backend, which might be the reason for the two different results.

Namely LoadNodeData() @ metricdata/metricdata.go:211 for systems/status and LoadData() @ metricdata/metricdata.go:78 for the jobs/metrics/{id}-API.

I'll dig some more.

spacehamster87 commented 1 year ago

After more digging, reproducing the error/case, and more logging in cc-metric-store, I think I have pinpointed the problem:

The job query requests mem_bw at a scope that does not match the granularity cc-metric-store holds for it, so no data is returned. This also happens when archiving starts and requests the latest data to write, which is why mem_bw will not end up in the archive either.

To verify this, please try the following:

1) Set the mem_bw granularity in cluster.json to node, restart cc-backend, then check the query result.
2) With the mem_bw granularity set to socket in cluster.json, query the API with /api/jobs/metrics/1685?&metric=mem_bw&scope=socket

The fact that cc-metric-store returns an error and no usable data if the requested granularity does not match should definitely be handled via a new issue in the respective repo.

fodinabor commented 1 year ago

Yep, setting it to the node level works for new jobs.

Since job 1685 is archived (and mem_bw was not archived for it), that query just returns empty data. For another, currently running job, with cluster.json's mem_bw scope set to socket, /api/jobs/metrics/8129?metric=mem_bw&scope=socket returns:

{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found, failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}

With cluster.json's mem_bw scope set to node the same query returns:

{"data":{"jobMetrics":[{"name":"mem_bw","metric":{"unit":"GB/s","scope":"node","timestep":60,"series":[{"hostname":"thor","statistics":{"min":0.60,"avg":0.71,"max":1.50},"data":[0.80,0.60,1.00,0.60,0.60,0.60,0.60,0.60,0.90,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,1.10,0.60,0.60,0.60,0.60,0.60,0.60,0.70,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.60,0.90,0.70,0.70,0.70,0.90,0.70,0.70,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.10,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.80,0.80,0.70,1.00,0.70,0.70,0.70,0.80,0.60,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,1.10,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.20,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.80,0.70,0.70,1.00,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,1.50,0.60,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,1.00,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.70,0.80,0.70,null,null,null,0.70,0.70,0.70,0.70,0.70,0.70,0.67,0.69,0.66,0.66,0.67,0.68,0.85,0.66,0.67,0.68,0.67,0.66,0.66,0.68,0.65]}],"statisticsSeries":null}}]},"error":null}

Might that be related to us not setting the hwthreads? I.e. /query's job/resources/0/hwthreads is null.

We don't set that since we (currently) do not pin jobs to threads. Alternatively, we of course could just set that to a list of all hwthreads, getting node granularity after all... 🤷🏼

Edit: interestingly, setting the mem_bw scope to socket and querying just the node scope with /api/jobs/metrics/8129?metric=mem_bw&scope=node also fails:

{"data":null,"error":{"message":"cc-metric-store: failed to fetch 'mem_bw' from host 'thor': metric or host not found"}}
spacehamster87 commented 1 year ago

Hi again! With this new information, the issue seems to be connected to your config after all: the topology configuration in cluster.json should resemble this example, especially regarding the arrays for node, socket, memoryDomain and hwthread. The latter is still mentioned as core, but was renamed a while back; the linked example seems out of date ...
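
For illustration, a minimal per-subCluster topology sketch for a small two-socket node (all thread IDs are made up, and the innermost field may be named core or hwthread depending on the schema version you run):

        "topology": {
            "node": [0, 1, 2, 3, 4, 5, 6, 7],
            "socket": [[0, 1, 2, 3], [4, 5, 6, 7]],
            "memoryDomain": [[0, 1, 2, 3], [4, 5, 6, 7]],
            "core": [[0], [1], [2], [3], [4], [5], [6], [7]]
        }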

So for now, I see the following options to solve this issue:

1) Use the node scope for mem_bw, which is more of a workaround than a solution.
2) Add hwthreads to the topology and re-check the configuration files of the whole stack. We are happy to have a look as well if you can provide your files.
3) Check which granularity is sent by the cc-metric-collector by directly querying the cc-metric-store API (see the sketch after this list).
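
To check option 3, you can POST a query directly to cc-metric-store's /api/query endpoint. The payload below is only a sketch; the exact field names (host, type, type-ids) and the Unix-seconds timestamps are assumptions, so please cross-check the cc-metric-store README for the authoritative format:

    {
      "cluster": "test",
      "from": 1669363157,
      "to": 1669364957,
      "queries": [
        { "metric": "mem_bw", "host": "thera", "type": "socket", "type-ids": [0, 1] }
      ]
    }

If this direct query already fails for type socket, the granularity mismatch sits between collector and store rather than in cc-backend.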

As for your edit: as soon as a scope smaller than node is configured for a metric, cc-backend will try to request that smaller scope and then calculate the "actually requested" scope from the returned data. This is probably why socket as a requested scope for mem_bw fails, as it requires hwthread data.
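
To make that concrete with made-up numbers: with "aggregation": "sum", two per-socket mem_bw series would be combined into the node-scope series by element-wise summation per timestep, e.g.:

    socket 0: [40.0, 42.5, 41.0]
    socket 1: [38.0, 39.5, 40.0]
    node:     [78.0, 82.0, 81.0]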

fodinabor commented 1 year ago

Some comments:

Re 2.: we have the topology in our cluster.json, but we don't set hwthreads when we /start and /stop the jobs. So I guess there are two options here: 1. achieve the same level of workaround as in option 1 by sending /start a list of [0, numthreads) as hwthreads (as sketched below), or 2. consider pinning the threads and just sending that info to CC.

Re 3.: the mem_bw metrics are collected on the socket level (for some AMD nodes the LIKWID group converter apparently set it to hwthread; I changed it to socket now). I double-checked that a few days ago.
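
A sketch of what that first workaround could look like in the start_job request body for an 8-thread node; only the resources part matters here, the other required fields of the REST API are omitted, and the exact field names are an assumption on my side:

    {
      "jobId": 8129,
      "cluster": "test",
      "startTime": 1669364957,
      "resources": [
        { "hostname": "thor", "hwthreads": [0, 1, 2, 3, 4, 5, 6, 7] }
      ]
    }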

We're still using core, but the schema also mentions core, not hwthread? https://github.com/ClusterCockpit/cc-backend/blob/master/pkg/schema/schemas/cluster.schema.json#L167

spacehamster87 commented 1 year ago

Hi @fodinabor,

Sorry that this issue has been stalled for some time now! As you've seen, we've been working hard to reach a solid release state.

With the recent 1.0.0 release and today's minor 1.1.0 update, I therefore wanted to ask whether the issue still persists, or whether you have found a solution on your side in the meantime.

fodinabor commented 1 year ago

Hi @spacehamster87, so far we are using mem_bw only at the node level... Were there any changes that might make it worth retesting with socket granularity?