ClusterCockpit / cc-backend

Web frontend and API backend server for ClusterCockpit Monitoring Framework
https://www.clustercockpit.org
MIT License

Archiving a job fails if individual metrics are missing in the metric store #267

Closed oscarminus closed 2 weeks ago

oscarminus commented 1 month ago

For some time we have noticed that a job is not archived properly if one of its metrics is not available in the metric store. In that case only the job's metadata is archived; all metrics, including those that do exist, remain empty.

We occasionally have nodes on which the lustre collector stops providing data. Jobs that ran on those nodes then cannot be archived.

[ERROR] /srv/cc-backend/internal/repository/job.go:534: archiving job (dbid: 4813845) failed: METRICDATA/CCMS > Errors: failed to fetch 'lustre_close' from host 'n2cn0593': metric or host not found, failed to fetch 'lustre_open' from ...

cc-backend version is 1.3.0

spacehamster87 commented 2 weeks ago

We fixed a blocking error return that prevented the data load for running jobs from being returned successfully as soon as one expected metric was missing from that data.

If the load function now reports missing metrics but the returned data is otherwise non-empty, those errors are written to the log as WARNING instead.

The error is now returned, as intended, only if the resulting data array is empty and errors were reported.