LLNL / lmt

Lustre Monitoring Tools
GNU General Public License v2.0

Correct error handling bugs #51

Closed ofaaland closed 4 years ago

ofaaland commented 4 years ago

(1) Remove err_exit() calls from libproc

Earlier commit
* f984897 handle lustre version file containing bare version

introduced a bug by adding err_exit() calls to
libproc/{mdt,ost,osc,router}.c call paths.

This caused the cerebro metric modules' _get_metric_value() calls to
return -1 (error) when the metrics were collected on a system where the
relevant Lustre subsystem was not running, for example a Lustre server
after boot but before its MDTs or OSTs are started.  This is incorrect:
a subsystem that is not running is a valid state, so the module can
return no metric, at least for string type metrics, and indicate success.

In addition, returning -1 triggers a bug in cerebrod, which causes cerebrod to exit.

Remove the calls, and update the test expected output to reflect the errors
reported now that the functions are running to completion.
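
The behavior described in (1) can be illustrated with a minimal sketch. This is not the lmt code: `read_mdt_stats()` and `get_metric_value()` are hypothetical stand-ins, and using `ENOENT` to mean "subsystem not running" is an assumption made only for this example. The point is that the getter reports success with an empty metric instead of exiting or returning -1.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a libproc reader (not the real lmt API).
 * Returns 0 on success, -1 with errno set on failure.
 * Assumption: ENOENT models "the subsystem is not running". */
int read_mdt_stats(int subsystem_running, char *buf, size_t len)
{
    if (!subsystem_running) {
        errno = ENOENT;
        return -1;
    }
    snprintf(buf, len, "mdt-stats");
    return 0;
}

/* Hypothetical metric getter. Returns 0 on success, -1 on a real error.
 * On success, *metric_len == 0 means "no metric to report". */
int get_metric_value(int subsystem_running, char *buf, size_t len,
                     size_t *metric_len)
{
    *metric_len = 0;
    if (len == 0) {
        errno = EINVAL;
        return -1;              /* genuine caller error */
    }
    buf[0] = '\0';
    if (read_mdt_stats(subsystem_running, buf, len) < 0) {
        if (errno == ENOENT)
            return 0;           /* valid state: no metric, still success */
        return -1;              /* any other failure is reported */
    }
    *metric_len = strlen(buf);
    return 0;
}
```

A caller would check the return value first and then treat a zero `*metric_len` as "nothing to publish", so the daemon never sees an error for a subsystem that is simply not started yet.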

(2) Report errno string for failure of _packed_lustre_version()

When _packed_lustre_version() returns failure, report the failure
including the errno string.
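
As a rough sketch of the idea in (2), the errno value can be saved right after the failing call and turned into text with `strerror()`. The names `parse_version()` and `format_version_error()` are hypothetical, and the parsing and packing shown here are illustrative only, not lmt's actual `_packed_lustre_version()` logic.

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for a version parser (not lmt's implementation).
 * Returns 0 on success, -1 with errno set on failure.
 * Assumption: NULL models a missing version file, and anything that is
 * not "major.minor.patch" models an invalid one. */
int parse_version(const char *text, int *packed)
{
    int major, minor, patch;

    if (text == NULL) {
        errno = ENOENT;
        return -1;
    }
    if (sscanf(text, "%d.%d.%d", &major, &minor, &patch) != 3) {
        errno = EINVAL;
        return -1;
    }
    /* Illustrative packing only. */
    *packed = (major << 16) | (minor << 8) | patch;
    return 0;
}

/* Writes "ok" or an error message that includes the errno string. */
int format_version_error(const char *text, char *msg, size_t len)
{
    int packed;
    int saved_errno;

    if (parse_version(text, &packed) == 0) {
        snprintf(msg, len, "ok");
        return 0;
    }
    saved_errno = errno;        /* save before any call can overwrite it */
    snprintf(msg, len, "lustre version: %s", strerror(saved_errno));
    return -1;
}
```

Saving errno immediately matters because later library calls may change it before the message is formatted.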

Add two trees with invalid/missing lustre version files, to ensure that
errors are reported with the correct error string.

ofaaland commented 4 years ago

Chris and Tony, sorry to bug you again, but the earlier commits introduced a bug and I'd like to get the fix into TOSS. Please take a look. Thanks.

ofaaland commented 4 years ago

I found a mistake already. I'm closing this PR and will open a new one when I've got it right.