LLNL / lmt

Lustre Monitoring Tools
GNU General Public License v2.0

Recovery status reported for any status other than COMPLETE #41

Closed ofaaland closed 4 years ago

ofaaland commented 4 years ago

In a multi-MDT system, under Lustre 2.10, all MDTs connect to each other via exports/imports. If one MDT cannot connect to one or more other MDTs, it will not service requests and will refuse connections from clients. There are other target states which may indicate action is required by an admin. These states are reflected in the "recovery_status" procfile exported by Lustre targets.

However, in LMT 3.2.7 and some earlier releases, ltop did not flag MDTs in such a state. It checked for "RECOV" in the status field, indicating recovery, but did not check for the strings corresponding to any other state.

According to lprocfs_recovery_status_seq_show() in Lustre 2.13, valid "status" values in recovery_status are (roughly):

COMPLETE             The target is active and handling requests
WAITING              The target is active but waiting for another MDT
WAITING_FOR_CLIENTS  The target is active but no clients have connected
RECOVERY             The target is active and recovering after failover
INACTIVE             The target is inactive
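As a rough illustration of how a monitoring tool might classify these values, here is a minimal sketch in C. It assumes the recovery_status text contains a line of the form "status: VALUE". The enum and the function name parse_recov_state are hypothetical names made up for this example; they are not LMT or Lustre identifiers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical enum for this sketch; not an LMT or Lustre type. */
typedef enum {
    RS_COMPLETE,
    RS_WAITING,
    RS_WAITING_FOR_CLIENTS,
    RS_RECOVERY,
    RS_INACTIVE,
    RS_UNKNOWN
} recov_state_t;

/*
 * Find the "status:" line in the recovery_status text and map its value
 * to a state. Exact string comparison keeps WAITING distinct from
 * WAITING_FOR_CLIENTS. Anything unrecognized maps to RS_UNKNOWN.
 */
static recov_state_t
parse_recov_state(const char *text)
{
    static const struct {
        const char *name;
        recov_state_t state;
    } table[] = {
        { "COMPLETE",            RS_COMPLETE },
        { "WAITING",             RS_WAITING },
        { "WAITING_FOR_CLIENTS", RS_WAITING_FOR_CLIENTS },
        { "RECOVERY",            RS_RECOVERY },
        { "INACTIVE",            RS_INACTIVE },
    };
    char word[32];
    const char *p = strstr(text, "status:");
    size_t i;

    if (p == NULL || sscanf(p, "status: %31s", word) != 1)
        return RS_UNKNOWN;
    for (i = 0; i < sizeof(table) / sizeof(table[0]); i++) {
        if (strcmp(word, table[i].name) == 0)
            return table[i].state;
    }
    return RS_UNKNOWN;
}
```

Treating unrecognized values as a separate RS_UNKNOWN state, rather than as COMPLETE, means that a status string added by a later Lustre release would still stand out.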

For individual targets, for all states other than COMPLETE, display the recov_status field instead of metric values. This makes it easier for the admin to see unhealthy targets.
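The per-target rule can be sketched roughly as follows. This is an illustration only, not the actual ltop code: the function names, the row layout, and the single numeric metric are assumptions, and the sketch assumes the state name is the first whitespace-delimited word of the recov_status string.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return nonzero if the first word of recov_status is exactly "COMPLETE".
 * Testing for COMPLETE, rather than for a "RECOV" substring, means that
 * WAITING, WAITING_FOR_CLIENTS, INACTIVE and any unknown value all count
 * as "not healthy".
 */
static int
target_is_complete(const char *recov_status)
{
    size_t n = strcspn(recov_status, " \t\n");

    return n == strlen("COMPLETE")
        && strncmp(recov_status, "COMPLETE", n) == 0;
}

/*
 * Format one display row (hypothetical layout): show the metric for a
 * healthy target, and the raw recov_status text for any other target.
 */
static void
format_target_row(char *buf, size_t len, const char *name,
                  const char *recov_status, double metric)
{
    if (target_is_complete(recov_status))
        snprintf(buf, len, "%-10s %10.1f", name, metric);
    else
        snprintf(buf, len, "%-10s %s", name, recov_status);
}
```

For example, a target named "MDT0001" with status "WAITING" would produce a row containing WAITING in place of the metric column.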

At the top of the window, report the lowest-numbered MDT that is neither COMPLETE nor INACTIVE. If an MDT is INACTIVE, an admin set it that way and likely already knows, but other states may be unexpected and should be brought to her attention.
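The selection rule for the summary line might look something like the sketch below. The struct mdt_info and the function first_mdt_to_report are hypothetical and exist only for this example. The sketch assumes each MDT carries a numeric index and a bare status word.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-MDT record for this sketch; not an LMT structure. */
struct mdt_info {
    int index;          /* MDT number, e.g. 2 for MDT0002 */
    const char *status; /* bare status word from recovery_status */
};

/*
 * Return the lowest-numbered MDT whose status is neither COMPLETE nor
 * INACTIVE, or NULL if there is none. The array need not be sorted.
 */
static const struct mdt_info *
first_mdt_to_report(const struct mdt_info *mdts, size_t n)
{
    const struct mdt_info *best = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        const char *s = mdts[i].status;

        /* Healthy, or deliberately deactivated by an admin: skip. */
        if (strcmp(s, "COMPLETE") == 0 || strcmp(s, "INACTIVE") == 0)
            continue;
        if (best == NULL || mdts[i].index < best->index)
            best = &mdts[i];
    }
    return best;
}
```

A NULL result means every MDT is either COMPLETE or INACTIVE, so nothing needs to be shown in the summary line.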

ofaaland commented 4 years ago

@morrone or @tonyhutter , can you take a look? Thanks!

ofaaland commented 4 years ago

Cherry-picked to master at 7c7266e