LLNL / lmt

Lustre Monitoring Tools
GNU General Public License v2.0

lmt 3.2.10 displays INACTIVE for most targets #53

Open LaHaine opened 4 years ago

LaHaine commented 4 years ago

I've upgraded from 3.2.7 to 3.2.10, and now all but one OST display "INACTIVE 0s remaining" instead of the current statistics. Only one OST went through recovery and has "status: COMPLETE" in its recovery_status file; all the others show INACTIVE, but they are mounted and working fine.

I'm running Lustre 2.10.8.

ofaaland commented 4 years ago

Those are the same LMT and Lustre versions we are running on my test system. For the OSS hosting the working OST, and for one of the others, please post the output of the following: (1) systemctl status cerebrod (2) lmtmetric -m ost

Thanks

LaHaine commented 4 years ago

Here's the requested output:

[miscoss14] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
   Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
   Active: active (running) since Do 2019-12-12 08:30:56 CET; 2min 53s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 124811 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
  Process: 124821 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cerebrod.service
           └─124830 /usr/sbin/cerebrod

Dez 12 08:30:56 miscoss14.example.com systemd[1]: Starting LSB: cerebrod ...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: Starting cerebrod...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: [  OK  ]
Dez 12 08:30:56 miscoss14.example.com systemd[1]: Started LSB: cerebrod s...
Dez 12 08:30:56 miscoss14.example.com cerebrod[124821]: MODULE DIR = /usr...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss14] /root # lmtmetric -m ost
ost: 2;miscoss14.example.com;0.929274;98.051369;fs23-OST0002;106652818;111858688;61314606412;113584425328;111942704533504;119326363060308;287094948;247;18067;0;4;41806;131;COMPLETE 115/115 0s remaining;
[miscoss13] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
   Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
   Active: active (running) since Do 2019-12-12 08:30:55 CET; 3min 54s ago
     Docs: man:systemd-sysv-generator(8)
  Process: 11451 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
  Process: 11462 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/cerebrod.service
           └─11471 /usr/sbin/cerebrod

Dez 12 08:30:55 miscoss13.example.com systemd[1]: Stopped LSB: cerebrod s...
Dez 12 08:30:55 miscoss13.example.com systemd[1]: Starting LSB: cerebrod ...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: Starting cerebrod:...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: [  OK  ]
Dez 12 08:30:55 miscoss13.example.com systemd[1]: Started LSB: cerebrod s...
Dez 12 08:30:55 miscoss13.example.com cerebrod[11462]: MODULE DIR = /usr/...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss13] /root # lmtmetric -m ost
ost: 2;miscoss13.example.com;0.936406;97.671312;fs23-OST0001;106654593;111858688;60841203828;113584425328;109235123597312;122235912992865;274506509;247;17829;0;6;44786;16;INACTIVE  0s remaining;
ofaaland commented 4 years ago

Hi. That "INACTIVE" is coming from the recovery_status file. On those two OSS nodes, please provide the contents of that file, like this:

$ find /proc/fs/lustre/ -name recovery_status | xargs cat 
status: COMPLETE
recovery_start: 1576042037
recovery_duration: 74
completed_clients: 124/124
replayed_requests: 0
last_transno: 1129576398848
VBR: DISABLED
IR: DISABLED
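
For reference, that recovery line is also what lmtmetric carries as the final populated field of its semicolon-delimited ost record (and what ltop ultimately displays). A quick sketch to pull it out on an OSS with a single OST, as in the outputs above: the trailing semicolon leaves an empty last field, hence NF-1, and on an OSS with several OSTs this would only print the last one.

$ lmtmetric -m ost | awk -F';' '{print $(NF-1)}'
COMPLETE 115/115 0s remaining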
LaHaine commented 4 years ago

Sure:

[miscoss13] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: INACTIVE
[miscoss14] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: COMPLETE
recovery_start: 1572261173
recovery_duration: 72
completed_clients: 115/115
replayed_requests: 6
last_transno: 12885434153
VBR: DISABLED
IR: ENABLED
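
The same files can also be read through lctl, which is the portable route since these entries can move out of /proc between Lustre versions:

$ lctl get_param obdfilter.*.recovery_status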
ofaaland commented 4 years ago

It looks to me like that means 0 clients have connected to fs23-OST0001.

Can you check one of your Lustre client nodes with "lfs check osts" and see how those same two OSTs appear? I suspect fs23-OST0002 will report "active" and fs23-OST0001 will either be missing or report "inactive".

If they both say "active", please post the output of the following from both OSS nodes so we can compare those two OSTs:

find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done

Thanks

LaHaine commented 4 years ago

All OSTs appear just fine on the clients.

Here's the output of your command on the OSS:

[miscoss13] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0001/num_exports 247
[miscoss14] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0002/num_exports 247
ofaaland commented 4 years ago

Are the servers and clients both running Lustre 2.10.8?

LaHaine commented 4 years ago

I think there was a single 2.12.3 client; all the others are 2.10.8.

ofaaland commented 4 years ago

Have these targets (MDTs and OSTs, on the server nodes) ever, in their lifetime, been unmounted and then remounted?

I just created a new Lustre 2.12.4 file system from scratch and I observe the same behavior you describe: after the targets have been mounted for the first time, the recovery_status file just says "status: INACTIVE". After unmounting and mounting again, the recovery_status files have the expected content.
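
If that is what happened here, remounting the affected target should repopulate the file. Roughly, on the OSS (the device path and mount point below are placeholders, not taken from this system; note that unmounting an OST opens a recovery window for its clients):

$ umount /mnt/ost0001
$ mount -t lustre /dev/mapper/ost0001 /mnt/ost0001
$ cat /proc/fs/lustre/obdfilter/fs23-OST0001/recovery_status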

LaHaine commented 4 years ago

I can't say for sure, but I guess they have been mounted several times already.

defaziogiancarlo commented 2 years ago

There is a related (and possibly the same) issue at https://jira.whamcloud.com/browse/LU-14930