LaHaine opened this issue 4 years ago
That's the same LMT and Lustre version we are running on my test system. For the working OST and one of the other ones, please post the output of the following: (1) systemctl status cerebrod (2) lmtmetric -m ost
Thanks
Here's the requested output:
[miscoss14] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
Active: active (running) since Thu 2019-12-12 08:30:56 CET; 2min 53s ago
Docs: man:systemd-sysv-generator(8)
Process: 124811 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
Process: 124821 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/cerebrod.service
└─124830 /usr/sbin/cerebrod
Dec 12 08:30:56 miscoss14.example.com systemd[1]: Starting LSB: cerebrod ...
Dec 12 08:30:56 miscoss14.example.com cerebrod[124821]: Starting cerebrod...
Dec 12 08:30:56 miscoss14.example.com cerebrod[124821]: [ OK ]
Dec 12 08:30:56 miscoss14.example.com systemd[1]: Started LSB: cerebrod s...
Dec 12 08:30:56 miscoss14.example.com cerebrod[124821]: MODULE DIR = /usr...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss14] /root # lmtmetric -m ost
ost: 2;miscoss14.example.com;0.929274;98.051369;fs23-OST0002;106652818;111858688;61314606412;113584425328;111942704533504;119326363060308;287094948;247;18067;0;4;41806;131;COMPLETE 115/115 0s remaining;
[miscoss13] /root # systemctl status cerebrod
● cerebrod.service - LSB: cerebrod startup script
Loaded: loaded (/etc/rc.d/init.d/cerebrod; bad; vendor preset: disabled)
Active: active (running) since Thu 2019-12-12 08:30:55 CET; 3min 54s ago
Docs: man:systemd-sysv-generator(8)
Process: 11451 ExecStop=/etc/rc.d/init.d/cerebrod stop (code=exited, status=0/SUCCESS)
Process: 11462 ExecStart=/etc/rc.d/init.d/cerebrod start (code=exited, status=0/SUCCESS)
CGroup: /system.slice/cerebrod.service
└─11471 /usr/sbin/cerebrod
Dec 12 08:30:55 miscoss13.example.com systemd[1]: Stopped LSB: cerebrod s...
Dec 12 08:30:55 miscoss13.example.com systemd[1]: Starting LSB: cerebrod ...
Dec 12 08:30:55 miscoss13.example.com cerebrod[11462]: Starting cerebrod:...
Dec 12 08:30:55 miscoss13.example.com cerebrod[11462]: [ OK ]
Dec 12 08:30:55 miscoss13.example.com systemd[1]: Started LSB: cerebrod s...
Dec 12 08:30:55 miscoss13.example.com cerebrod[11462]: MODULE DIR = /usr/...
Hint: Some lines were ellipsized, use -l to show in full.
[miscoss13] /root # lmtmetric -m ost
ost: 2;miscoss13.example.com;0.936406;97.671312;fs23-OST0001;106654593;111858688;60841203828;113584425328;109235123597312;122235912992865;274506509;247;17829;0;6;44786;16;INACTIVE 0s remaining;
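(For reference when reading these lines: the lmtmetric "ost" record is semicolon-delimited and ends with a trailing ';', so the recovery-status text lands in the second-to-last awk field. A hedged sketch, demonstrated on a shortened sample line since lmtmetric itself only runs on an OSS node:)

```shell
# Sketch: extract the recovery-status field from an lmtmetric "ost" line.
# The trailing ';' creates an empty final field, so the status is $(NF-1).
# On a real OSS you would pipe `lmtmetric -m ost` instead of this sample.
printf 'ost: 2;miscoss13.example.com;0.93;97.67;fs23-OST0001;247;INACTIVE 0s remaining;\n' \
  | awk -F';' '{print $(NF-1)}'
# prints: INACTIVE 0s remaining
```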
Hi, so that "INACTIVE" is coming from the recovery_status file. On those two OSS nodes, please post the contents of that file, like this:
$ find /proc/fs/lustre/ -name recovery_status | xargs cat
status: COMPLETE
recovery_start: 1576042037
recovery_duration: 74
completed_clients: 124/124
replayed_requests: 0
last_transno: 1129576398848
VBR: DISABLED
IR: DISABLED
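If a node has more than one target, a variant that keeps each file's name next to its contents may be easier to read than a bare `xargs cat`, which concatenates everything. A sketch, assuming the same /proc/fs/lustre layout as above:

```shell
# Sketch: print each recovery_status file with a header naming the file,
# so per-target output is not run together. The default root is the
# /proc/fs/lustre tree from this thread; an argument overrides it.
root="${1:-/proc/fs/lustre}"
find "$root" -name recovery_status 2>/dev/null | while read -r f; do
    echo "== $f =="
    cat "$f"
done
```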
Sure:
[miscoss13] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: INACTIVE
[miscoss14] /root # find /proc/fs/lustre/ -name recovery_status | xargs cat
status: COMPLETE
recovery_start: 1572261173
recovery_duration: 72
completed_clients: 115/115
replayed_requests: 6
last_transno: 12885434153
VBR: DISABLED
IR: ENABLED
It looks to me like that means 0 clients have connected to fs23-OST0001.
Can you check one of your Lustre client nodes with "lfs check osts" and see whether those same two OSTs appear? I suspect fs23-OST0002 will report "active" and fs23-OST0001 will either be missing or report "inactive".
If they both say "active", please compare the output of the following for those two OSTs:
find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
Thanks
All OSTs appear just fine on the clients.
Here's the output of your command on the OSS:
[miscoss13] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0001/num_exports 247
[miscoss14] /root # find /proc/fs/lustre/obdfilter/ -name num_exports | while read fname; do echo $fname $(cat $fname); done
/proc/fs/lustre/obdfilter/fs23-OST0002/num_exports 247
Are the servers and clients both Lustre 2.10.8?
I think there was a single 2.12.3 client, all others 2.10.8.
Have these targets (MDTs and OSTs, on the server nodes) ever, in their lifetime, been un-mounted and then re-mounted?
I just created a new Lustre 2.12.4 file system from scratch and observe the same behavior you describe: after the targets have been mounted for the first time, the recovery_status file just says "status: INACTIVE". After unmounting and mounting again, the recovery_status files have the expected content.
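A quick way to spot targets that are still in that first-mount state is to test the status field directly. A sketch; the default path below matches the outputs earlier in this thread but should be treated as an assumption about your layout:

```shell
# Sketch: report whether a given recovery_status file shows a completed
# recovery, or the INACTIVE state seen after a target's first-ever mount.
# The default path is taken from the outputs in this thread (assumption).
f="${1:-/proc/fs/lustre/obdfilter/fs23-OST0001/recovery_status}"
st=$(awk '/^status:/ {print $2}' "$f" 2>/dev/null)
if [ "$st" = "INACTIVE" ]; then
    echo "$f: INACTIVE, no recovery has run since this mount"
else
    echo "$f: status is '$st'"
fi
```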
I can't say for sure, but I guess they have been mounted several times already.
There is a related (and possibly the same) issue at https://jira.whamcloud.com/browse/LU-14930
I've upgraded from 3.2.7 to 3.2.10, and now all but one OST display the message "INACTIVE 0s remaining" instead of the current statistics. Only one OST went through recovery and has "status: COMPLETE" in recovery_status; all the others have INACTIVE, but they are mounted and working fine.
I'm running Lustre 2.10.8.