Open calmh opened 11 years ago
Does this still happen after the change in 5419d0d6757563f5e4e88e6e04e61a89cc54d096? That change will make sysinfo -f -p take about 1/12 as long (and invoke zpool only a few times instead of dozens). So even if zpool(1M) runs slowly during a scrub, it should still be possible to complete that command in the allotted time.
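To illustrate the general idea of invoking a slow tool a few times instead of dozens, here is a minimal Python sketch. It is not the code from that commit; `run_once`, `stats`, and `SLOW_CMD` are hypothetical names, and the command is a stand-in rather than a real zpool invocation.

```python
import subprocess
import sys
from functools import lru_cache

# Counts real process launches, for demonstration only.
stats = {"launches": 0}

@lru_cache(maxsize=None)
def run_once(argv):
    """Run argv (a tuple of strings) and cache its stdout per argument list."""
    stats["launches"] += 1
    result = subprocess.run(list(argv), capture_output=True, text=True, check=True)
    return result.stdout

# Stand-in for a slow external tool (assumption: prints one pool name per line).
SLOW_CMD = (sys.executable, "-c", "print('pool-a'); print('pool-b')")

def pool_names():
    # First caller pays for the process launch.
    return run_once(SLOW_CMD).split()

def pool_count():
    # Later callers reuse the cached output instead of launching again.
    return len(run_once(SLOW_CMD).split())
```

Calling `pool_names()` and then `pool_count()` launches the stand-in command only once, which is the effect that matters when each launch is slow during a scrub.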
Haven't tried, but it sure looks much more sane that way. :) Close and let me or someone reopen if it doesn't...
I noticed exactly the same thing yesterday. The machine was on a monthly scrub schedule, scrubbing at 6 MB/s and nearly 40% complete. Things became flaky, including stuck vmadm and zoneadm processes that would never halt the VM, and I had to reboot the GZ. I did not have enough time to troubleshoot fully. The boot loader stopped at the "mem" checks. I had to fully power down the machine, wait a few minutes and start it up.
No hardware errors in the logs, other than one failed cache device in fmdump -eV or iostat -E. But that SSD started up showing as OK in cfgadm after a hard power reset.
Really odd. My only guess is, the memory pressure caused the flaky hardware to go nuts during the scrub. Reboot halted at mem related messages so possibly some bug in illumos somewhere. Who knows. :wink:
I was setting up a new system a few days ago, and it failed to come up just as calmh described, with system/identity:node going into maintenance due to a timeout of sysinfo -f -p.
In my case, I traced the problem to the disklist -n run by sysinfo -f -p. The system is a backup server with 26 HGST Deskstar NAS 4 TB SATA drives in a SC847 chassis, connected through the Supermicro SAS expanders to an LSI 9211-8i. (Yes, SATA over SAS expanders, sorry about that.)
The problem was, I had configured the server using a two-drive mirror, shut it down and installed the other 24 drives, which were new and unformatted. Apparently, disklist takes many seconds to look at each unformatted drive it finds. (I presume it’s scanning the disk to make sure it’s not part of a damaged zpool.)
Running disklist -n with 24 unformatted drives in my configuration takes several minutes, causing sysinfo -f -p to time out.
To recover, I pulled the empty drives, rebooted, and hot-plugged them. Better error-reporting would be helpful, considering that there appear to be multiple scenarios that can cause a timeout in sysinfo.
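As a rough sketch of what clearer reporting could look like, the Python below runs each step under its own timeout and names the step that failed. This is not how sysinfo works; `run_step` is a hypothetical helper, and the commands in the usage note are placeholders.

```python
import subprocess
import sys
import time

def run_step(name, argv, timeout_s):
    """Run one step; return (ok, message) naming the step and its outcome."""
    start = time.monotonic()
    try:
        # subprocess.run kills the child and raises TimeoutExpired on timeout.
        subprocess.run(argv, capture_output=True, text=True,
                       timeout=timeout_s, check=True)
    except subprocess.TimeoutExpired:
        return False, f"{name}: timed out after {timeout_s}s"
    except subprocess.CalledProcessError as err:
        return False, f"{name}: exited with status {err.returncode}"
    elapsed = time.monotonic() - start
    return True, f"{name}: ok in {elapsed:.2f}s"
```

For example, `run_step("disk scan", [sys.executable, "-c", "import time; time.sleep(5)"], 0.5)` returns a failure message that names "disk scan", so a timeout can be traced to a specific step instead of the whole run.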
I happened to notice today that my system did not come back up when rebooted during an ongoing scrub. The cause seems to be that:
Hence the system/identity:node goes into maintenance and the bunch of services that depend on it (most of everything) don't start.
This might be particular to my system, which has a crappy disk setup, and you wouldn't normally reboot during a scrub, so it's pretty easy to work around. So maybe medium severity at worst... :)
For reference, the zones pool looks like this
with the pool disks being WD RE4s on an LSI HBA and the logs/cache a pair of Intel SSDs.