TritonDataCenter / smartos-live

For more information, please see http://smartos.org/ For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

System does not boot if rebooted during scrub #220

Open calmh opened 11 years ago

calmh commented 11 years ago

I happened to notice today that my system did not come back up if rebooted during an ongoing scrub. The cause seems to be that;

Hence the system/identity:node goes into maintenance and the bunch of services that depend on it (most of everything) don't start.

This might be a peculiarity of my system, which has a crappy disk setup, and you wouldn't normally reboot during a scrub, so it's pretty easy to work around. So maybe medium severity at worst... :)

For reference, the zones pool looks like this

    NAME                       STATE     READ WRITE CKSUM
    zones                      ONLINE       0     0     0
      mirror-0                 ONLINE       0     0     0
        c0t50014EE205CF84F8d0  ONLINE       0     0     0
        c0t50014EE25B7FC628d0  ONLINE       0     0     0
      mirror-1                 ONLINE       0     0     0
        c0t50014EE205CF7101d0  ONLINE       0     0     0
        c0t50014EE206561484d0  ONLINE       0     0     0
      mirror-2                 ONLINE       0     0     0
        c0t50014EE2B0D583F5d0  ONLINE       0     0     0
        c0t50014EE25B7CCB38d0  ONLINE       0     0     0
      mirror-3                 ONLINE       0     0     0
        c0t50014EE207EAD09Fd0  ONLINE       0     0     0
        c0t50014EE2B295B3F8d0  ONLINE       0     0     0
    logs
      c2t1d0p1                 ONLINE       0     0     0
      c2t2d0p1                 ONLINE       0     0     0
    cache
      c2t1d0p2                 ONLINE       0     0     0
      c2t2d0p2                 ONLINE       0     0     0

with the pool disks being WD RE4s on an LSI HBA and the logs/cache on a pair of Intel SSDs.

ghost commented 10 years ago

Does this still happen after the change in 5419d0d6757563f5e4e88e6e04e61a89cc54d096? That change will make sysinfo -f -p take about 1/12 as long (and invoke zpool only a few times instead of dozens). So even if zpool(1M) runs slowly during a scrub, it should still be possible to complete that command in the allotted time.
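One way to check this on an affected machine is to time the command against a deadline. The following is a minimal Python sketch, not part of SmartOS. The helper name `run_with_deadline` is hypothetical, and any deadline passed to it is a placeholder, since the actual timeout applied to `sysinfo -f -p` at boot is not stated in this thread.

```python
import subprocess
import time


def run_with_deadline(argv, deadline_s):
    """Run argv and return (finished, elapsed_seconds).

    finished is False when the command was killed because it ran longer
    than deadline_s. The exit status of the command is ignored on purpose:
    this helper only answers "did it complete in time?".
    """
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=deadline_s,  # the child is killed once this expires
            check=False,
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start
```

Calling `run_with_deadline(["sysinfo", "-f", "-p"], 60)` once while a scrub is running and once while the pool is idle would show whether the scrub alone pushes the command past a given budget (60 seconds here is an assumed value, not the real one).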

calmh commented 10 years ago

Haven't tried, but it sure looks much more sane that way. :) Close and let me or someone reopen if it doesn't...

bassu commented 10 years ago

I noticed exactly the same thing yesterday. The machine was on a monthly schedule, doing a 6 MB/s scrub at nearly 40% complete. Things became flaky: vmadm and zoneadm got stuck and would never halt the VM, so I had to reboot the GZ. I did not have enough time to troubleshoot fully. The boot loader stopped at the "mem" checks. I had to power the machine down completely, wait a few minutes, and start it up again.

No hardware errors in logs, other than one failed cache device in fmdump -eV or iostat -E. But that SSD started up showing as ok in cfgadm after a hard power reset.

Really odd. My only guess is, the memory pressure caused the flaky hardware to go nuts during the scrub. Reboot halted at mem related messages so possibly some bug in illumos somewhere. Who knows. :wink:

ferebee commented 10 years ago

I was setting up a new system a few days ago, and it failed to come up just as calmh described, with system/identity:node going into maintenance due to a timeout of sysinfo -f -p.

In my case, I traced the problem to the disklist -n run by sysinfo -f -p. The system is a backup server with 26 HGST Deskstar NAS 4 TB SATA drives in a SC847 chassis, connected through the Supermicro SAS expanders to an LSI 9211-8i. (Yes, SATA over SAS expanders, sorry about that.)

The problem was, I had configured the server using a two-drive mirror, shut it down and installed the other 24 drives, which were new and unformatted. Apparently, disklist takes many seconds to look at each unformatted drive it finds. (I presume it’s scanning the disk to make sure it’s not part of a damaged zpool.)

Running disklist -n with 24 unformatted drives in my configuration takes several minutes, causing sysinfo -f -p to time out.

To recover, I pulled the empty drives, rebooted, and hot-plugged them. Better error-reporting would be helpful, considering that there appear to be multiple scenarios that can cause a timeout in sysinfo.
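As a rough illustration of the kind of error reporting meant here, the sketch below runs a list of labelled commands under one shared time budget and records which step consumed it. This is an assumption-laden example, not how sysinfo is implemented: the helper name `time_steps`, the status strings, and the budget are all hypothetical.

```python
import subprocess
import time


def time_steps(steps, budget_s):
    """Run each (label, argv) step in order under one shared time budget.

    Returns a list of (label, elapsed_seconds, status) tuples, where status
    is "ok", "failed" (non-zero exit), "timeout" (killed when the remaining
    budget ran out), or "skipped" (no budget was left to start the step).
    """
    report = []
    remaining = budget_s
    for label, argv in steps:
        if remaining <= 0:
            report.append((label, 0.0, "skipped"))
            continue
        start = time.monotonic()
        try:
            proc = subprocess.run(
                argv,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=remaining,  # each step may use only what is left
                check=False,
            )
            status = "ok" if proc.returncode == 0 else "failed"
        except subprocess.TimeoutExpired:
            status = "timeout"
        elapsed = time.monotonic() - start
        remaining -= elapsed
        report.append((label, elapsed, status))
    return report
```

For example, `time_steps([("disklist", ["disklist", "-n"]), ("zpool", ["zpool", "list"])], 60)` would return a report showing whether the disk enumeration or the pool query used up the budget. The `zpool list` step and the 60-second budget are placeholders; this thread does not list the exact sub-commands sysinfo runs beyond `disklist -n`.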

ghost commented 10 years ago

The problem with unlabeled/mislabeled disks is tracked under OS-1952.