Removing a ZFS pool causes system halt

GoogleCodeExporter commented 9 years ago

If you create a ZFS pool on a USB disk and then yank the disk before it is 
unmounted, the OS 
kernel panics.

Original issue reported on code.google.com by alex.ble...@gmail.com on 24 Oct 2009 at 9:14

GoogleCodeExporter commented 9 years ago

Original comment by alex.ble...@gmail.com on 24 Oct 2009 at 9:24

Added labels: 10.5, 10.6

GoogleCodeExporter commented 9 years ago

Original comment by alex.ble...@gmail.com on 24 Oct 2009 at 9:28

GoogleCodeExporter commented 9 years ago

To clarify, even if you unmount the drive via Finder first and then remove the 
physical drive, the system still 
kernel panics? So the problem is not the failure to unmount first, 
but rather that the pool stays imported 
until it is exported via a 'zpool export' command. Semantics, but just making sure we 
are all thinking of the same 
issue.

What do we want to define as "expected behavior"?

The promise of ZFS is that, in theory, the on-disk state is always consistent, but a kernel 
panic is really bad in all but the 
most extreme cases. Users really don't like it when they are just 
trying to get something done and 
something that has always worked, like just ejecting a disk, causes 
a kernel panic. And although I 
understand the intent of the panic is to not cause further harm to the 
filesystem, I think we can find a way to 
determine if the USB drive was truly removed (via IOKit, perhaps?) or if 
there actually is a hardware failure, and 
then decide which response is appropriate.

This does appear to be something that we should be able to tackle before/while 
moving to new ZFS code as this 
mostly appears to be a Mac OS X interaction issue. My thinking is that an eject 
command *should* be able to 
trigger a zpool export. At a minimum I'm thinking we should try to honor the 
failmode property set on a zpool 
by the user.
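
To make the failmode idea concrete, here is a rough sketch of the kind of dispatch I have in mind. It assumes the three values the upstream zpool failmode property documents (wait, continue, panic). The struct, enum, and function names are made up for illustration and are not MacZFS or upstream identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names, for illustration only (not MacZFS identifiers). */
enum pool_failmode { FAILMODE_WAIT, FAILMODE_CONTINUE, FAILMODE_PANIC };

enum pool_action {
    ACTION_SUSPEND_IO,  /* block I/O until the device comes back */
    ACTION_RETURN_EIO,  /* fail the request, keep the system running */
    ACTION_PANIC        /* halt the machine, as the code does today */
};

/* Minimal stand-in for a pool; real pools carry far more state. */
struct pool {
    const char *name;
    enum pool_failmode failmode;
};

/* Map the failmode the user set on the pool to a response to a failure. */
enum pool_action
pool_failure_action(const struct pool *p)
{
    assert(p != NULL);

    switch (p->failmode) {
    case FAILMODE_WAIT:
        return ACTION_SUSPEND_IO;
    case FAILMODE_CONTINUE:
        return ACTION_RETURN_EIO;
    case FAILMODE_PANIC:
        return ACTION_PANIC;
    }
    /* Unknown value: take the least destructive path. */
    return ACTION_SUSPEND_IO;
}
```

With something like this in place, a yanked USB disk would only halt the machine if the user had explicitly asked for failmode=panic.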

Note, all usage of the word "we" above does mean I'm also looking at ways to 
fix the issue and write the needed 
code, just looking for feedback and/or other ideas. ;)

Original comment by jason.richard.mcneil on 29 Oct 2009 at 5:16

GoogleCodeExporter commented 9 years ago

There are really a few related things. Firstly, r72 didn't have a failmode 
property for the pool - I think it got added 
later. Prior to that, the effect of a pool failure was to do a halt. This 
might make sense in an 
environment where there are only ZFS pools, but in a mixed-mode system (or one 
with network storage) 
it's conceivable that a network share could be used to save the contents of dirty editors etc. 

The other problem is the automounter in OSX. When you plug in a USB drive, it 
gets automounted. If it's 
HFS (or ext etc) and you then yank it, you get a message saying you're a naughty 
person, don't do it again. If 
it's a ZFS drive it mounts, but when you yank it, it kills the system. 

Rolling forward to a newer version of zfs would give us the failmode but we may 
be able to grep for halt() 
in the meantime.

Original comment by alex.ble...@gmail.com on 29 Oct 2009 at 10:07

GoogleCodeExporter commented 9 years ago

Is there a reason to wait for the failmode property in 77? It seems as though 
this is a major issue that will scare people away from ZFS, along the lines of 
Trash support. Would it be possible to just block that IO thread until the pool 
reappears?

Does 74 have the same behavior if an internal SATA cable fails or comes loose? 
In cases of a single or multiple drive/pool failure when the system drive is 
still good, I'd think that the most commonly desired behavior would be to allow 
the application / user to recover and save any work in progress to an alternate 
location.

I don't really understand why a disappearing pool is any different from a power 
outage as far as that specific pool is concerned. Why not just return errors 
for the IO operations and then recover (possibly via a forced zpool import) 
when the pool is available again (with the caveat of fsync ordering problems on USB 
drives), the same way one would do so after a kernel panic or a power outage? 
Why have the kernel panic?

After putting my questions in writing, I think Alex answered them in Comment 4 
already. This panic is found in MacZFS/usr/src/uts/common/fs/zfs/zio.c near 
line 918. From AlBlue's repository with the Trash fix I found the following. 
How could we exit here without panicking? Is there a good way to set the pool 
status as "crashed" here and just return with an error? 

The strange thing about this code is that the CANFAIL flag is passed and 
checked, but the comments indicate that this means we cannot fail? This isn't a 
"simple" bug with a fix we might be able to backport easily, is it?

    /*
     * For I/O requests that cannot fail, panic appropriately.
     */
    if (!(zio->io_flags & ZIO_FLAG_CANFAIL)) {
        char *blkbuf;

        blkbuf = kmem_alloc(BP_SPRINTF_LEN, KM_NOSLEEP);
        if (blkbuf) {
            sprintf_blkptr(blkbuf, BP_SPRINTF_LEN,
                bp ? bp : &zio->io_bp_copy);
        }
        panic("ZFS: %s (%s on %s off %llx: zio %p %s): error "
            "%d", zio->io_error == ECKSUM ?
            "bad checksum" : "I/O failure",
            zio_type_name[zio->io_type],
            vdev_description(vd),
            (u_longlong_t)zio->io_offset,
            zio, blkbuf ? blkbuf : "", zio->io_error);
    }
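
To picture what "set the pool status as crashed and just return with an error" might mean, here is a stand-alone model rather than a patch. struct pool_state, io_done() and io_start() are invented for this sketch and do not exist in MacZFS. Real code would also have to deal with in-flight transactions, which this ignores.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Invented for this sketch; not a MacZFS structure. */
struct pool_state {
    bool suspended;   /* set once a must-not-fail I/O has failed */
    int  last_error;  /* errno of the failure that suspended the pool */
};

/*
 * Completion path: where the current code panics on a failed
 * must-not-fail I/O, remember the failure and hand the error back.
 */
int
io_done(struct pool_state *ps, int io_error, bool can_fail)
{
    assert(ps != NULL);

    if (io_error != 0 && !can_fail) {
        ps->suspended = true;
        ps->last_error = io_error;
    }
    return io_error;
}

/* Submission path: refuse new I/O while the pool is suspended. */
int
io_start(const struct pool_state *ps)
{
    assert(ps != NULL);

    return ps->suspended ? EIO : 0;
}
```

The point of the model is only that the failure gets recorded on the pool and surfaces to callers as an error. Whether the real zio pipeline can unwind safely from that spot is exactly the open question.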

Original comment by dayenter...@gmail.com on 16 Feb 2011 at 3:21

GoogleCodeExporter commented 9 years ago

"Is there a reason to wait for the failmode property in 77?"

Well, yes, frankly. Firstly, because it's non-trivial to significantly change 
the underlying codebase (without causing problems for merging later on), and 
secondly, because MacZFS_77 is, relatively speaking, just round the corner. 
There are some problems which need to be ironed out, but that represents a less 
significant change than the kind of thing you are thinking of.

MacZFS has always had this restriction, and it hasn't caused the fear and scare 
that you quote so far.

Original comment by alex.ble...@gmail.com on 16 Feb 2011 at 9:58

GoogleCodeExporter commented 9 years ago

"MacZFS has always had this restriction, and it hasn't caused the fear and 
scare that you quote so far."

Please try to see this from the user's point of view. "Fear" and "scare" are all 
relative. When someone tries MacZFS, all it takes is a single kernel panic on 
an event as common as an external drive unexpectedly disconnecting before 
they start uninstalling. ZFS is supposed to bring enhanced reliability to 
storage. "Reliability" and "kernel panics" don't belong in the same sentence, 
regardless of the circumstances or judgements being made to force the kernel 
panic.

With the exception of the Mac Pro, all current and recent generation 
Macintoshes are "integrated systems", meaning that the average user can't 
realistically add internal storage. The only way we have to add large amounts 
of storage (for which people would look to ZFS to prevent bit rot and do 
RAID-Z, etc.) is to connect that storage externally via USB, Firewire or 
Thunderbolt. Either that or move to SAN, which for many is prohibitively 
expensive.

From that, one can draw the conclusion that the most common way for Mac users 
to add large amounts of storage to their system is via externally connected 
storage. Because none of the usable external storage connection mechanisms is 
based on the concept of physically secured connectors, drives becoming 
accidentally disconnected is a *far too common* phenomenon for Mac users who 
would like to use ZFS. However, kernel panics are completely unacceptable. In 
the end, it's the filesystem authors making a value judgement of "what's more 
acceptable - filesystem problems or lost work in running apps". I don't believe 
that's a decision that the filesystem should be making without any alternative 
for the user. Kernels shouldn't panic, plain and simple.

Original comment by timhenr...@gmail.com on 11 Oct 2011 at 2:43

GoogleCodeExporter commented 9 years ago

Tim, you are perfectly right. For the Mac community this is a requirement, not 
something nice to have. If ZFS is supposed to work with non-removable drives 
only, then any initiative to port the project to the Mac has no point.

Original comment by sbernat...@gmail.com on 20 Dec 2011 at 8:53

GoogleCodeExporter commented 9 years ago

Tim and sbern,
I completely agree with both of you. I downloaded MacZFS, installed it and 
used it for a while.
I wanted to eject the disk, but the Finder wouldn't allow me to because the disk 
was in use. WHAT?
In use, while nothing was being written to it and no app was running from it.
So I decided to just yank the cord. I almost never have to do this, but I just 
wanted to get it done without a restart.
To my surprise the whole system crashed.
No one will use MacZFS if this is not solved. Heck, I even believe that 
part of the reason Apple dropped ZFS was this very problem.
Solve the problem and I will be back.

Original comment by febbyman...@gmail.com on 28 Jan 2012 at 8:26

GoogleCodeExporter commented 9 years ago

Issue 102 has been merged into this issue.

Original comment by alex.ble...@gmail.com on 7 Mar 2012 at 11:28

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Having similar issues here. Created a ZFS volume on two (JBOD) drives connected 
by USB2. Volumes work. Problem: system crashes when ejecting (sometimes). 
Another strange behaviour: I eject the volume, pull the USB connector, 
everything works. When I reconnect the USB connector, the kernel panics.

Original comment by lars.ebe...@gmail.com on 23 Sep 2012 at 10:16

GoogleCodeExporter commented 9 years ago

This is a major issue.
As was outlined above, 95% of Macs around are integrated systems. And we do have 
to disconnect these zfs drives anyway.
I get a kernel panic if I:
1) eject the disk using Finder
2) disconnect the cord
3) insert the USB cord -> PANIC

I can avoid the panic if I:
1) eject the disk using Finder
2) export via terminal
3) disconnect the cord
4) insert the USB cord

It took me quite a few kernel panics to learn that, and I still suffer from it 
as my wife does not know a lot about the terminal.

Original comment by y...@yan.my on 1 Mar 2013 at 12:07
