TritonDataCenter / smartos-live

For more information, please see http://smartos.org/. For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

Kernel panic caused by GP fault when a disk was pulled that was not being used. #485

Open stevenwilliamson opened 9 years ago

stevenwilliamson commented 9 years ago

The server panicked when a drive was pulled. The drive that was pulled had failed some time earlier; a new disk was inserted into a spare bay and a "zpool replace" was used to replace the failed drive.

So when the drive was pulled from the server, it was no longer part of the zones pool.

Below is some data from an mdb session. If required, I could upload the core dump so that Joyent has access to it.

mdb -k unix.0 vmcore.0
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs sd lofs idm mpt_sas crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ]
> $C
ffffff01e863a710 mutex_owner_running+0xd()
ffffff01e863a750 spa_async_request+0x44(ffffff4447a3e9e0, 2)
ffffff01e863a790 vdev_disk_off_notify+0x77(ffffff43202e0b88, fffffffffbc997c0, ffffff431f509d40, 0)
ffffff01e863a820 ldi_invoke_notify+0x1b1(ffffff431ebe2550, fffffffffffffffe, 0, fffffffffbb5b228, 0)
ffffff01e863a860 e_ddi_offline_notify+0x75(ffffff431ebe2550)
ffffff01e863a8d0 devi_detach_node+0x52(ffffff431ebe2550, 40001)
ffffff01e863a940 ndi_devi_offline+0x89(ffffff431ebe2550, 1)
ffffff01e863a9a0 mptsas_offline_lun+0xd3(ffffff431ebe27f8, ffffff431ebe2550, 0, 1)
ffffff01e863aa00 mptsas_offline_target+0xb8(ffffff431ebe27f8, ffffff5dd58ca440)
ffffff01e863aaf0 mptsas_handle_topo_change+0x3e9(ffffffcd33c88800, ffffff431ebe27f8)
ffffff01e863ab60 mptsas_handle_dr+0x184(ffffffcd33c88800)
ffffff01e863ac20 taskq_thread+0x2d0(ffffff431d3c4d68)
ffffff01e863ac30 thread_start+8()
> ::status
debugging crash dump vmcore.0 (64-bit) from phy3-sw1-ash
operating system: 5.11 joyent_20150622T171240Z (i86pc)
image uuid: (not set)
panic message: BAD TRAP: type=d (#gp General protection) rp=ffffff01e863a5b0 addr=0
dump content: kernel pages only
>

> ::panicinfo
             cpu                1
          thread ffffff01e863ac40
         message BAD TRAP: type=d (#gp General protection) rp=ffffff01e863a5b0 addr=0
             rdi ffffff4447a3f0e0
             rsi                0
             rdx              800
             rcx                2
              r8                0
              r9                a
             rax ffffff01e863ac40
             rbx ffffff4447a3f0e0
             rbp ffffff01e863a710
             r10 ffffff01e863a5b0
             r11 1fffffe888f47000
             r12                1
             r13 1fffffe888f47000
             r14                0
             r15                0
          fsbase fffffd7fff071a40
          gsbase ffffff42eb902580
              ds               4b
              es               4b
              fs                0
              gs                0
          trapno                d
             err                0
             rip fffffffffb85f37d
              cs               30
          rflags            10206
             rsp ffffff01e863a6a8
              ss               38
          gdt_hi                0
          gdt_lo         7000ffff
          idt_hi                0
          idt_lo         6000ffff
             ldt                0
            task               70
             cr0         8005003b
             cr2          26179a8
             cr3         12000000
             cr4            426f8

> fffffffffb85f37d::dis
mutex_owner_running:            movq   (%rdi),%r11
mutex_owner_running+3:          andq   $0xfffffffffffffff8,%r11
mutex_owner_running+7:          cmpq   $0x0,%r11
mutex_owner_running+0xb:        je     +0x10    <mutex_owner_running+0x1d>
mutex_owner_running+0xd:        movq   0xd8(%r11),%r8
mutex_owner_running+0x14:       movq   0x18(%r8),%r9
mutex_owner_running+0x18:       cmpq   %r11,%r9
mutex_owner_running+0x1b:       je     +0x4     <mutex_owner_running+0x21>
mutex_owner_running+0x1d:       xorq   %rax,%rax
mutex_owner_running+0x20:       ret
mutex_owner_running+0x21:       movq   %r8,%rax
mutex_owner_running+0x24:       ret
0xfffffffffb85f395:             nopl   (%rax)
mutex_owner_running_critical_size:      sbbb   %al,(%rax)
mutex_owner_running_critical_size+2:    .byte   0

Looking at where it actually panicked:

> mutex_owner_running::dis
mutex_owner_running:            movq   (%rdi),%r11
mutex_owner_running+3:          andq   $0xfffffffffffffff8,%r11
mutex_owner_running+7:          cmpq   $0x0,%r11
mutex_owner_running+0xb:        je     +0x10    <mutex_owner_running+0x1d>
mutex_owner_running+0xd:        movq   0xd8(%r11),%r8
mutex_owner_running+0x14:       movq   0x18(%r8),%r9
mutex_owner_running+0x18:       cmpq   %r11,%r9
mutex_owner_running+0x1b:       je     +0x4     <mutex_owner_running+0x21>
mutex_owner_running+0x1d:       xorq   %rax,%rax
mutex_owner_running+0x20:       ret
mutex_owner_running+0x21:       movq   %r8,%rax
mutex_owner_running+0x24:       ret

How do we get to mutex_owner_running from mutex_enter, and how can it cause a GP fault? mutex_owner_running does not show any arguments in the stack trace, so what populates 0xd8(%r11)?
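
Reading the disassembly, my rough C reconstruction of what mutex_owner_running is doing would be something like the sketch below (a guess at the logic, not the actual source; 0xd8 and 0x18 are presumably offsetof(kthread_t, t_cpu) and offsetof(cpu_t, cpu_thread) in this build):

#include <stdint.h>

/* Opaque stand-ins for the real kernel types. */
typedef struct kthread kthread_t;
typedef struct cpu cpu_t;

cpu_t *
mutex_owner_running_sketch(void *mutexp)
{
    uintptr_t word = *(uintptr_t *)mutexp;                /* movq (%rdi),%r11    */
    kthread_t *owner = (kthread_t *)(word & ~7UL);        /* strip flag bits     */

    if (owner == NULL)                                    /* je <+0x1d>          */
        return (NULL);

    cpu_t *cp = *(cpu_t **)((uintptr_t)owner + 0xd8);     /* movq 0xd8(%r11),%r8 */
    kthread_t *t = *(kthread_t **)((uintptr_t)cp + 0x18); /* movq 0x18(%r8),%r9  */

    return (t == owner ? cp : NULL);                      /* owner still on CPU? */
}

If that reading is right, %r11 holds the masked lock word, which here is the garbage value 0x1fffffe888f47000 (matching both r11 in ::panicinfo above and spa_async_lock below). The 0xd8(%r11) load would then touch a non-canonical address, which would presumably explain a #gp rather than a page fault.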

Am I barking up the wrong tree? Do I need to dig this low, or is it that the mutex structure passed in as part of the spa_async_request call in the stack trace is invalid?

ffffff01e863a710 mutex_owner_running+0xd()
ffffff01e863a750 spa_async_request+0x44(ffffff4447a3e9e0, 2)
ffffff01e863a790 vdev_disk_off_notify+0x77(ffffff43202e0b88, fffffffffbc997c0, ffffff431f509d40, 0)
ffffff01e863a820 ldi_invoke_notify+0x1b1(ffffff431ebe2550, fffffffffffffffe, 0, fffffffffbb5b228, 0)
ffffff01e863a860 e_ddi_offline_notify+0x75(ffffff431ebe2550)
ffffff01e863a8d0 devi_detach_node+0x52(ffffff431ebe2550, 40001)
ffffff01e863a940 ndi_devi_offline+0x89(ffffff431ebe2550, 1)
ffffff01e863a9a0 mptsas_offline_lun+0xd3(ffffff431ebe27f8, ffffff431ebe2550, 0, 1)
ffffff01e863aa00 mptsas_offline_target+0xb8(ffffff431ebe27f8, ffffff5dd58ca440)
ffffff01e863aaf0 mptsas_handle_topo_change+0x3e9(ffffffcd33c88800, ffffff431ebe27f8)
ffffff01e863ab60 mptsas_handle_dr+0x184(ffffffcd33c88800)
ffffff01e863ac20 taskq_thread+0x2d0(ffffff431d3c4d68)
ffffff01e863ac30 thread_start+8()

> ffffff4447a3e9e0::print spa_t
{
    spa_name = [ '@', '\305', 'c', '\034', 'C', '\377', '\377', '\377', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', 'p', '\364', '\210', '\350', '\377',
 '\377', '\037', '\0', '\0', '\0', '\0', '\0', '\0', '\0', '\0', ... ]
    spa_comment = 0x1fffffe888f47000
    spa_avl = {
        avl_child = [ 0, 0 ]
        avl_pcb = 0xffffff4400000001
    }
    .......
    snipped for clarity
    spa_async_lock = {
        _opaque = [ 0x1fffffe888f47000 ]
    }

Is that a mutex data structure, and if so, how do I check that it's valid?

I naively tried the below, but I'm learning MDB and Illumos internals as I go, so I am probably missing part of the picture. (I would appreciate a rundown of how this is triaged if it gets looked at.)

> 0x1fffffe888f47000::print mutex_t
mdb: failed to look up type mutex_t: no symbol corresponds to address
rmustacc commented 9 years ago

Thanks for reporting this.

The kernel type for a mutex isn't a mutex_t, but is actually a kmutex_t. However, mdb also has a built-in dcmd, so you should be able to take the address you have for a lock and run ::mutex on it, which will tell you more information about it. That said, the spa_name member looks a little suspect there. Would it be possible to make the dump available? We should be able to dig in a bit more and get more information about where the spa went wrong.

Another thing you can do is use the ::whatis dcmd on the address of the spa_t. I suspect that it might be garbage or have been freed for some reason, based on the name.
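
For example, an illustrative kernel-style snippet (not the spa code itself) showing that the lock type in kernel code is kmutex_t:

#include <sys/types.h>
#include <sys/mutex.h>

/* Illustrative only: declare, use, and tear down a kernel mutex. */
static kmutex_t example_lock;

static void
example(void)
{
    mutex_init(&example_lock, NULL, MUTEX_DEFAULT, NULL);

    mutex_enter(&example_lock);
    /* ...work protected by the lock... */
    mutex_exit(&example_lock);

    mutex_destroy(&example_lock);
}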

rmustacc commented 9 years ago

From consulting with some folks in #illumos, it appears that this is the same issue as discussed in this thread: http://comments.gmane.org/gmane.os.illumos.devel/21354

stevenwilliamson commented 9 years ago

OK, ::whatis on the spa_t only yields:

ffffff4447a3e9e0 is freed from anon_cache

I'm guessing this means another kernel thread freed the memory.

Running ::mutex on the mutex that we panicked on shows

> ffffff4447a3f0e0::mutex
            ADDR  TYPE             HELD MINSPL OLDSPL WAITERS
ffffff4447a3f0e0 adapt 1fffffe888f47000      -      -      no

But I'm guessing this is just old state, given that the spa_t has been freed, and is of no help.

It sounds like it may well be the bug you have linked to, then. Is the patch mentioned in the post something that could be pulled into Illumos?

If there is any doubt that it is this issue, I can still upload the dump somewhere.

stevenwilliamson commented 8 years ago

Just had another crash that looks like a repeat of this issue. A disk was pulled and the system crashed.

[root@phy0-sw1-ash (ash) /var/crash/volatile]# mdb -k unix.0 vmcore.0
mdb: warning: dump is from SunOS 5.11 joyent_20150622T171240Z; dcmds and macros may not match kernel implementation
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci ufs ip hook neti sockfs arp usba stmf_sbd stmf zfs mm sd lofs idm mpt_sas crypto random cpc logindmux ptm kvm sppp nsmb smbsrv nfs ]
> $C
ffffff01ea3216c0 vdev_disk_off_notify+0x45(ffffff4320864588, fffffffffbc997c0, ffffff431f2469c0, 0)
ffffff01ea321750 ldi_invoke_notify+0x1b1(ffffff431edd8010, fffffffffffffffe, 0, fffffffffbb5b228, 0)
ffffff01ea321790 e_ddi_offline_notify+0x75(ffffff431edd8010)
ffffff01ea321800 devi_detach_node+0x52(ffffff431edd8010, 41000)
ffffff01ea321870 ndi_devi_offline+0x89(ffffff431edd8010, 1000)
ffffff01ea321900 i_mdi_pi_state_change+0x511(ffffff431ebf4140, 4, 0)
ffffff01ea321940 mdi_pi_offline+0x39(ffffff431ebf4140, 0)
ffffff01ea3219a0 mptsas_offline_lun+0x71(ffffff431edd8560, 0, ffffff431ebf4140, 1)
ffffff01ea321a00 mptsas_offline_target+0x185(ffffff431edd8560, ffffff44b7383900)
ffffff01ea321af0 mptsas_handle_topo_change+0x3e9(ffffff43aa749790, ffffff431edd8560)
ffffff01ea321b60 mptsas_handle_dr+0x184(ffffff43aa749790)
ffffff01ea321c20 taskq_thread+0x2d0(ffffff431d454d68)
ffffff01ea321c30 thread_start+8()
> ::panicinfo
             cpu                3
          thread ffffff01ea321c40
         message BAD TRAP: type=e (#pf Page fault) rp=ffffff01ea3215a0 addr=38 occurred in module "zfs" due to a NULL pointer dereference
             rdi ffffff431f2469c0
             rsi fffffffff7d6555e
             rdx fffffffff7d6555f
             rcx                0
              r8 ffffff431edd8010
              r9                0
             rax                0
             rbx ffffff431f2469c0
             rbp ffffff01ea3216c0
             r10 ffffff01ea321430
             r11                0
             r12                0
             r13                0
             r14 fffffffffffffffe
             r15 fffffffffbb5b228
          fsbase fffffd7ffc566a40
          gsbase ffffff42eb8fda80
              ds               4b
              es               4b
              fs                0
              gs                0
          trapno                e
             err                2
             rip fffffffff7d39335
              cs               30
          rflags            10246
             rsp ffffff01ea321690
              ss               38
          gdt_hi                0
          gdt_lo         7000ffff
          idt_hi                0
          idt_lo         6000ffff
             ldt                0
            task               70
             cr0         8005003b
             cr2               38
             cr3         12000000
             cr4            426f8
> fffffffff7d39335::dis -a
fffffffff7d3930e                movq   %rdx,-0x18(%rbp)
fffffffff7d39312                movq   %rsi,-0x10(%rbp)
fffffffff7d39316                movq   0x60(%rdx),%r12
fffffffff7d3931a                call   +0x3cb9d51       <ldi_ev_get_type>
fffffffff7d3931f                movq   $0xfffffffff7d6554d,%rsi
fffffffff7d39326                movq   %rax,%rdi
fffffffff7d39329                call   +0x3b4f4b2       <strcmp>
fffffffff7d3932e                testl  %eax,%eax
fffffffff7d39330                jne    +0x35    <vdev_disk_off_notify+0x77>
fffffffff7d39332                movq   %rbx,%rdi
fffffffff7d39335                movl   $0x1,0x38(%r12)
fffffffff7d3933e                call   +0xdcd   <vdev_disk_close>
fffffffff7d39343                movq   0x58(%rbx),%rdi
fffffffff7d39347                movq   %rbx,%rsi
fffffffff7d3934a                call   -0x1614f <zfs_post_remove>
fffffffff7d3934f                movq   0x58(%rbx),%rdi
fffffffff7d39353                movl   $0x1,0x240(%rbx)
fffffffff7d3935d                movl   $0x2,%esi
fffffffff7d39362                call   -0x36477 <spa_async_request>
fffffffff7d39367                xorl   %eax,%eax
fffffffff7d39369                movq   (%rsp),%rbx

Crashed due to a NULL pointer dereference: %r12 = 0, so we try to access memory address 0x38.

Tying this up with the code in vdev_disk.c, it looks like this is because the dvd pointer is NULL/incorrect. We crash setting dvd->vd_ldi_offline = B_TRUE:

    if (strcmp(ldi_ev_get_type(ecookie), LDI_EV_OFFLINE) != 0)
        return (LDI_EV_SUCCESS);

    /*
     * All LDI handles must be closed for the state change to succeed, so
     * call on vdev_disk_close() to do this.
     *
     * We inform vdev_disk_close that it is being called from offline
     * notify context so it will defer cleanup of LDI event callbacks and
     * freeing of vd->vdev_tsd to the offline finalize or a reopen.
     */
    dvd->vd_ldi_offline = B_TRUE;
    vdev_disk_close(vd);
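
For context, my reconstruction of the top of the same callback (pieced together from the stack trace and the disassembly, so the parameter layout and offsets are guesses, not the source) would look roughly like:

/* Assumes the usual <sys/vdev_impl.h>, <sys/vdev_disk.h>, <sys/sunldi.h>. */
static int
vdev_disk_off_notify(ldi_handle_t lh, ldi_ev_cookie_t ecookie, void *arg,
    void *ev_data)
{
    vdev_t *vd = arg;                 /* ffffff431f2469c0 in this crash      */
    vdev_disk_t *dvd = vd->vdev_tsd;  /* movq 0x60(%rdx),%r12 -> %r12 == 0   */

    if (strcmp(ldi_ev_get_type(ecookie), LDI_EV_OFFLINE) != 0)
        return (LDI_EV_SUCCESS);

    dvd->vd_ldi_offline = B_TRUE;     /* movl $0x1,0x38(%r12): faults here   */
    vdev_disk_close(vd);
    /* ...zfs_post_remove() and spa_async_request() follow... */
    return (LDI_EV_SUCCESS);
}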

dvd is the vdev_tsd grabbed from the vdev_t that is passed in. Looking at the pointer passed in for the vdev_t, everything is null:

> ffffff431f2469c0::print vdev_t
{
    vdev_id = 0
    vdev_guid = 0
    vdev_guid_sum = 0
    vdev_orig_guid = 0
    vdev_asize = 0
    vdev_min_asize = 0
    vdev_max_asize = 0
    vdev_ashift = 0
    .......Snipped for clarity

Looking at the address, it looks like it has been freed by another thread:

> ffffff431f2469c0::whatis
ffffff431f2469c0 is freed from kmem_alloc_1600

So it appears there is a race condition: it looks like something else is also calling vdev_disk_close(). But I can't see what else would be calling vdev_close.

I'm not sure the patch linked to previously would resolve this specific issue, given that a disk was just pulled and no new disk was added. I can't see why vdev_open would have been called, so I don't think it's a race between vdev_open and vdev_close.
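
To make that suspicion concrete, the interleaving I'm imagining is sketched below (purely hypothetical, not verified against the source):

/*
 * Hypothetical interleaving consistent with the ::whatis output above
 * (a guess at the race, not a verified analysis):
 *
 *   mptsas taskq thread                        some other thread
 *   -------------------                        -----------------
 *   mptsas_handle_dr()
 *     ...
 *       ldi_invoke_notify()
 *         vdev_disk_off_notify(vd)             ZFS tears down the same vdev
 *           (vd is still registered as         (e.g. vdev_free() after the
 *            the LDI callback argument)         pool config settles), so the
 *                                               vdev_t goes back to the
 *                                               kmem_alloc_1600 cache
 *           dvd = vd->vdev_tsd;        reads 0 from the freed, reused memory
 *           dvd->vd_ldi_offline = ...  NULL dereference -> panic
 */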

rmustacc commented 8 years ago

@jclulow Could you take a look at this in the context of other ZFS hotplug restructuring that we've done?