TritonDataCenter / smartos-live

For more information, please see http://smartos.org/ For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

Crash in LZ4_uncompress_unknownOutputSize #385

Open sigxcpu76 opened 9 years ago

sigxcpu76 commented 9 years ago

Can this be related to https://www.illumos.org/issues/4936 ? I've checked the smartos-live changelog around that commit date and didn't see 4936 as being merged.

ffffff0020c74c40 ffffff04ea416080 ffffff04ec05a800   3  99                0
  PC: panicsys+0x102    CMD: zpool-zones
  stack pointer for thread ffffff0020c74c40: ffffff0020c741e0
    as_fault+0x534(fffffffffb930a18, b83aa9a12e82, ffffff0286876000, 
    ffffff0020c744c0, fffffffffb92ca15, fffffffffb92c92b)
    0xfffffffffb8482cc()
    trap+0xdb3(ffffff0020c744c0, ffffff0286876000, 1)
    0xfffffffffb8001d6()
    LZ4_uncompress_unknownOutputSize+0x147(ffffff05690d7e04, ffffff0286876000, 
    cf6, 2000)
    lz4_decompress+0x44(ffffff05690d7e00, ffffff0286876000, e00, 2000, 0)
    zio_decompress_data+0x4b(f, ffffff05690d7e00, ffffff0286876000, e00, 2000)
    zio_decompress+0x56(ffffff04faaafb78, ffffff0286876000, 2000)
    zio_pop_transforms+0x7a(ffffff04faaafb78)
    zio_done+0x21a(ffffff04faaafb78)
    zio_execute+0x78(ffffff04faaafb78)
    zio_notify_parent+0xdb(ffffff04faaafb78, ffffff0546b215a0, 1)
    zio_done+0x3a0(ffffff0546b215a0)
    zio_execute+0x78(ffffff0546b215a0)
    zio_notify_parent+0xdb(ffffff0546b215a0, ffffff052dfd7110, 1)
    zio_done+0x3a0(ffffff052dfd7110)
    zio_execute+0x78(ffffff052dfd7110)
    taskq_thread+0x2d0(ffffff04ebaf65d8)
    thread_start+8()

Also reported here: https://www.illumos.org/issues/5522, but that report may be moot if this is indeed 4936.

rmustacc commented 9 years ago

We did merge 4936, see

commit dfb423e13dce7c91059cc1b29429e326d6737ba3
Merge: b93eaee 58d0718
Author: Keith M Wesolowski <wesolows@foobazco.org>
Date:   Thu Jun 26 22:33:35 2014 +0000

        [illumos-gate merge]

    commit 58d0718061c87e3d647c891ec5281b93c08dba4e
        4936 lz4 could theoretically overflow a pointer with a certain input

Unfortunately, that doesn't enlighten us as to what's going on here. What does

::status

say in this case? Can I also suggest that you reach out to the ZFS mailing list with information about the panic? That will help folks get on it sooner.

Thanks for reporting this, sorry for the trouble.

sigxcpu76 commented 9 years ago

Sorry, I missed that comment in the changelog.

Yesterday it crashed again, in another lz4 function. I didn't take the crash dump because I couldn't wait for it (I needed the server up quickly). I've also lost the other dumps (I didn't bother to save them after extracting the crash.0 file). I'm pretty sure it is something committed between the 12/25 and 01/08 releases, because it was fine for those two weeks and crashed twice after the 01/08 upgrade.

Now it is running 12/25 again, to see whether it crashes at all. Could this somehow be related to the L2ARC ARC-header footprint improvement (https://www.illumos.org/issues/5408)? I do use an L2ARC device on the "zones" pool.

rmustacc commented 9 years ago

Okay, was that in a similar code path as the dump you had, if you recall?

Regardless, this seems rather bad. Can you please reach out to the ZFS mailing list? That will help get the right folks looking at the problem. If you'd rather, and you're okay with making the dump available, I can do so instead.

sigxcpu76 commented 9 years ago

It wasn't the same function but a different one, still starting with lz4_*.

I'll boot the 0108 release and attempt to read an lz4-compressed virtual disk (the only dataset compressed with lz4, actually). Maybe I can crash it again and save the vmdump.

sigxcpu76 commented 9 years ago

I've read that lz4 dataset with dd a few times, with and without L2ARC enabled (all vs. metadata), and I've had no "luck" crashing it again, so my idea of a badly lz4-encoded data block is out of the question. I'll keep it running with 01082015 until it crashes by itself, and I'll come back with the vmdump.

sigxcpu76 commented 9 years ago

It crashed again. This is getting stranger and stranger, because it is no longer ZFS-related. I can't suspect hardware, because there are no recorded errors in the iLO log (the machine has ECC memory), and while the current release doesn't reach 24h of uptime, 12252014 ran for almost two weeks without a single issue.

> $C
ffffff001e8179a0 vpanic()
ffffff001e817a50 0xfffffffffba2c2f4()
ffffff001e817ab0 kmem_slab_free+0xc1(ffffff04e5da9008, ffffff0276ec0000)
ffffff001e817b20 kmem_magazine_destroy+0x6e(ffffff04e5da9008, ffffff05568ca180, f)
ffffff001e817b60 kmem_depot_ws_reap+0x5d(ffffff04e5da9008)
ffffff001e817c20 taskq_thread+0x2d0(ffffff04e4642d28)
ffffff001e817c30 thread_start+8()
> ::status
debugging crash dump vmcore.1 (64-bit) from gen8
operating system: 5.11 joyent_20150108T111855Z (i86pc)
image uuid: (not set)
panic message: kernel heap corruption detected
dump content: kernel pages only
> ::panicinfo
             cpu                1
          thread ffffff001e817c40
         message kernel heap corruption detected
             rdi fffffffffbb5dd38
             rsi ffffff001e817940
             rdx                0
             rcx                0
              r8 ffffff04dfe7d638
              r9                0
             rax ffffff001e817960
             rbx ffffff0505a2db20
             rbp ffffff001e8179a0
             r10                1
             r11 fffffffffb800983
             r12 ffffff04dfe9db08
             r13 ffffff04e5da9008
             r14 ffffff0276ec0000
             r15 ffffff0276ec0000
          fsbase fffffd7fff102a40
          gsbase ffffff04e58a3580
              ds               4b
              es               4b
              fs                0
              gs                0
          trapno                0
             err                0
             rip fffffffffb85eb40
              cs               30
          rflags              282
             rsp ffffff001e817938
              ss               38
          gdt_hi                0
          gdt_lo         c000ffff
          idt_hi                0
          idt_lo         b000ffff
             ldt                0
            task               70
             cr0         8005003b
             cr2         bf93e2f8
             cr3         12000000
             cr4            426f8
sigxcpu76 commented 9 years ago

And here's the ::msgbuf output, that provides more information:

kernel memory allocator:
buffer freed to wrong cache
buffer was allocated from zfs_file_data_8192,
caller attempting free to zio_data_buf_8192.
buffer=ffffff0276ec0000  bufctl=ffffff0505a2db20  cache: zio_data_buf_8192

panic[cpu1]/thread=ffffff001e817c40:
kernel heap corruption detected

ffffff001e817a50 fffffffffba2c2f4 ()
ffffff001e817ab0 genunix:kmem_slab_free+c1 ()
ffffff001e817b20 genunix:kmem_magazine_destroy+6e ()
ffffff001e817b60 genunix:kmem_depot_ws_reap+5d ()
ffffff001e817c20 genunix:taskq_thread+2d0 ()
ffffff001e817c30 unix:thread_start+8 ()

syncing file systems...
 done
bcantrill commented 9 years ago

Could you run with kmem_flags set? To do this, boot into kmdb by adding a "-d" to the boot line. Then, when you're dropped into the kmdb prompt in the debugger, do this:

kmem_flags/W f

Then:

:c

You will likely die in a much more explicit way (there are clearly some bad frees occurring), and that dump will be very helpful for root-causing it.
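On a stock illumos system where /etc/system persists, the same flags can also be set there so they survive reboots (SmartOS boots a live image, so there the kmdb route above is the practical one). A sketch of the fragment:

```
* /etc/system fragment: enable full kmem debugging
* (audit, test, redzone, contents) -- equivalent to kmem_flags/W f
set kmem_flags = 0xf
```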

       - Bryan


sigxcpu76 commented 9 years ago

OK, I managed to start it with the f flags. It required -kd (-d alone only showed a warning). I'm trying to dd that lz4 dataset; maybe I can trigger a crash.

sigxcpu76 commented 9 years ago

Still waiting. It is impossible to make it crash again.

sigxcpu76 commented 8 years ago

It's interesting that there are multiple reports regarding ZFS and kmem around January 2015 and nothing after that. See https://github.com/joyent/smartos-live/issues/536 and https://www.illumos.org/issues/5504

Since then I haven't had a single crash, updating on almost every SmartOS release. So it seems this was solved by some other, maybe unrelated, commit.

I think this can be closed.