Open sigxcpu76 opened 9 years ago
We did merge 4936, see
commit dfb423e13dce7c91059cc1b29429e326d6737ba3
Merge: b93eaee 58d0718
Author: Keith M Wesolowski <wesolows@foobazco.org>
Date: Thu Jun 26 22:33:35 2014 +0000
[illumos-gate merge]
commit 58d0718061c87e3d647c891ec5281b93c08dba4e
4936 lz4 could theoretically overflow a pointer with a certain input
Unfortunately, that doesn't enlighten us as to what's going on here. What does,
::status
say in this case. Can I also suggest that you reach out to the ZFS mailing list with information about the panic, which will help folks get on it sooner.
Thanks for reporting this, sorry for the trouble.
Sorry, I missed that comment in the changelog.
Yesterday it crashed again. Another lz4 function. I didn't take the crash dump because I couldn't wait for it (I needed the server up quickly). I've also lost the other dumps (didn't bother to save them after extracting crash.0 file). I am pretty sure that it is something committed between 12/25 and 01/08 releases, because it was fine for these two weeks and it crashed twice after 01/08 upgrade.
Now it is running 12/25 again, to see if it crashes at all. Can this somehow be related to the ARC headers for L2ARC blocks footprint improvement (https://www.illumos.org/issues/5408) ? I do use an L2ARC device on "zones" pool.
Okay, was that in a similar code path as the dump you had, if you recall?
Regardless, this seems rather bad, can you please reach out to the zfs mailing list, which will help get the right folks looking at the problem. If you'd rather, and you're okay with making the dump available, then I can do so instead.
It wasn't the same function, but a different one, but still starting with lz4_*.
I'll boot the 0108 release and attempt to read a virtual disk lz4 compressed (the only dataset compressed as lz4 actually). Maybe I can crash it again and save the vmdump.
I've read that lz4 dataset with dd a few times, with and without L2ARC enabled (all vs metadata) and I've had no "luck" in crashing it again, so my idea of having some badly lz4-encoded data block is out of question. I will keep it running with 01082015 until it crashes by itself and I'll come back with the vmdump.
It crashed again. Now this is getting stranger and stranger, because this is not ZFS related anymore. I cannot suspect hardware because there are no recorded errors in iLO log (machine has ECC memory) and current release reaches less than 24h of uptime, but 12252014 had almost two weeks without a single issue.
> $C
ffffff001e8179a0 vpanic()
ffffff001e817a50 0xfffffffffba2c2f4()
ffffff001e817ab0 kmem_slab_free+0xc1(ffffff04e5da9008, ffffff0276ec0000)
ffffff001e817b20 kmem_magazine_destroy+0x6e(ffffff04e5da9008, ffffff05568ca180, f)
ffffff001e817b60 kmem_depot_ws_reap+0x5d(ffffff04e5da9008)
ffffff001e817c20 taskq_thread+0x2d0(ffffff04e4642d28)
ffffff001e817c30 thread_start+8()
> ::status
debugging crash dump vmcore.1 (64-bit) from gen8
operating system: 5.11 joyent_20150108T111855Z (i86pc)
image uuid: (not set)
panic message: kernel heap corruption detected
dump content: kernel pages only
> ::panicinfo
cpu 1
thread ffffff001e817c40
message kernel heap corruption detected
rdi fffffffffbb5dd38
rsi ffffff001e817940
rdx 0
rcx 0
r8 ffffff04dfe7d638
r9 0
rax ffffff001e817960
rbx ffffff0505a2db20
rbp ffffff001e8179a0
r10 1
r11 fffffffffb800983
r12 ffffff04dfe9db08
r13 ffffff04e5da9008
r14 ffffff0276ec0000
r15 ffffff0276ec0000
fsbase fffffd7fff102a40
gsbase ffffff04e58a3580
ds 4b
es 4b
fs 0
gs 0
trapno 0
err 0
rip fffffffffb85eb40
cs 30
rflags 282
rsp ffffff001e817938
ss 38
gdt_hi 0
gdt_lo c000ffff
idt_hi 0
idt_lo b000ffff
ldt 0
task 70
cr0 8005003b
cr2 bf93e2f8
cr3 12000000
cr4 426f8
And here's the ::msgbuf output, that provides more information:
kernel memory allocator:
buffer freed to wrong cache
buffer was allocated from zfs_file_data_8192,
caller attempting free to zio_data_buf_8192.
buffer=ffffff0276ec0000 bufctl=ffffff0505a2db20 cache: zio_data_buf_8192
panic[cpu1]/thread=ffffff001e817c40:
kernel heap corruption detected
ffffff001e817a50 fffffffffba2c2f4 ()
ffffff001e817ab0 genunix:kmem_slab_free+c1 ()
ffffff001e817b20 genunix:kmem_magazine_destroy+6e ()
ffffff001e817b60 genunix:kmem_depot_ws_reap+5d ()
ffffff001e817c20 genunix:taskq_thread+2d0 ()
ffffff001e817c30 unix:thread_start+8 ()
syncing file systems...
done
Could you run with kmem_flags set? Do this, boot into kmdb by adding a "-d" to the boot line. Then, when you're dropped into the kmdb prompt in the debugger, do this:
kmem_flags/W f
Then:
:c
You will likely die in a much more explicit way (there are clearly some bad frees occurring) -- and that dump will be very helpful for root-causing it.
- Bryan
On Mon, Jan 12, 2015 at 11:42 AM, Alexandru Pirvulescu < notifications@github.com> wrote:
And here's the ::msgbuf output, that provides more information:
kernel memory allocator: buffer freed to wrong cache buffer was allocated from zfs_file_data_8192, caller attempting free to zio_data_buf_8192. buffer=ffffff0276ec0000 bufctl=ffffff0505a2db20 cache: zio_data_buf_8192
panic[cpu1]/thread=ffffff001e817c40: kernel heap corruption detected
ffffff001e817a50 fffffffffba2c2f4 () ffffff001e817ab0 genunix:kmem_slab_free+c1 () ffffff001e817b20 genunix:kmem_magazine_destroy+6e () ffffff001e817b60 genunix:kmem_depot_ws_reap+5d () ffffff001e817c20 genunix:taskq_thread+2d0 () ffffff001e817c30 unix:thread_start+8 ()
syncing file systems... done
— Reply to this email directly or view it on GitHub https://github.com/joyent/smartos-live/issues/385#issuecomment-69632084.
OK, managed to start it with f flags. It required -kd to do it (-d only had shown a warning). I'm trying to dd that lz4 dataset, maybe I can trigger a crash.
Still waiting. It is impossible to make it crash again.
It's interesting that there are multiple reports regarding ZFS and kmem around January 2015 and nothing after that. See https://github.com/joyent/smartos-live/issues/536 and https://www.illumos.org/issues/5504
Since then I didn't have a single crash, updating almost on every SmartOS release. So it seem that this was solved by some other commit, maybe unrelated.
I think this can be closed.
Can this be related to https://www.illumos.org/issues/4936 ? I've checked the smartos-live changelog around that commit date and didn't see 4936 as being merged.
Also reported here https://www.illumos.org/issues/5522, but it might be useless if it is indeed 4936.