TritonDataCenter / smartos-live

For more information, please see http://smartos.org/ For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

kernel panic on lx_brand futex() #918

Closed ingenthr closed 4 years ago

ingenthr commented 4 years ago

I have a system that was generally running fine on a 2008 vintage SmartOS image, but updating a package in a zone led to a kernel panic with futex().

I subsequently updated to 20200311 and I continue to see the same kernel panic. It's fairly easy to reproduce.

From messages in mdb:

vmcs revision_id = 10
kvm_lapic_reset: vcpu=fffffe4349b1f000, id=3, base_msr= fee00000 PRIx64 base_address=fee00000
kvm_lapic_reset: vcpu=fffffe4349b18000, id=3, base_msr= fee00000 PRIx64 base_address=fee00000
vmcs revision_id = 10
vmcs revision_id = 10
unhandled wrmsr: 0xf66090 data 6c5498
unhandled wrmsr: 0xf66090 data 6c5498
unhandled wrmsr: 0xf66090 data 6c5498
unhandled wrmsr: 0xf66090 data 6c5498
unhandled wrmsr: 0xef25fd08 data 1a4
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0x0 data 0
unhandled wrmsr: 0x0 data 0
vcpu 3 received sipi with vector # 10
vcpu 2 received sipi with vector # 10
vcpu 1 received sipi with vector # 10
kvm_lapic_reset: vcpu=fffffe4349b1f000, id=3, base_msr= fee00800 PRIx64 base_address=fee00000
kvm_lapic_reset: vcpu=fffffe4349b26000, id=2, base_msr= fee00800 PRIx64 base_address=fee00000
kvm_lapic_reset: vcpu=fffffe4349aff000, id=1, base_msr= fee00800 PRIx64 base_address=fee00000
vcpu 1 received sipi with vector # 98 
kvm_lapic_reset: vcpu=fffffe435aa7f000, id=1, base_msr= fee00800 PRIx64 base_address=fee00000
unhandled wrmsr: 0xf66090 data 6c5498 
unhandled wrmsr: 0xf66090 data 6c5498 
unhandled wrmsr: 0xf66090 data 6c5498 
unhandled wrmsr: 0xf66090 data 6c5498 
unhandled wrmsr: 0xef25fd08 data 1a4  
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
vcpu 3 received sipi with vector # 10 
vcpu 1 received sipi with vector # 10 
kvm_lapic_reset: vcpu=fffffe4349b18000, id=3, base_msr= fee00800 PRIx64 base_address=fee00000
vcpu 2 received sipi with vector # 10 
kvm_lapic_reset: vcpu=fffffe4349af8000, id=1, base_msr= fee00800 PRIx64 base_address=fee00000
kvm_lapic_reset: vcpu=fffffe4349b2d000, id=2, base_msr= fee00800 PRIx64 base_address=fee00000
unhandled wrmsr: 0x0 data 0           
unhandled wrmsr: 0x0 data 0           
unhandled rdmsr: 0x140                
vcpu 1 received sipi with vector # 98 
kvm_lapic_reset: vcpu=fffffe4349aff000, id=1, base_msr= fee00800 PRIx64 base_address=fee00000
unhandled rdmsr: 0x140                
vcpu 2 received sipi with vector # 98 
kvm_lapic_reset: vcpu=fffffe4349b26000, id=2, base_msr= fee00800 PRIx64 base_address=fee00000
unhandled rdmsr: 0x140                
vcpu 3 received sipi with vector # 98 
kvm_lapic_reset: vcpu=fffffe4349b1f000, id=3, base_msr= fee00800 PRIx64 base_address=fee00000
unhandled rdmsr: 0x140                
unhandled rdmsr: 0x756e6547           
unhandled wrmsr: 0x40020140 data 0    
unhandled rdmsr: 0x756e6547           
unhandled wrmsr: 0x40020140 data 0    
unhandled rdmsr: 0x756e6547           
unhandled wrmsr: 0x40020140 data 0    
unhandled rdmsr: 0x756e6547           
unhandled wrmsr: 0x40020140 data 0    
unhandled rdmsr: 0x756e6547           
unhandled wrmsr: 0x40020140 data 0    
unhandled rdmsr: 0x756e6547           
unhandled wrmsr: 0x40020140 data 0    
vcpu 1 received sipi with vector # 1  
kvm_lapic_reset: vcpu=fffffe4349af8000, id=1, base_msr= fee00800 PRIx64 base_address=fee00000
vcpu 2 received sipi with vector # 1  
kvm_lapic_reset: vcpu=fffffe4349b2d000, id=2, base_msr= fee00800 PRIx64 base_address=fee00000
vcpu 3 received sipi with vector # 1  
kvm_lapic_reset: vcpu=fffffe4349b18000, id=3, base_msr= fee00800 PRIx64 base_address=fee00000
xsvc0 at root: space 0 offset 0       
xsvc0 is /xsvc@0,0                    
sd0 at scsa2usb0: target 0 lun 0      
sd0 is /pci@0,0/pci1458,5006@1a/hub@1/storage@2/disk@0,0
pseudo-device: llc10                  
llc10 is /pseudo/llc1@0               
pseudo-device: ramdisk1024            
ramdisk1024 is /pseudo/ramdisk@1024   
pseudo-device: ucode0                 
ucode0 is /pseudo/ucode@0             
pseudo-device: dcpc0                  
dcpc0 is /pseudo/dcpc@0               
pseudo-device: fbt0                   
fbt0 is /pseudo/fbt@0                 
pseudo-device: profile0               
profile0 is /pseudo/profile@0         
pseudo-device: lockstat0              
lockstat0 is /pseudo/lockstat@0       
pseudo-device: sdt0                   
sdt0 is /pseudo/sdt@0                 
pseudo-device: systrace0              
systrace0 is /pseudo/systrace@0       
device pciclass,030000@0(display#0) keeps up device sd@0,0(disk#0), but the former is not power managed
pseudo-device: fcp0                   
fcp0 is /pseudo/fcp@0                 
pseudo-device: fcsm0                  
fcsm0 is /pseudo/fcsm@0               
pseudo-device: ipd0                   
ipd0 is /pseudo/ipd@0                 
pseudo-device: stmf0                  
stmf0 is /pseudo/stmf@0               
pseudo-device: fssnap0                
fssnap0 is /pseudo/fssnap@0           
pseudo-device: pm0                    
pm0 is /pseudo/pm@0                   
pseudo-device: lx_systrace0           
lx_systrace0 is /pseudo/lx_systrace@0 
pseudo-device: viona0                 
viona0 is /pseudo/viona@0             
sd3 at ahci0: target 3 lun 0          
sd3 is /pci@0,0/pci1458,b002@1f,2/cdrom@3,0
device pciclass,030000@0(display#0) keeps up device scsiclass,05@3,0(cdrom#3), but the former is not power managed

panic[cpu6]/thread=fffffe434cb47440:  
BAD TRAP: type=e (#pf Page fault) rp=fffffe005ef8dbe0 addr=0 occurred in module "genunix" due to a NULL pointer dereference

memcached:                            
#pf Page fault                        
Bad kernel fault at addr=0x0          
pid=12590, pc=0xfffffffffbe87049, sp=0xfffffe005ef8dcd0, eflags=0x10282
cr0: 80050033<pg,wp,ne,et,mp,pe>  cr4: 626f8<osxsav,pcide,vmxe,xmme,fxsr,pge,mce,pae,pse,de>
cr2: 0                                
cr3: b93026000                        
cr8: 0                                

        rdi:                0 rsi:         ffff64d3 rdx:                0
        rcx: ffffffffc01403c0  r8:                0  r9: fffffe434cb47440
        rax:                0 rbx: ffffffffc0145380 rbp: fffffe005ef8dce0
        r10:        3c391542a r11:                0 r12: fffffe43d5151f40
        r13: fffffe434cb47440 r14:                1 r15: fffffe43d5151dc0
        fsb:     7fffdec00700 gsb: fffffe43181e2000  ds:               4b
         es:               4b  fs:                0  gs:                0
        trp:                e err:                0 rip: fffffffffbe87049
         cs:               30 rfl:            10282 rsp: fffffe005ef8dcd0
         ss:               38

fffffe005ef8dae0 unix:die+c6 ()
fffffe005ef8dbd0 unix:trap+11fd ()
fffffe005ef8dbe0 unix:_cmntrap+e9 ()
fffffe005ef8dce0 genunix:ts2hrt+9 ()
fffffe005ef8dd80 lx_brand:futex_wait+193 ()
fffffe005ef8de70 lx_brand:lx_futex+31e ()
fffffe005ef8def0 lx_brand:lx_syscall_enter+1aa ()
fffffe005ef8df10 unix:brand_sys_syscall+1c6 ()

NOTICE: ahci0: ahci_tran_reset_dport port 0 reset port
NOTICE: ahci0: ahci_tran_reset_dport port 1 reset port

Glad to gather any additional information. I searched for known issues and didn't see anything that quite matches.

I don't think it should matter, but the image as reported by vmadm get is "image_uuid": "b3d02644-d6b2-11e5-bf13-8b034aec4749"

papertigers commented 4 years ago

@bahamat, before we look into this: is this the same issue that you recently fixed?

bahamat commented 4 years ago

Yep.

This was fixed as OS-8141 in joyent/illumos-joyent@71d3d1fb721d3893ca04b652d1b175ee2f54ed05.

The giveaway is the null pointer in ts2hrt().
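
To illustrate why a fault at address 0 inside ts2hrt() suggests a NULL timeout, here is a minimal userland C sketch. It is not the OS-8141 change itself: ts_to_ns(), timeout_to_ns(), SKETCH_NANOSEC, and the -1 sentinel are hypothetical names made up for this example. The assumption is that a conversion routine of this kind reads through its timespec pointer unconditionally, so the caller has to handle the "no timeout" case before calling it. The register dump above is consistent with that reading: rdi is 0 and the faulting address is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Nanoseconds per second; the name is local to this sketch. */
#define SKETCH_NANOSEC 1000000000LL

/*
 * Hypothetical stand-in for a ts2hrt()-style conversion. It reads through
 * the pointer unconditionally, so a NULL argument would fault at address 0.
 * The assert is a userland substitute for the kernel page fault.
 */
static int64_t
ts_to_ns(const struct timespec *ts)
{
	assert(ts != NULL);
	return ((int64_t)ts->tv_sec * SKETCH_NANOSEC + ts->tv_nsec);
}

/*
 * Hypothetical caller-side guard: treat a NULL timeout as "wait forever"
 * and only convert when a timespec was actually supplied.
 */
static int64_t
timeout_to_ns(const struct timespec *timeout)
{
	if (timeout == NULL)
		return (-1);	/* sentinel chosen for this sketch only */
	return (ts_to_ns(timeout));
}
```

With this guard in place, a caller can pass NULL to mean "no timeout" without ever reaching the dereference inside the conversion routine.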

ingenthr commented 4 years ago

Excellent, thanks!