TritonDataCenter / illumos-joyent

Community developed and maintained version of the OS/Net consolidation
http://www.illumos.org/projects/illumos-gate
266 stars 111 forks source link

assertion failed: epp->ep_state & UGEN_EP_STATE_INTR_IN_POLLING_ON #185

Open wiedi opened 5 years ago

wiedi commented 5 years ago

Had the following panic:

2019-01-14T19:09:19.831161+00:00 cube-sw unix: [ID 836849 kern.notice] #012#015panic[cpu1]/thread=fffffe0c189dd440: 
2019-01-14T19:09:19.831189+00:00 cube-sw genunix: [ID 403854 kern.notice] assertion failed: epp->ep_state & UGEN_EP_STATE_INTR_IN_POLLING_ON, file: ../../common/io/usb/usba/usba_ugen.c, line: 2816
2019-01-14T19:09:19.831201+00:00 cube-sw unix: [ID 100000 kern.notice] #012
2019-01-14T19:09:19.831214+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c758f0 genunix:process_type+16c210 ()
2019-01-14T19:09:19.831226+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75970 usba:ugen_epx_intr_IN_req+160 ()
2019-01-14T19:09:19.831238+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c759f0 usba:ugen_epx_req+446 ()
2019-01-14T19:09:19.831249+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75a70 usba:ugen_strategy+20b ()
2019-01-14T19:09:19.831293+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75b50 genunix:default_physio+33d ()
2019-01-14T19:09:19.831307+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75b90 genunix:physio+25 ()
2019-01-14T19:09:19.831319+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75bd0 usba:usb_ugen_read+68 ()
2019-01-14T19:09:19.831330+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75c20 ugen:ugen_skel_read+5e ()
2019-01-14T19:09:19.831343+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75c50 genunix:cdev_read+2d ()
2019-01-14T19:09:19.831354+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75cf0 specfs:spec_read+263 ()
2019-01-14T19:09:19.831366+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75d90 genunix:fop_read+f3 ()
2019-01-14T19:09:19.831389+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75ef0 genunix:read+2b8 ()
2019-01-14T19:09:19.831403+00:00 cube-sw genunix: [ID 655072 kern.notice] fffffe0010c75f00 unix:brand_sys_syscall+320 ()
2019-01-14T19:09:19.831415+00:00 cube-sw unix: [ID 100000 kern.notice] 
2019-01-14T19:09:19.831426+00:00 cube-sw ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 2 reset port
2019-01-14T19:09:19.831438+00:00 cube-sw ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 3 reset port
2019-01-14T19:09:19.831450+00:00 cube-sw ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 1 reset port
2019-01-14T19:09:19.831461+00:00 cube-sw ahci: [ID 405573 kern.info] NOTICE: ahci0: ahci_tran_reset_dport port 0 reset port
2019-01-14T19:09:19.831473+00:00 cube-sw genunix: [ID 365924 kern.warning] WARNING: exceeded kmem_dump space of 33792 bytes: kmem state in dump may be inconsistent

The system was running an joyent_20181227T154655Z AMD test image. Sadly the dump could not be created. I've since resized zones/dump so maybe I’ll catch it when it happens again.

wiedi commented 5 years ago

Just happend again and this time I got a dump and some more clues.

> ::status
debugging crash dump vmcore.1 (64-bit) from cube-sw
operating system: 5.11 joyent_20181227T154655Z (i86pc)
image uuid: (not set)
panic message: assertion failed: epp->ep_state & UGEN_EP_STATE_INTR_IN_POLLING_ON, file: ../../common/io/usb/usba/usba_ugen.c, line: 2816
dump content: kernel pages only
> $C
fffffe000f7588c0 vpanic()
fffffe000f7588f0 0xfffffffffbe157f8()
fffffe000f758970 ugen_epx_intr_IN_req+0x160(fffffe0c800d9000, fffffe0c800d9b70, fffffe0bc8d70dd8, fffffe000f7589ac)
fffffe000f7589f0 ugen_epx_req+0x446(fffffe0c800d9000, fffffe0bc8d70dd8)
fffffe000f758a70 ugen_strategy+0x20b(fffffe0bc8d70dd8)
fffffe000f758b50 default_physio+0x33d(fffffffff7aac710, 0, fa00000004, 40, fffffffff7aac930, fffffe000f758e50)
fffffe000f758b90 physio+0x25(fffffffff7aac710, 0, fa00000004, 40, fffffffff7aac930, fffffe000f758e50)
fffffe000f758bd0 usb_ugen_read+0x68(fffffe0bc21d1008, fa00000004, fffffe000f758e50, fffffe0c3b592118)
fffffe000f758c20 ugen_skel_read+0x5e(fa00000004, fffffe000f758e50, fffffe0c3b592118)
fffffe000f758c50 cdev_read+0x2d(fa00000004, fffffe000f758e50, fffffe0c3b592118)
fffffe000f758cf0 spec_read+0x263(fffffe12dc480400, fffffe000f758e50, 0, fffffe0c3b592118, 0)
fffffe000f758d90 fop_read+0xf3(fffffe12dc480400, fffffe000f758e50, 0, fffffe0c3b592118, 0)
fffffe000f758ef0 read+0x2b8(a, fffffc7fffdffbd0, 14)
fffffe000f758f00 sys_syscall+0x2a5()
> ::panicinfo
             cpu                1
          thread fffffe0bcfc31080
         message assertion failed: epp->ep_state & UGEN_EP_STATE_INTR_IN_POLLING_ON, file: ../../common/io/usb/usba/usba_ugen.c, line: 2816
             rdi fffffffffbf66cb8
             rsi fffffe000f758860
             rdx fffffffff7ab99c8
             rcx              b00
              r8                1
              r9                0
             rax fffffe000f758880
             rbx fffffe0c800d9b70
             rbp fffffe000f7588c0
             r10 303d746e756f635f
             r11 fffffe000f758581
             r12 fffffe0bc8d70dd8
             r13 fffffe0c800d9000
             r14 fffffe000f7589ac
             r15                0
          fsbase                0
          gsbase fffffe0bbacd7000
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb88b4b0
              cs               30
          rflags              286
             rsp fffffe000f758858
              ss               38
          gdt_hi                0
          gdt_lo         c00001ef
          idt_hi                0
          idt_lo         f0000fff
             ldt                0
            task               70
             cr0         80050033
             cr2           56b010
             cr3        1fa4ed000
             cr4              6f8
> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5b000  1f    5    0  59  yes    no t-1    fffffe0c7e7ff440 io_ops_
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 fffffe0010291c20 sched
           QUIESCED                59 fffffe0bd0e58040 io_ops_
             EXISTS                56 fffffe0c14020820 node
             ENABLE                 1 fffffe0c7c5fd0c0 munin-update
                                    1 fffffe0c7c2560c0 vmadm

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 fffffffffbc60900  1b    3    0  54   no    no t-0    fffffe0bcfc31080 apcupsd
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 fffffe000fc51c20 sched
             EXISTS                 1 fffffe0c137804a0 vmadm
             ENABLE                 1 fffffe0bb914c060 uptime

That the panic is happening inside usba and that apcupsd was running suggests that this is due to one of my own experiments.

The system has an APC UPS connected via USB and delegated to a zone like this:

zonecfg -z 08f8b95c-417e-4910-f137-f25760c16b14
    add device
    set match=/dev/usb/51d.2/0/*
    end
    commit

The apcupsd running inside that zone was installed via pkgsrc and using the default config only modified with UPSTYPE usb.

If there is interest I can share the dump or collect more information, otherwise I'll just switch back to a non-debug kernel for now.

rmustacc commented 5 years ago

On 1/18/19 8:36 , Sebastian Wiedenroth wrote:

Just happend again and this time I got a dump and some more clues.

> ::status
debugging crash dump vmcore.1 (64-bit) from cube-sw
operating system: 5.11 joyent_20181227T154655Z (i86pc)
image uuid: (not set)
panic message: assertion failed: epp->ep_state & UGEN_EP_STATE_INTR_IN_POLLING_ON, file: ../../common/io/usb/usba/usba_ugen.c, line: 2816
dump content: kernel pages only
> $C
fffffe000f7588c0 vpanic()
fffffe000f7588f0 0xfffffffffbe157f8()
fffffe000f758970 ugen_epx_intr_IN_req+0x160(fffffe0c800d9000, fffffe0c800d9b70, fffffe0bc8d70dd8, fffffe000f7589ac)
fffffe000f7589f0 ugen_epx_req+0x446(fffffe0c800d9000, fffffe0bc8d70dd8)
fffffe000f758a70 ugen_strategy+0x20b(fffffe0bc8d70dd8)
fffffe000f758b50 default_physio+0x33d(fffffffff7aac710, 0, fa00000004, 40, fffffffff7aac930, fffffe000f758e50)
fffffe000f758b90 physio+0x25(fffffffff7aac710, 0, fa00000004, 40, fffffffff7aac930, fffffe000f758e50)
fffffe000f758bd0 usb_ugen_read+0x68(fffffe0bc21d1008, fa00000004, fffffe000f758e50, fffffe0c3b592118)
fffffe000f758c20 ugen_skel_read+0x5e(fa00000004, fffffe000f758e50, fffffe0c3b592118)
fffffe000f758c50 cdev_read+0x2d(fa00000004, fffffe000f758e50, fffffe0c3b592118)
fffffe000f758cf0 spec_read+0x263(fffffe12dc480400, fffffe000f758e50, 0, fffffe0c3b592118, 0)
fffffe000f758d90 fop_read+0xf3(fffffe12dc480400, fffffe000f758e50, 0, fffffe0c3b592118, 0)
fffffe000f758ef0 read+0x2b8(a, fffffc7fffdffbd0, 14)
fffffe000f758f00 sys_syscall+0x2a5()
> ::panicinfo
             cpu                1
          thread fffffe0bcfc31080
         message assertion failed: epp->ep_state & UGEN_EP_STATE_INTR_IN_POLLING_ON, file: ../../common/io/usb/usba/usba_ugen.c, line: 2816
             rdi fffffffffbf66cb8
             rsi fffffe000f758860
             rdx fffffffff7ab99c8
             rcx              b00
              r8                1
              r9                0
             rax fffffe000f758880
             rbx fffffe0c800d9b70
             rbp fffffe000f7588c0
             r10 303d746e756f635f
             r11 fffffe000f758581
             r12 fffffe0bc8d70dd8
             r13 fffffe0c800d9000
             r14 fffffe000f7589ac
             r15                0
          fsbase                0
          gsbase fffffe0bbacd7000
              ds               4b
              es               4b
              fs                0
              gs              1c3
          trapno                0
             err                0
             rip fffffffffb88b4b0
              cs               30
          rflags              286
             rsp fffffe000f758858
              ss               38
          gdt_hi                0
          gdt_lo         c00001ef
          idt_hi                0
          idt_lo         f0000fff
             ldt                0
            task               70
             cr0         80050033
             cr2           56b010
             cr3        1fa4ed000
             cr4              6f8
> ::cpuinfo -v
 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  0 fffffffffbc5b000  1f    5    0  59  yes    no t-1    fffffe0c7e7ff440 io_ops_
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 fffffe0010291c20 sched
           QUIESCED                59 fffffe0bd0e58040 io_ops_
             EXISTS                56 fffffe0c14020820 node
             ENABLE                 1 fffffe0c7c5fd0c0 munin-update
                                    1 fffffe0c7c2560c0 vmadm

 ID ADDR             FLG NRUN BSPL PRI RNRN KRNRN SWITCH THREAD           PROC
  1 fffffffffbc60900  1b    3    0  54   no    no t-0    fffffe0bcfc31080 apcupsd
                       |    |
            RUNNING <--+    +-->  PRI THREAD           PROC
              READY                60 fffffe000fc51c20 sched
             EXISTS                 1 fffffe0c137804a0 vmadm
             ENABLE                 1 fffffe0bb914c060 uptime

That the panic is happening inside usba and that apcupsd was running suggests that this is due to one of my own experiments.

The system has an APC UPS connected via USB and delegated to a zone like this:

zonecfg -z 08f8b95c-417e-4910-f137-f25760c16b14
  add device
  set match=/dev/usb/51d.2/0/*
  end
  commit

The apcupsd running inside that zone was installed via pkgsrc and using the default config only modified with UPSTYPE usb.

If there is interest I can share the dump or collect more information, otherwise I'll just switch back to a non-debug kernel for now.

While I haven't had a chance to look at the surrounding code, I think we need to figure out whether that ASSERT is valid or not and fix accordingly.