dglo / dor-driver

Test upload of Mantis issues

[jkelley on 2017-03-03 20:41:12] : DOMCal run triggered dead DOM drop #117

Open dglo opened 3 years ago

dglo commented 3 years ago

28 DOMs dropped during the full-detector DOMCal (3/2017). The DOR driver reported dead DOM errors and returned -EIO to the DOMCal surface client, causing the client to hang.

Could one DOM have timed out while its partner DOM was transmitting its results?

Additional Information: From /var/log/messages:

Mar 3 09:20:56 ichub45 kernel: Holy dead DOM, Batman! Got 'dead DOM' state for card=0 ch=7 minor=7
Mar 3 09:20:56 ichub45 kernel: DUMPING RX FIFO for card 0 ch 7 minor 7
Mar 3 09:20:56 ichub45 kernel: jiffies=11840463674 nonemptyRXjif=11840450324 dt=13350
Mar 3 09:20:56 ichub45 kernel: Displaying register dump prior to draining FIFO...
Mar 3 09:20:56 ichub45 kernel: FPGA registers:
Mar 3 09:20:56 ichub45 kernel: CTRL 0x6800100f
Mar 3 09:20:56 ichub45 kernel: GSTAT 0x00007bc2
Mar 3 09:20:56 ichub45 kernel: DSTAT 0x000000ff
Mar 3 09:20:56 ichub45 kernel: TTSIC 0x0000ffff
Mar 3 09:20:56 ichub45 kernel: RTSIC 0xad52122d
Mar 3 09:20:56 ichub45 kernel: INTEN 0x00010200
Mar 3 09:20:56 ichub45 kernel: DOMS 0xff0000ff
Mar 3 09:20:56 ichub45 kernel: MRAR 0x00022c00
Mar 3 09:20:56 ichub45 kernel: MRTC 0x03000000
Mar 3 09:20:56 ichub45 kernel: MWAR 0x00021004
Mar 3 09:20:56 ichub45 kernel: MWTC 0x03000000
Mar 3 09:20:56 ichub45 kernel: CURL 0xe19e8c14
Mar 3 09:20:56 ichub45 kernel: DCUR 0x3c6f3095
Mar 3 09:20:56 ichub45 kernel: FLASH 0xff2e031d
Mar 3 09:20:56 ichub45 kernel: DOMC 0x00000000
Mar 3 09:20:56 ichub45 kernel: CERR 0x700b002d
Mar 3 09:20:56 ichub45 kernel: DCREV 0x00000024
Mar 3 09:20:56 ichub45 kernel: FREV 0x09010471
...
Mar 3 09:21:38 ichub45 kernel: dh: Unrecoverable error, dropping connection on minor 7
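For anyone scanning more of these dumps: the timing line can be checked mechanically. The sketch below assumes (from the field names only, not from the driver source) that dt is the current jiffies count minus the last time the RX FIFO was non-empty. The function name and regex are mine, not part of the driver. Converting dt to seconds would need the kernel HZ on the hub, which is not given here.

```python
import re

# Matches the driver's timing line, e.g.
# "jiffies=11840463674 nonemptyRXjif=11840450324 dt=13350"
TIMING_RE = re.compile(r"jiffies=(\d+) nonemptyRXjif=(\d+) dt=(\d+)")


def check_dt(line):
    """Return (computed_dt, logged_dt) for a timing line, or None if absent.

    computed_dt is jiffies - nonemptyRXjif, which is my assumed definition
    of the logged dt field.
    """
    match = TIMING_RE.search(line)
    if match is None:
        return None
    jiffies, last_rx, logged_dt = (int(group) for group in match.groups())
    return jiffies - last_rx, logged_dt


sample = ("Mar 3 09:20:56 ichub45 kernel: "
          "jiffies=11840463674 nonemptyRXjif=11840450324 dt=13350")
print(check_dt(sample))  # (13350, 13350)
```

The computed and logged values agree for this dump, which is consistent with the DOM being declared dead after a fixed period of RX silence.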

From domcal.err:

/dev/dhc0w3dB: read gave result -1 and error 5
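The "error 5" here is the errno from the failed read, which matches the -EIO the driver returned. A quick way to decode errno numbers seen in domcal.err, using only the Python standard library (the helper name is hypothetical):

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic name, system message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_errno(5)
print(name, "-", message)
```

The message text comes from the local C library, so it may differ between platforms; the symbolic name for 5 is EIO.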

From domcal.out of dropped DOM:

Noise Rate: 898.900000. Obtaining charge histogram

dglo commented 3 years ago

[jkelley on 2018-01-23 20:34:51] Can I get around this by retrying in the surface client?

dglo commented 3 years ago

[jkelley on 2018-02-12 15:36:23] Cannot replicate on scube even after pushing up large amounts of data in XML transfer.

dglo commented 3 years ago

[jkelley on 2018-02-14 15:59:15] Since this is not reproducible, issues 8357 and 8358 try to work around the problem by recovering from a dropped DOM in the DOMCal surface client.

dglo commented 3 years ago

[jkelley on 2018-02-16 17:18:22] This has happened at least once during MOAT on pcts-hub, but not always.

dglo commented 3 years ago

[jkelley on 2018-03-08 16:10:03] This happened again in 2018, on 15 DOMs. Retrying (device close/open) did not work; the retry failed with "Resource temporarily unavailable":

From domcal.err:

/dev/dhc3w2dA: read gave result -1 and error 5
/dev/dhc3w2dA: retrying...
/dev/dhc3w2dA: Resource temporarily unavailable
Caught runtime error while calibrating DOM 1c55ee2b1695: Failure on receive stream from DOM
/dev/dhc0w1dB: read gave result -1 and error 5
/dev/dhc0w1dB: retrying...
/dev/dhc0w1dB: Resource temporarily unavailable
Caught runtime error while calibrating DOM 60fbc4b4e69e: Failure on receive stream from DOM
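For discussion, here is roughly what a close/reopen retry with backoff could look like. This is a sketch under assumptions, not the DOMCal surface client's code (which is not shown in this issue): the function name, the attempt count, and the delay are all hypothetical, and it treats only EIO and EAGAIN ("Resource temporarily unavailable") as retryable. It also does not address whatever stream state is lost when the driver drops the connection.

```python
import errno
import os
import time

RETRYABLE = (errno.EIO, errno.EAGAIN)


def read_with_reopen(path, nbytes, attempts=3, delay=1.0,
                     opener=os.open, reader=os.read, closer=os.close,
                     sleep=time.sleep):
    """Read up to nbytes from path, reopening the device after EIO/EAGAIN.

    Hypothetical helper. opener/reader/closer/sleep are injectable so the
    logic can be exercised without DOR hardware.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        fd = None
        try:
            fd = opener(path, os.O_RDWR)
            return reader(fd, nbytes)
        except OSError as exc:
            if exc.errno not in RETRYABLE:
                raise  # anything else is not something a reopen would fix
            last_error = exc
        finally:
            if fd is not None:
                closer(fd)  # always release the device before retrying
        if attempt < attempts - 1:
            sleep(delay * (attempt + 1))  # linear backoff between attempts
    raise last_error
```

Note that in the /var/log/messages excerpt in this comment, the reopen of minor 28 starts at 22:42:30 and dh_open reports "Timeout (synchronize failed)" at 22:43:00, so a single reopen attempt can take about 30 seconds to fail. Any retry budget would have to allow for that, and a plain retry may still not be enough if the DOM itself is unresponsive.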

From /var/log/messages (verbose = 1):

...
Mar 7 22:37:50 ichub01 kernel: dh: closed 52B (minor 45)
Mar 7 22:39:58 ichub01 kernel: dh: closed 61A (minor 50)
Mar 7 22:40:04 ichub01 kernel: dh: closed 42B (minor 37)
Mar 7 22:40:20 ichub01 kernel: dh: closed 61B (minor 51)
Mar 7 22:40:27 ichub01 kernel: dh: closed 60A (minor 48)
Mar 7 22:41:07 ichub01 kernel: dh: closed 31B (minor 27)
Mar 7 22:41:37 ichub01 kernel: dh: closed 11B (minor 11)
Mar 7 22:41:38 ichub01 kernel: dh: closed 62B (minor 53)
Mar 7 22:41:52 ichub01 kernel: dh: closed 10A (minor 8)
Mar 7 22:42:06 ichub01 kernel: dh: closed 33B (minor 31)
Mar 7 22:42:19 ichub01 kernel: dh: closed 02B (minor 5)
Mar 7 22:42:24 ichub01 kernel: dh: closed 51B (minor 43)
Mar 7 22:42:25 ichub01 kernel: dh: closed 30B (minor 25)
Mar 7 22:42:30 ichub01 kernel: Holy dead DOM, Batman! Got 'dead DOM' state for card=3 ch=4 minor=28
Mar 7 22:42:30 ichub01 kernel: DUMPING RX FIFO for card 3 ch 4 minor 28
Mar 7 22:42:30 ichub01 kernel: jiffies=12848913334 nonemptyRXjif=12848899984 dt=13350
Mar 7 22:42:30 ichub01 kernel: Displaying register dump prior to draining FIFO...
Mar 7 22:42:30 ichub01 kernel: FPGA registers:
Mar 7 22:42:30 ichub01 kernel: CTRL 0x6800700f
Mar 7 22:42:30 ichub01 kernel: GSTAT 0x00007bc2
Mar 7 22:42:30 ichub01 kernel: DSTAT 0x000030ff
Mar 7 22:42:30 ichub01 kernel: TTSIC 0x0000ffff
Mar 7 22:42:30 ichub01 kernel: RTSIC 0x758a8865
Mar 7 22:42:30 ichub01 kernel: INTEN 0x00010200
Mar 7 22:42:30 ichub01 kernel: DOMS 0xff0000ff
Mar 7 22:42:30 ichub01 kernel: MRAR 0x00095c00
Mar 7 22:42:30 ichub01 kernel: MRTC 0x06000000
Mar 7 22:42:30 ichub01 kernel: MWAR 0x00094004
Mar 7 22:42:30 ichub01 kernel: MWTC 0x06000000
Mar 7 22:42:30 ichub01 kernel: CURL 0xe19e8c14
Mar 7 22:42:30 ichub01 kernel: DCUR 0x3c7030a3
Mar 7 22:42:30 ichub01 kernel: FLASH 0xff2e031c
Mar 7 22:42:30 ichub01 kernel: DOMC 0x00000000
Mar 7 22:42:30 ichub01 kernel: CERR 0x7008005d
Mar 7 22:42:30 ichub01 kernel: DCREV 0x00000024
Mar 7 22:42:30 ichub01 kernel: FREV 0x09010471
Mar 7 22:42:30 ichub01 kernel: idump=0000 rtsic=0x758a8865 fifo=0xd92dd7c1
Mar 7 22:42:30 ichub01 kernel: idump=0001 rtsic=0x758a8875 fifo=0x36343833
...
Mar 7 22:42:30 ichub01 kernel: dh_poll: minor 28 is in error state.
Mar 7 22:42:30 ichub01 kernel: dh: Unrecoverable error, dropping connection on minor 28
Mar 7 22:42:30 ichub01 kernel: dh: closed 32A (minor 28)
Mar 7 22:42:30 ichub01 kernel: dh: opening 32A (minor 28)
Mar 7 22:42:35 ichub01 kernel: dh: HARDWARE TIMEOUT on minor 49, did comms. reset
Mar 7 22:42:35 ichub01 kernel: dh: minor 49, DOMS ok, comms reset succeeded.
Mar 7 22:42:37 ichub01 kernel: dh: closed 33A (minor 30)
Mar 7 22:42:38 ichub01 kernel: dh: closed 03B (minor 7)
Mar 7 22:42:44 ichub01 kernel: dh: closed 20B (minor 17)
Mar 7 22:42:47 ichub01 kernel: dh: closed 53A (minor 46)
Mar 7 22:43:00 ichub01 kernel: dh_open: Timeout (synchronize failed) on minor 28.
Mar 7 22:43:00 ichub01 kernel: FPGA registers:
Mar 7 22:43:00 ichub01 kernel: CTRL 0x6800700f
Mar 7 22:43:00 ichub01 kernel: GSTAT 0x00007bc2
Mar 7 22:43:00 ichub01 kernel: DSTAT 0x000030ff
Mar 7 22:43:00 ichub01 kernel: TTSIC 0x0000ffff
Mar 7 22:43:00 ichub01 kernel: RTSIC 0x25ca8a35
Mar 7 22:43:00 ichub01 kernel: INTEN 0x00010200
Mar 7 22:43:00 ichub01 kernel: DOMS 0xff0000ff
Mar 7 22:43:00 ichub01 kernel: MRAR 0x00095400
Mar 7 22:43:00 ichub01 kernel: MRTC 0x05000000
Mar 7 22:43:00 ichub01 kernel: MWAR 0x00095804
Mar 7 22:43:00 ichub01 kernel: MWTC 0x05000000
Mar 7 22:43:00 ichub01 kernel: CURL 0xe19e8c14
Mar 7 22:43:00 ichub01 kernel: DCUR 0x3c70309c
Mar 7 22:43:00 ichub01 kernel: FLASH 0xff2e031c
Mar 7 22:43:00 ichub01 kernel: DOMC 0x00000000
Mar 7 22:43:00 ichub01 kernel: CERR 0x7008006f
Mar 7 22:43:00 ichub01 kernel: DCREV 0x00000024
Mar 7 22:43:00 ichub01 kernel: FREV 0x09010471
Mar 7 22:43:02 ichub01 kernel: dh: closed 62A (minor 52)
Mar 7 22:43:11 ichub01 kernel: dh: closed 32B (minor 29)
Mar 7 22:43:27 ichub01 kernel: dh: closed 60B (minor 49)
Mar 7 22:43:32 ichub01 kernel: dh: closed 22B (minor 21)
Mar 7 22:43:37 ichub01 kernel: dh: closed 13A (minor 14)
Mar 7 22:43:44 ichub01 kernel: dh: closed 43B (minor 39)
Mar 7 22:43:45 ichub01 kernel: dh: closed 00A (minor 0)
Mar 7 22:43:45 ichub01 kernel: dh: closed 01A (minor 2)
Mar 7 22:43:47 ichub01 kernel: dh: closed 63A (minor 54)
Mar 7 22:44:00 ichub01 kernel: Holy dead DOM, Batman! Got 'dead DOM' state for card=0 ch=3 minor=3
Mar 7 22:44:00 ichub01 kernel: DUMPING RX FIFO for card 0 ch 3 minor 3
Mar 7 22:44:00 ichub01 kernel: jiffies=12849003265 nonemptyRXjif=12848989915 dt=13350
Mar 7 22:44:00 ichub01 kernel: Displaying register dump prior to draining FIFO...
Mar 7 22:44:00 ichub01 kernel: FPGA registers:
Mar 7 22:44:00 ichub01 kernel: CTRL 0x6800300f
Mar 7 22:44:00 ichub01 kernel: GSTAT 0x00007bc2
Mar 7 22:44:00 ichub01 kernel: DSTAT 0x000000ff
Mar 7 22:44:00 ichub01 kernel: TTSIC 0x0000ffff
Mar 7 22:44:00 ichub01 kernel: RTSIC 0x5aa5a852
Mar 7 22:44:00 ichub01 kernel: INTEN 0x00010200
Mar 7 22:44:00 ichub01 kernel: DOMS 0xff0000ff
Mar 7 22:44:00 ichub01 kernel: MRAR 0x00089c00
Mar 7 22:44:00 ichub01 kernel: MRTC 0x06000000
Mar 7 22:44:00 ichub01 kernel: MWAR 0x00088004
Mar 7 22:44:00 ichub01 kernel: MWTC 0x06000000
Mar 7 22:44:00 ichub01 kernel: CURL 0xe19e8c14
Mar 7 22:44:00 ichub01 kernel: DCUR 0x3c62309d
Mar 7 22:44:00 ichub01 kernel: FLASH 0xff2e031e
Mar 7 22:44:00 ichub01 kernel: DOMC 0x00000000
Mar 7 22:44:00 ichub01 kernel: CERR 0x70080030
Mar 7 22:44:00 ichub01 kernel: DCREV 0x00000024
Mar 7 22:44:00 ichub01 kernel: FREV 0x09010471
Mar 7 22:44:00 ichub01 kernel: idump=0000 rtsic=0x5aa5a852 fifo=0xeddf5bdf
Mar 7 22:44:00 ichub01 kernel: idump=0001 rtsic=0x5aa5a852 fifo=0x7fbfff76
...
Mar 7 22:44:00 ichub01 kernel: idump=0146 rtsic=0x5aa5a05a fifo=0xbfff5ebf
Mar 7 22:44:00 ichub01 kernel: dh_poll: minor 3 is in error state.
Mar 7 22:44:00 ichub01 kernel: dh: Unrecoverable error, dropping connection on minor 3
Mar 7 22:44:00 ichub01 kernel: dh: closed 01B (minor 3)
Mar 7 22:44:00 ichub01 kernel: dh: opening 01B (minor 3)
Mar 7 22:44:15 ichub01 kernel: dh: closed 50A (minor 40)
Mar 7 22:44:19 ichub01 kernel: dh: closed 41A (minor 34)
Mar 7 22:44:30 ichub01 kernel: dh_open: Timeout (synchronize failed) on minor 3.
Mar 7 22:44:30 ichub01 kernel: FPGA registers:
Mar 7 22:44:30 ichub01 kernel: CTRL 0x6800300f
Mar 7 22:44:30 ichub01 kernel: GSTAT 0x00007bc2
Mar 7 22:44:30 ichub01 kernel: DSTAT 0x000000ff
Mar 7 22:44:30 ichub01 kernel: TTSIC 0x0000ffff
Mar 7 22:44:30 ichub01 kernel: RTSIC 0x52a5a55a
Mar 7 22:44:30 ichub01 kernel: INTEN 0x00010200
Mar 7 22:44:30 ichub01 kernel: DOMS 0xff0000ff
Mar 7 22:44:30 ichub01 kernel: MRAR 0x0008a400
Mar 7 22:44:30 ichub01 kernel: MRTC 0x03000000
Mar 7 22:44:30 ichub01 kernel: MWAR 0x0008a804
Mar 7 22:44:30 ichub01 kernel: MWTC 0x03000000
Mar 7 22:44:30 ichub01 kernel: CURL 0xe19e8c14
Mar 7 22:44:30 ichub01 kernel: DCUR 0x3c63309e
Mar 7 22:44:30 ichub01 kernel: FLASH 0xff2e031e
Mar 7 22:44:30 ichub01 kernel: DOMC 0x00000000
Mar 7 22:44:30 ichub01 kernel: CERR 0x70080030
Mar 7 22:44:30 ichub01 kernel: DCREV 0x00000024
Mar 7 22:44:30 ichub01 kernel: FREV 0x09010471
Mar 7 22:44:31 ichub01 kernel: dh: closed 12A (minor 12)
Mar 7 22:44:42 ichub01 kernel: dh: closed 51A (minor 42)
Mar 7 22:44:48 ichub01 kernel: dh: closed 13B (minor 15)
Mar 7 22:44:51 ichub01 kernel: dh: closed 20A (minor 16)
Mar 7 22:45:14 ichub01 kernel: dh: closed 22A (minor 20)
Mar 7 22:45:30 ichub01 kernel: dh: closed 42A (minor 36)
Mar 7 22:46:02 ichub01 kernel: dh: closed 02A (minor 4)
Mar 7 22:46:11 ichub01 kernel: dh: closed 31A (minor 26)
Mar 7 22:46:11 ichub01 kernel: dh: closed 30A (minor 24)
Mar 7 22:46:14 ichub01 kernel: dh: closed 21A (minor 18)
Mar 7 22:46:18 ichub01 kernel: dh: closed 41B (minor 35)
Mar 7 22:46:20 ichub01 kernel: dh: closed 23A (minor 22)
Mar 7 22:46:35 ichub01 kernel: dh: closed 50B (minor 41)
Mar 7 22:46:39 ichub01 kernel: dh: HARDWARE TIMEOUT on minor 44, did comms. reset
Mar 7 22:46:39 ichub01 kernel: dh: minor 44, DOMS ok, comms reset succeeded.
Mar 7 22:46:45 ichub01 kernel: dh: closed 10B (minor 9)
Mar 7 22:46:57 ichub01 kernel: dh: closed 00B (minor 1)
Mar 7 22:46:58 ichub01 kernel: dh: closed 21B (minor 19)
Mar 7 22:47:31 ichub01 kernel: dh: closed 52A (minor 44)
Mar 7 22:47:57 ichub01 kernel: dh: closed 11A (minor 10)
Mar 7 22:47:59 ichub01 kernel: dh: closed 40B (minor 33)
Mar 7 22:48:13 ichub01 kernel: dh: closed 43A (minor 38)
Mar 7 22:48:14 ichub01 kernel: dh: closed 53B (minor 47)
Mar 7 22:49:52 ichub01 kernel: dh: closed 03A (minor 6)
Mar 7 22:50:16 ichub01 kernel: dh: closed 23B (minor 23)
Mar 7 22:51:13 ichub01 kernel: dh: closed 12B (minor 13)
Mar 7 22:51:22 ichub01 kernel: dh: closed 40A (minor 32)
Mar 7 23:34:39 ichub01 kernel: dh: Powering off all DOMs.
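To line up domcal.err device paths with the minors in /var/log/messages, the following mapping seems to hold. It is inferred only from the log lines in this thread (for example /dev/dhc3w2dA alongside "32A (minor 28)" and "card=3 ch=4"), not from the driver source, so treat it as an assumption: 8 channels per card, channel = 2 * wire pair + (0 for A, 1 for B), minor = 8 * card + channel. The function names are hypothetical.

```python
import re

# Device paths as they appear in domcal.err, e.g. /dev/dhc3w2dA
DEVICE_RE = re.compile(r"^/dev/dhc(\d)w(\d)d([AB])$")


def device_to_minor(path):
    """Return (card, channel, minor) for a /dev/dhcXwYdZ path.

    Assumes 8 channels per card and A/B DOMs alternating within a wire
    pair, as inferred from the kernel log lines in this issue.
    """
    match = DEVICE_RE.match(path)
    if match is None:
        raise ValueError(f"unrecognized device path: {path!r}")
    card = int(match.group(1))
    pair = int(match.group(2))
    channel = 2 * pair + (1 if match.group(3) == "B" else 0)
    return card, channel, 8 * card + channel


def minor_to_label(minor):
    """Return the card/pair/DOM label the driver prints, e.g. 28 -> '32A'."""
    card, channel = divmod(minor, 8)
    pair, dom = divmod(channel, 2)
    return f"{card}{pair}{'AB'[dom]}"


print(device_to_minor("/dev/dhc3w2dA"))  # (3, 4, 28)
print(minor_to_label(3))                 # 01B
```

With this mapping, the two failures in domcal.err (/dev/dhc3w2dA and /dev/dhc0w1dB) correspond to the two dead-DOM dumps for minor 28 and minor 3, and /dev/dhc0w3dB from the 2017 report corresponds to minor 7.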