dglo / dor-driver

Test upload of Mantis issues

[kaeld on 2007-05-10 17:19:21] : DOM state change failures softboot->domapp [was: StringHub - DOMs may not be softbooting] #31

Open dglo opened 3 years ago

dglo commented 3 years ago

Some DOMs are dropping out of the run (timeouts in the 'transitionToDomApp' call). This appears to be a softboot issue.

dglo commented 3 years ago

[kaeld on 2007-05-17 14:12:20] Thorsten S. looked into this problem at SPS on 5/16 (see http://docushare.icecube.wisc.edu/docushare/dsweb/View/Bulletin-833?init=true). It's really a hardware problem. His study concludes that one can COMM RESET + SOFTBOOT + COMM RESET to ALWAYS get a successful softboot. This needs to be implemented in the StringHub.

dglo commented 3 years ago

[kaeld on 2007-05-17 14:24:32] commit #1398 (whoa, creepy!) fixes this issue.

dglo commented 3 years ago

[kaeld on 2007-06-04 23:58:56] We still see several DOMs that drop out in the monitoring pages. This appears to be having a macroscopic effect on the IceCube array event trigger rate! Looking at 107980 the event rate is 535 Hz instead of the usual 539 Hz. The following DOMs have been dropped ...

39-04 49-04 56-06 57-01 59-05

The datacollector reports

stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007]java.nio.channels.ClosedByInterruptException
stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007] at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184)
stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007] at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:150)
stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007] at icecube.daq.domapp.DOMIO.recv(DOMIO.java:70)
stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007] at icecube.daq.domapp.DOMApp.transitionToDOMApp(DOMApp.java:578)
stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007] at icecube.daq.domapp.DataCollector.runcore(DataCollector.java:752)
stringHub STDERR-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007] at icecube.daq.domapp.DataCollector.run(DataCollector.java:695)
stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Sat Jun 02 01:30:41 UTC 2007]Intercepted error in DataCollector runcore: java.nio.channels.ClosedByInterruptException

which means that the comms were lost at some point - this happens 10 s after the startup so the comm loss is at the beginning of the run. what is really strange is that i see

stringHub icecube.daq.domapp.DOMApp-DataCollector-02A ERROR [Sat Jun 02 01:30:31 UTC 2007]Return message type/subtype does not match outgoing message (32, 73).

just above - this i think means that the datacollector was communicating with the dom but then for some reason the transition into domapp was not successful

dglo commented 3 years ago

[kaeld on 2007-06-04 17:25:34] John J suggests dumping the FPGA procfile (card-wise) and the COMSTAT procfile.

dglo commented 3 years ago

[jacobsen on 2007-06-05 16:23:17] There should, perhaps, be a domapp python test to try to see this behavior on PCTS and scube (and SPS). Is it clear that the softboot is succeeding (DOM makes it to iceboot), but the transition to domapp is failing?

dglo commented 3 years ago

[kaeld on 2007-06-11 14:27:14] checkin of 1590 adds some timed sleeps to the COMM RESET / SOFTBOOT / COMM RESET segment of the DataCollector code. this may help / alleviate ... when debugging at pole on friday i found that the behavior is

a. i do the reset / boot / reset
b. about 30 s later the file cannot be opened (resources unavailable error from driver)
c. the /var/log/messages from this time include messages like

Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (671) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (4041) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (1919) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (2216) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (3561) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (3314) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (3806) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (1853) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (3327) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:39 sps-ichub39 kernel: dh: softboot 01B succeeded
Jun 8 00:57:39 sps-ichub39 kernel: dh_dom_isready: DOM 3 is ready in DOMS.
Jun 8 00:57:39 sps-ichub39 kernel: dh_iscomm_write_proc: Reset succeeded (card 0, pair 1, dom 1).
Jun 8 00:57:39 sps-ichub39 kernel: dh: opening 01B (minor 3)
Jun 8 00:57:39 sps-ichub39 kernel: dh: softboot 11B succeeded
Jun 8 00:57:39 sps-ichub39 kernel: dh_dom_isready: DOM 3 is ready in DOMS.
Jun 8 00:57:39 sps-ichub39 kernel: dh_iscomm_write_proc: Reset succeeded (card 1, pair 1, dom 1).
Jun 8 00:57:39 sps-ichub39 kernel: dh: opening 11B (minor 11)
Jun 8 00:57:39 sps-ichub39 kernel: dh: softboot 03B succeeded
Jun 8 00:57:39 sps-ichub39 kernel: dh_dom_isready: DOM 7 is ready in DOMS.
Jun 8 00:57:39 sps-ichub39 kernel: dh_iscomm_write_proc: Reset succeeded (card 0, pair 3, dom 1).
Jun 8 00:57:39 sps-ichub39 kernel: dh: opening 03B (minor 7)
Jun 8 00:57:39 sps-ichub39 kernel: dh: softboot 10B succeeded
Jun 8 00:57:39 sps-ichub39 kernel: dh_dom_isready: DOM 1 is ready in DOMS.
Jun 8 00:57:39 sps-ichub39 kernel: dh_iscomm_write_proc: Reset succeeded (card 1, pair 0, dom 1).
Jun 8 00:57:39 sps-ichub39 kernel: dh: opening 10B (minor 9)
Jun 8 00:57:39 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (1789) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:40 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (4064) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:40 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (2130) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:40 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (2795) on card 0, channel 2. Returning 0-length (bogus) packet.
Jun 8 00:57:40 sps-ichub39 kernel: dh_enqueue_message_from_rx_fifo: Corrupt bytecount (1718) on card 0, channel 2. Returning 0-length (bogus) packet.

dglo commented 3 years ago

[jacobsen on 2007-06-11 19:44:27] I stole this issue since it seems to be below the level of stringhub.

dglo commented 3 years ago

[jacobsen on 2007-06-11 19:44:50] change to "new" because I like everything to be the same color

dglo commented 3 years ago

[jacobsen on 2007-06-11 19:45:17] For DOM testing (from Mark):

i look at the monitoring pages to see which ones are commonly missing.

i remember these DOMs being the big offenders:

57-01 Getkobben
59-01 Chorophobia
39-04 Trichopathophobia

mark

dglo commented 3 years ago

[jacobsen on 2007-06-12 00:53:06] Hi Kael,

I had a chance to look at these messages - it means that after softboot, data is corrupt or lost inside the DOR firmware FIFOs. Suspect DOR firmware problem. Will code up a test and pursue tests on SPS as discussed today.

John

dglo commented 3 years ago

[jacobsen on 2007-06-12 14:32:38] Test comment for Thorsten

dglo commented 3 years ago

[jacobsen on 2007-06-14 02:59:25] After a second night of tests at SPS, I have learned a few more things about the problem of dropped DOMs (this will primarily be of interest to Kael, Kalle, Thorsten and Azriel).

For reference this is Mantis issue 1398.

Lessons learned last night and tonight:

(1) my low-level python-based domapp tests were unable to clearly reproduce the problem in the way I expected to see it, so the problem may be very timing-dependent.

(2) running pDAQ and looking at the driver messages, there seem to be two types of failures after the softboot operation, which have distinct signatures.

Type 1: Corrupt data appears in the FIFO. If this is quickly and forcibly drained by the driver, pDAQ/StringHub seem to keep the DOM in the run with no exceptions thrown in the logs; if the FIFO is not drained (the normal dor-driver behavior), StringHub sometimes throws a ClosedByInterruptException and closes the /dev file for that DOM (thereby dropping it from the run).

Type 2: No corrupt data is seen in the FIFO, but the DOM /dev file cannot be opened by StringHub. There does not seem to be a hardware timeout.

I believe that the following two workarounds will address both type 1 and type 2 failures. However, they are workarounds and do not address the underlying low-level behavior, which is still pathological, in my opinion.

Workaround (a) for Type 1 - dor-driver completely and thoroughly drains the RX FIFO whenever any corrupt data is found, rather than in the somewhat more "relaxed" fashion which it currently uses (toss away 4 bytes at a time as needed until an intact packet is found).
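The difference between the two draining strategies can be sketched in a few lines of Python. This is an illustration only, not the dh.c code: the 4-byte header, the little-endian length field, the `MAX_BYTECOUNT` limit and all function names are assumptions made for the example.

```python
from collections import deque

HEADER_LEN = 4        # assumed header size in bytes (hypothetical)
MAX_BYTECOUNT = 512   # assumed largest sane payload length (hypothetical)


def header_ok(fifo):
    """True if the FIFO starts with a plausible header and a complete payload."""
    if len(fifo) < HEADER_LEN:
        return False
    bytecount = fifo[0] | (fifo[1] << 8)  # assumed little-endian length field
    return bytecount <= MAX_BYTECOUNT and len(fifo) >= HEADER_LEN + bytecount


def relaxed_resync(fifo):
    """Toss 4 bytes at a time until the head of the FIFO looks like a packet."""
    discarded = 0
    while fifo and not header_ok(fifo):
        for _ in range(min(4, len(fifo))):
            fifo.popleft()
            discarded += 1
    return discarded


def thorough_drain(fifo):
    """Workaround (a): empty the whole FIFO as soon as corruption is seen."""
    discarded = len(fifo)
    fifo.clear()
    return discarded


garbage = [0xFF] * 8
packet = [2, 0, 0, 0, 0xAA, 0xBB]  # bytecount=2, then two payload bytes

fifo = deque(garbage + packet)
print("relaxed:", relaxed_resync(fifo), "bytes dropped, left:", list(fifo))

fifo = deque(garbage + packet)
print("thorough:", thorough_drain(fifo), "bytes dropped, left:", list(fifo))
```

The relaxed version stops at the first thing that looks like a header, so it can resynchronize on garbage that happens to pass the length check. The thorough version gives up any intact packets still queued in exchange for a known-clean FIFO.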

Workaround (b) for Type 2 - StringHub adopts the following hideously ugly sequence for opening dev files after softboot:

Extensive debugging notes with register dumps, etc., attached.

In summary, we know a little more about the problem and we have a workaround, but there is low-level behavior that is not understood and not, unfortunately, particularly easy to duplicate outside of running pDAQ.

Suggested action items:

(1) Kael and JJ implement and test the workarounds

(2) Thorsten and Kalle continue to request debugging information from JJ so that they/we all can study the problem(s) (based on register dumps, etc.) when it occurs

Not sure how else to proceed on this particularly thorny issue; a decision should get made as to whether to pursue workarounds or, instead, a complete low-level understanding (which could take a lot of time and effort over the satellite).

dglo commented 3 years ago

[jacobsen on 2007-06-14 18:51:25] Workaround implemented on driver side for Type 1 failures per Azriel:

cvs commit: Examining .
cvs commit: Examining domhub-testing
cvs commit: Examining domhub-tools
cvs commit: Examining testing
Checking in RELEASE_NOTES;
/home/icecube/cvsroot/dor-driver/driver/RELEASE_NOTES,v <-- RELEASE_NOTES
new revision: 1.118; previous revision: 1.117
done
Checking in dh.c;
/home/icecube/cvsroot/dor-driver/driver/dh.c,v <-- dh.c
new revision: 1.376; previous revision: 1.375
done

dglo commented 3 years ago

[jacobsen on 2007-06-14 18:54:06] /home/icecube/cvsroot/dor-driver/driver/dh.c,v <-- dh.c new revision: 1.377; previous revision: 1.376

dglo commented 3 years ago

[dglo on 2007-06-14 19:34:28] -- Retry DOMApp open if first attempt fails

Sending src/main/java/icecube/daq/domapp/DataCollector.java
Committed revision 1614.

dglo commented 3 years ago

[jacobsen on 2007-06-14 19:36:54] There is still a low level issue here for which I want to keep the issue open...

dglo commented 3 years ago

[dglo on 2007-06-14 20:19:26] There's a fix in play

dglo commented 3 years ago

[jacobsen on 2007-06-14 20:34:30] Moved project to dor-driver since it's below stringhub (probably)

dglo commented 3 years ago

[jacobsen on 2007-06-15 22:27:56] The dropped DOM situation is much improved, with the rate of dropped DOMs down to 1/run (from nearly 4/run).

Type 1 and type 2 run failures seem to have gone away with the V02-11-03 driver and Highland-07 StringHub workarounds, as far as I could tell. The monitoring pages show 9 dropped DOMs in 9 runs, all of which are accounted for by two new failure modes.

In "type 3" failures, the DOM /dev file does open successfully; however, about 14 seconds later the /dev file is closed by StringHub, which outputs a "ClosedByInterruptException." Dave and I examined the code, and StringHub is trying to read data from domapp, which does not arrive in the expected way; a watchdog timer tells the thread to terminate. This suggests an expansion to the existing workaround, which would retry the entire sequence of "comms reset, softboot, comms reset, open, transition to domapp, get mainboard ID" in a retry loop - i.e. if either open fails, or get MBID fails, do the whole thing over again. Or the code which manages the transition to domapp could be made more robust (IMO).
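The proposed retry loop might look roughly like the sketch below. It is written in Python for brevity rather than in StringHub's Java, and every name in it (`start_dom`, `DomStartupError`, the methods on `dom`, the `FlakyDom` stand-in and its made-up mainboard ID) is hypothetical. The only thing taken from the discussion above is the order of operations and the idea of restarting the whole sequence on any failure.

```python
class DomStartupError(Exception):
    """Raised when a DOM cannot be brought into domapp after all trials."""


def start_dom(dom, max_trials=3):
    """Run the full startup sequence, restarting from the top on any failure.

    `dom` is a hypothetical object; each step is expected to raise OSError
    (TimeoutError is a subclass of OSError) when it fails.
    """
    last_error = None
    for _ in range(max_trials):
        try:
            dom.comm_reset()
            dom.softboot()
            dom.comm_reset()
            dom.open()
            dom.transition_to_domapp()
            return dom.get_mainboard_id()
        except OSError as exc:
            last_error = exc
            dom.close()  # release the /dev file before trying again
    raise DomStartupError(f"no reply after {max_trials} trials") from last_error


class FlakyDom:
    """Stand-in for testing: get_mainboard_id() times out `failures` times."""

    def __init__(self, failures):
        self.failures = failures
        self.softboots = 0

    def comm_reset(self):
        pass

    def softboot(self):
        self.softboots += 1

    def open(self):
        pass

    def close(self):
        pass

    def transition_to_domapp(self):
        pass

    def get_mainboard_id(self):
        if self.failures > 0:
            self.failures -= 1
            raise TimeoutError("no reply to get-mainboard-ID")
        return "0123456789ab"  # made-up ID for the example


dom = FlakyDom(failures=1)
print(start_dom(dom), "after", dom.softboots, "softboots")
```

Retrying the whole sequence rather than a single step means a failure late in the chain (for example, no reply to the mainboard ID request) still triggers a fresh comms reset and softboot.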

"Type 4" failures are limited to 66-45 "Alpaca," which goes uncommunicative at some variable time into the run after successful configuration. Azriel has asked us to remove Alpaca from our configurations.

The "Type 3" workaround in StringHub may be a bit trickier than yesterday's change, so Dave and I will discuss the best way to proceed on Monday.

I'll finish with a summary of dropped DOM failure modes to date:

Type 1:

Type 2:

Type 3:

Type 4:

dglo commented 3 years ago

[jacobsen on 2007-06-26 21:06:56] JJ and Dave put in a fix to retry the entire softboot->iceboot->domapp sequence if the DOM doesn't reply to a "get DOMMB ID" request the first time (issues 1578, 1561).

What remains is to figure out what's actually going on when domapp is uncommunicative.

dglo commented 3 years ago

[jacobsen on 2007-07-03 02:31:28] Kalle's note from 6/21/07 indicates there is something not 100% clean in 104q DOR firmware - suggests trying a new dor-driver at Pole at least to see if this fixes the root problem:

hi guys,

i changed 104q a little bit:

  1. the rx dual port ram (fifo) is designed in a cleaner way although i didn't find a real bug in the older version.
  2. 104q had no timing requirements for critical local bus signals. this is fixed now and might be more important.

if you like, give it a try and let me know. i believe it's not a big risk.

thorsten i compiled with 7.1. but 7.0 should be good enough.

thanks, kalle

ps: com_105q.zip is on glacier

dglo commented 3 years ago

[jacobsen on 2007-07-11 19:54:24] Example debugging:

[jacobsen@pub runs]$ find 108888 -name 'stringHub.log' -exec grep -H -v "Return message" {} \; | grep -v "Start of log" 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Tue Jul 10 22:44:09 UTC 2007]Timeout on trial 0 getting DOM ID 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:java.nio.channels.ClosedByInterruptException 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:184) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at sun.nio.ch.FileChannelImpl.read(FileChannelImpl.java:150) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at icecube.daq.domapp.DOMIO.recv(DOMIO.java:71) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at icecube.daq.domapp.DOMApp.sendMessage(DOMApp.java:322) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at icecube.daq.domapp.DOMApp.sendMessage(DOMApp.java:289) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at icecube.daq.domapp.DOMApp.getMainboardID(DOMApp.java:183) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at icecube.daq.domapp.DataCollector.runcore(DataCollector.java:815) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: at icecube.daq.domapp.DataCollector.run(DataCollector.java:711) 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Tue Jul 10 22:44:09 UTC 2007]Driver comstat for 02A: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:/dev/dhc0w2dA 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:RX: 81B, MSGS=7 NINQ=0 PKTS=22 ACKS=3 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: BADPKT=12 BADHDR=11 BADSEQ=0 NCTRL=0 NCI=2 NIC=59 
108888_20070711_021525_012655/daqrun108888/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=32 PKTS=43 ACKS=7 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: NACKQ=0 NRETXB=0 RETXB_BYTES=0 NRETXQ=0 NCTRL=0 NCI=11 NIC=0 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: NCONNECTS=0 NHDWRTIMEOUTS=0 OPEN=FALSE CONNECTED=FALSE 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: RXFIFO=empty TXFIFO=almost empty,empty DOM_RXFIFO=notfull 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Tue Jul 10 22:44:09 UTC 2007]FPGA regs for card 0: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:FPGA registers: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:CTRL 0x6800700f 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:GSTAT 0x00003dc2 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:DSTAT 0x000000ff 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:TTSIC 0x0000ffff 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:RTSIC 0xef0000ff 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:INTEN 0x00010200 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:DOMS 0xff0000ff 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:MRAR 0x36be6c00 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:MRTC 0x04000000 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:MWAR 0x36be9804 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:MWTC 0x07000000 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:CURL 0xe19e8c14 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:DCUR 0x3c6930b1 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:FLASH 0xff2e0324 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:DOMC 0x00000000 
108888_20070711_021525_012655/daqrun108888/stringHub-56.log:CERR 0x400b000c 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:DCREV 0x00000024 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:FREV 0x09010471 108888_20070711_021525_012655/daqrun108888/stringHub-56.log: 108888_20070711_021525_012655/daqrun108888/stringHub-56.log:stringHub icecube.daq.stringhub.DOMConnector-StateTask ERROR [Tue Jul 10 22:44:11 UTC 2007]Configure timed out.

dglo commented 3 years ago

[jacobsen on 2007-07-11 20:15:54] In 7/7 runs, presence of dropped doms at startup correlates with dropped packets on the DOR RX side (BADPKT, sometimes BADSEQ and BADHDR), and RESENT packets (dropped on the DOR TX side).

[jacobsen@pub runs]$ find 10888[1-9] -name 'stringHub.log' -exec grep -H RESENT {} \;
108881_20070709_054316_009846/daqrun108881/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=30 PKTS=41 ACKS=7
108881_20070709_054316_009846/daqrun108881/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=30 PKTS=41 ACKS=7
108881_20070709_054316_009846/daqrun108881/stringHub-57.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=31 PKTS=43 ACKS=8
108883_20070709_214815_028811/daqrun108883/stringHub-39.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=30 PKTS=41 ACKS=7
108885_20070710_064015_028813/daqrun108885/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=31 PKTS=42 ACKS=7
108887_20070710_224310_028811/daqrun108887/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=32 PKTS=43 ACKS=7
108887_20070710_224310_028811/daqrun108887/stringHub-48.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=30 PKTS=41 ACKS=7
108888_20070711_021525_012655/daqrun108888/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=32 PKTS=43 ACKS=7

[jacobsen@pub runs]$ find 10888[1-9] -name 'stringHub.log' -exec grep -H BAD {} \;
108881_20070709_054316_009846/daqrun108881/stringHub-56.log: BADPKT=9 BADHDR=0 BADSEQ=0 NCTRL=0 NCI=2 NIC=65
108881_20070709_054316_009846/daqrun108881/stringHub-56.log: BADPKT=10 BADHDR=3 BADSEQ=0 NCTRL=0 NCI=2 NIC=55
108881_20070709_054316_009846/daqrun108881/stringHub-57.log: BADPKT=9 BADHDR=0 BADSEQ=9 NCTRL=0 NCI=2 NIC=59
108883_20070709_214815_028811/daqrun108883/stringHub-39.log: BADPKT=8 BADHDR=8 BADSEQ=0 NCTRL=0 NCI=2 NIC=47
108885_20070710_064015_028813/daqrun108885/stringHub-56.log: BADPKT=8 BADHDR=3 BADSEQ=0 NCTRL=0 NCI=2 NIC=58
108887_20070710_224310_028811/daqrun108887/stringHub-56.log: BADPKT=12 BADHDR=0 BADSEQ=0 NCTRL=0 NCI=2 NIC=68
108887_20070710_224310_028811/daqrun108887/stringHub-48.log: BADPKT=8 BADHDR=5 BADSEQ=1 NCTRL=0 NCI=2 NIC=51
108888_20070711_021525_012655/daqrun108888/stringHub-56.log: BADPKT=12 BADHDR=11 BADSEQ=0 NCTRL=0 NCI=2 NIC=59

[jacobsen@pub runs]$ find 10888[1-9] -name 'stringHub.log' -exec grep -H Timeout {} \;
108881_20070709_054316_009846/daqrun108881/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-23A ERROR [Mon Jul 09 02:58:37 UTC 2007]Timeout on trial 0 getting DOM ID
108881_20070709_054316_009846/daqrun108881/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Mon Jul 09 02:58:37 UTC 2007]Timeout on trial 0 getting DOM ID
108881_20070709_054316_009846/daqrun108881/stringHub-57.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-00B ERROR [Mon Jul 09 02:58:47 UTC 2007]Timeout on trial 0 getting DOM ID
108883_20070709_214815_028811/daqrun108883/stringHub-39.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-01A ERROR [Mon Jul 09 13:47:41 UTC 2007]Timeout on trial 0 getting DOM ID
108885_20070710_064015_028813/daqrun108885/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Mon Jul 09 22:39:39 UTC 2007]Timeout on trial 0 getting DOM ID
108887_20070710_224310_028811/daqrun108887/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Tue Jul 10 14:42:35 UTC 2007]Timeout on trial 0 getting DOM ID
108887_20070710_224310_028811/daqrun108887/stringHub-48.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-22A ERROR [Tue Jul 10 14:42:35 UTC 2007]Timeout on trial 0 getting DOM ID
108888_20070711_021525_012655/daqrun108888/stringHub-56.log:stringHub icecube.daq.domapp.DataCollector-DataCollector-02A ERROR [Tue Jul 10 22:44:09 UTC 2007]Timeout on trial 0 getting DOM ID

However, there are NO hardware timeouts.
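For anyone repeating this comparison over more runs, the `NAME=value` counters in the grep output can be pulled out with a short script instead of by eye. The sketch below is one possible way to do it; the function names are made up for the example, and the sample lines are copied from the grep output in this thread.

```python
import re
from collections import defaultdict

# Matches comstat counters such as RESENT=32 or BADPKT=12.
COUNTER = re.compile(r"\b([A-Z_]+)=(\d+)")


def parse_grep_line(line):
    """Split a 'grep -H' line into (log path, {counter name: value})."""
    path, _, rest = line.partition(":")
    counters = {name: int(value) for name, value in COUNTER.findall(rest)}
    return path, counters


def worst_per_log(lines, names=("RESENT", "BADPKT", "BADHDR", "BADSEQ")):
    """Keep the largest value seen for each counter of interest, per log file."""
    worst = defaultdict(dict)
    for line in lines:
        path, counters = parse_grep_line(line)
        for name in names:
            if name in counters:
                worst[path][name] = max(worst[path].get(name, 0), counters[name])
    return dict(worst)


sample = [
    "108881_20070709_054316_009846/daqrun108881/stringHub-57.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=31 PKTS=43 ACKS=8",
    "108881_20070709_054316_009846/daqrun108881/stringHub-57.log: BADPKT=9 BADHDR=0 BADSEQ=9 NCTRL=0 NCI=2 NIC=59",
    "108888_20070711_021525_012655/daqrun108888/stringHub-56.log:TX: 26B, MSGS=4 NOUTQ=0 RESENT=32 PKTS=43 ACKS=7",
    "108888_20070711_021525_012655/daqrun108888/stringHub-56.log: BADPKT=12 BADHDR=11 BADSEQ=0 NCTRL=0 NCI=2 NIC=59",
]

for path, counters in worst_per_log(sample).items():
    print(path, counters)
```

Taking the maximum per log file is a simplification: a hub log can contain comstat dumps for more than one DOM, so a fuller version would key on the DOM as well.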

dglo commented 3 years ago

[tstezelberger on 2007-07-13 15:23:51] Log of softboot tests on SPS

The test script:

  1. DOM is in iceboot
  2. load domapp FPGA
  3. verify DOMs are still in iceboot (software) (this should detect if a DOM dropped out)
  4. test domapp fpga is loaded
  5. write loop number to DOMs
  6. softboot DOMs
  7. verify DOMs are in iceboot (software) (this should detect if a DOM dropped out)
  8. test stf fpga is loaded
  9. GOTO 1.
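The loop above can be written down as a short Python sketch. This is not the actual test script: the `dom` methods (`in_iceboot`, `load_fpga`, `loaded_fpga`, `write_loop_number`, `softboot`) and the `FakeDom` stand-in are hypothetical names chosen to mirror the numbered steps.

```python
def run_cycle(dom, loop_number):
    """One pass through steps 1-8; returns a failure description or None."""
    if not dom.in_iceboot():
        return "step 1: DOM not in iceboot"
    dom.load_fpga("domapp")
    if not dom.in_iceboot():
        return "step 3: DOM dropped out after FPGA load"
    if dom.loaded_fpga() != "domapp":
        return "step 4: domapp FPGA not loaded"
    dom.write_loop_number(loop_number)
    dom.softboot()
    if not dom.in_iceboot():
        return "step 7: DOM dropped out after softboot"
    if dom.loaded_fpga() != "stf":
        return "step 8: stf FPGA not loaded"
    return None


def run_test(dom, loops):
    """Step 9: repeat the cycle; stop at the first failure."""
    for n in range(loops):
        failure = run_cycle(dom, n)
        if failure is not None:
            return n, failure
    return None


class FakeDom:
    """Stand-in DOM that can be told to drop out on a given loop number."""

    def __init__(self, drop_on_loop=None):
        self.fpga = "stf"
        self.alive = True
        self.loop = None
        self.drop_on_loop = drop_on_loop

    def in_iceboot(self):
        return self.alive

    def load_fpga(self, name):
        self.fpga = name

    def loaded_fpga(self):
        return self.fpga

    def write_loop_number(self, n):
        self.loop = n

    def softboot(self):
        if self.loop == self.drop_on_loop:
            self.alive = False
        self.fpga = "stf"  # assumed: softboot brings back the stf FPGA image


print(run_test(FakeDom(), loops=5))
print(run_test(FakeDom(drop_on_loop=2), loops=5))
```

Writing the loop number to the DOM before each softboot (step 5) is what lets a later FIFO dump be matched to the iteration that failed.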

Rationale: JohnJ identified 4 types of softboot issues. Type 1 (garbage in the DOR RX buffer) is interesting. Type 2 could be the same but on the DOM side. Analyzing the fifo data can give us a clue as to what is going on. Unfortunately for debugging, domapp uses binary messages, which are close to impossible to analyze. This test uses only ASCII messages.

The test: The test ran on string 29 and 56 and used dor-driver-02.11.02-0.i386.rpm and dor-driver-02.11.04-0.i386.rpm

The fifo dumps are attached in softboot-sps-test-071122007.tgz

dor-driver-02.11.03-0.i386.rpm:
string29 fifo02a-1.filtered.txt
string29 fifo02a-2.filtered.txt

dor-driver-02.11.02-0.i386.rpm:
string29 fifo02a-3.filtered.txt
string56 fifo10a.filtered.txt

So far I could not find the ASCII messages in the fifo dumps.

dglo commented 3 years ago

[jacobsen on 2007-11-26 17:36:46] see email from thorsten 5/11/07

Note added 10-22-07: this catchall issue covers the "dropped DOMs" in general at a low level. The problem has been seen at a variety of points, starting from DOM hardware timeouts during loading of domapp.sbi (build 486), and continuing through later stages of configuration (see previous notes on "type 1" and "type 2" dropped DOMs). Various workarounds in pDAQ have addressed this issue by retrying operations, with varying success.