epics-modules / mrfioc2

EPICS driver for Micro Research Finland event timing system devices
http://epics-modules.github.io/mrfioc2/

Data buffer Rx checksum error #19

Closed daykin closed 7 months ago

daykin commented 5 years ago

First documented here: https://jira.frib.msu.edu/browse/GTS-149

Upon losing event connection abruptly, our IOC shell gets flooded with the following message:

    Data buffer Rx checksum error
    Data buffer Rx checksum error
    Data buffer Rx checksum error
    ...

I attached gdb to the process, and the backtrace when this occurs is as follows:

Thread 9 "cbHigh" hit Breakpoint 1, 0x0000555c763ea9f0 in defaulterr(void*, int, unsigned int, unsigned char const*) ()
(gdb) backtrace
#0  0x0000555c763ea9f0 in defaulterr(void*, int, unsigned int, unsigned char const*) ()
#1  0x0000555c763e9947 in mrmBufRx::drainbuf(callbackPvt*) ()
#2  0x0000555c7646802c in callbackTask ()
#3  0x0000555c764db060 in start_routine ()
#4  0x00007fbbfedf24a4 in start_thread (arg=0x7fbbfcaa5700) at pthread_create.c:456
#5  0x00007fbbfdc3ed0f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:97

I looked into frame #0, defaulterr (bufrxmgr.cpp:55). The message is printed for epicsStatus 2:

    case 2:
        errlogPrintf("Data buffer Rx checksum error\n");
        break;

In frame #1, mrmBufRx::drainbuf (drvemRxBuf.cpp), I see the following on lines 91-92:

    if (sts&DataBufCtrl_sumerr) {
        self.haderror(2);

So I then looked into what this means: according to evrRegMap.h, this is bit 13 of a register called DataBufCtrl. But now I have hit a dead end: the MRF-supplied manual doesn't document this register any further than its address, so I have no clue what is causing this flood of output.
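For readers following along, the check in drainbuf is an ordinary bit-mask test on the status word. A minimal sketch of that test, assuming "bit 13" is counted from the least significant bit (so the mask would be 0x2000); the names `kSumErrMask` and `hasChecksumError` are hypothetical and are not taken from mrfioc2, where the real constant is defined in evrRegMap.h:

```cpp
#include <cstdint>

// Assumption: bit 13 is counted from the least significant bit (LSB = bit 0).
// The real constant is DataBufCtrl_sumerr in evrRegMap.h; this value is illustrative.
constexpr std::uint32_t kSumErrMask = 1u << 13;  // 0x2000 under that assumption

// Hypothetical helper: true when the checksum-error flag is set in a status word
// that was read from the DataBufCtrl register.
bool hasChecksumError(std::uint32_t sts) {
    return (sts & kSumErrMask) != 0;
}
```

If the flag stays set in the register, every pass through the check reports the error again, which would be consistent with the repeated message.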

All we know is that it occurs sometimes, but not every time, when we pull the EVG plug or the EVG goes down abruptly. If we shut down the EVG IOC gracefully, we do not see the problem.

mdavidsaver commented 5 years ago

I've never seen this before. I guess you interrupted a message? This bit is probably write-1-to-clear. You could try adding the following to the conditional:

BITSET(NAT,32,evr->base, DataBufCtrl, DataBufCtrl_sumerr);

mdavidsaver commented 5 years ago

cf http://mrf.fi/fw/DCManual-170209.pdf

DBCS    Data Buffer Checksum Error (read-only)
        Flag is cleared by writing ‘1’ to DBRX or DBRDY or disabling data buffer
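
The manual excerpt suggests why writing 1 to the error bit itself might not help: the flag is read-only and clears only as a side effect of writing 1 to DBRX or DBRDY, or of disabling the data buffer. The following is a toy model of that behaviour, not the real EVR or the mrfioc2 driver. The type `FakeDataBufCtrl`, the function `countErrorReports`, and the bit positions are all hypothetical, and the polling loop merely stands in for repeated callbacks:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit positions, for illustration only.
constexpr std::uint32_t kRxEnable = 1u << 15;  // stands in for DBRX
constexpr std::uint32_t kSumErr   = 1u << 13;  // stands in for DBCS

// Toy register: the error flag is read-only and is cleared only as a
// side effect of writing 1 to the receive-enable bit.
struct FakeDataBufCtrl {
    std::uint32_t value = 0;

    std::uint32_t read() const { return value; }

    void write(std::uint32_t v) {
        if (v & kRxEnable) {
            value &= ~kSumErr;  // re-arming reception clears the flag
        }
        // Writing 1 to kSumErr itself has no effect (read-only flag).
    }
};

// Counts how many polls would report a checksum error.
// When rearm is true, reception is re-armed after each report.
int countErrorReports(FakeDataBufCtrl& reg, int polls, bool rearm) {
    assert(polls >= 0);
    int reports = 0;
    for (int i = 0; i < polls; ++i) {
        if (reg.read() & kSumErr) {
            ++reports;
            if (rearm) {
                reg.write(kRxEnable);
            }
        }
    }
    return reports;
}
```

In this model a handler that never re-arms reports the error on every pass, while one that re-arms reports it once. Whether the real hardware and driver behave this way is not established by the thread.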

mdavidsaver commented 3 years ago

Any update on this? Have there been further occurrences?

jerzyjamroz commented 8 months ago

Is this still valid?

daykin commented 7 months ago

I have not seen this for a long time, since we're now in an operational state and facilities is no longer doing "Let's see what happens if we kill the power" tests.

I'm not sure yet why I couldn't reproduce it in our test environment with the PR above, since Jukka et al. are quite certain it won't affect anything. Maybe I was just lucky?

If it happens again, I'll let you know.