GSI-CS-CO / bel_projects

GSI Timing Gateware and Tools

VETAR with "balloon" release fails at GSI_TM_LATCH_V2 on CES RIO4 #25

Closed: jamusch closed this issue 7 years ago

jamusch commented 7 years ago

1) Description of environment:


Hardware:

CES RIO4 8070/72 VME with VETAR board


System:

R4L-3 adamczew > uname -a
Linux R4L-3 2.6.33.7-shl-3.3-up #1 PREEMPT Wed Oct 5 08:23:43 CEST 2016 ppc ppc ppc GNU/Linux (Sugarhat (3.3.10))


Wishbone info:

R4L-3 adamczew > eb-ls dev/wbm0
BusPath  VendorID         Product   BaseAddress(Hex)  Description
1        ---
2        ---
3        ---
4        ---
5        ---
6        ---
7        ---
8        0000000000000651:8752bf45  7ffffff0          ECA_UNIT:EVENTS_IN
9        0000000000000651:eef0b198  4000000           WB4-Bridge-GSI
9.1      ---
9.2      0000000000000651:2d39fa8b  4000800           GSI:BUILD_ID ROM
9.3      0000000000000651:b6232cd3  4000000           GSI:WATCHDOG_MUTEX
9.4      0000000000000651:fab0bdd8  4001000           GSI:MSI_MAILBOX
9.5      0000000000000651:5cf12a1c  5000000           SPI-FLASH-16M-MMAP
9.6      0000000000000651:3a362063  4000010           FPGA_RESET
9.7      0000000000000651:eef0b198  4040000           WB4-Bridge-GSI
9.7.1    000000000000ce42:66cfeb52  4040000           WB4-BlockRAM
9.7.2    0000000000000651:eef0b198  4060000           WB4-Bridge-GSI
9.7.2.1  000000000000ce42:ab28633a  4060000           WR-Mini-NIC
9.7.2.2  000000000000ce42:650c2d4f  4060100           WR-Endpoint
9.7.2.3  000000000000ce42:65158dc0  4060200           WR-Soft-PLL
9.7.2.4  000000000000ce42:de0d8ced  4060300           WR-PPS-Generator
9.7.2.5  000000000000ce42:ff07fc47  4060400           WR-Periph-Syscon
9.7.2.6  000000000000ce42:e2d13d04  4060500           WR-Periph-UART
9.7.2.7  000000000000ce42:779c5443  4060600           WR-Periph-1Wire
9.7.2.8  0000000000000651:68202b22  4060700           Etherbone-Config
9.8      0000000000000651:00000815  6000000           Etherbone_Master
9.9      0000000000000651:10051981  4000100           GSI_TM_LATCH_V2
9.10     0000000000000651:b2afc251  4000200           ECA_UNIT:CONTROL
9.11     0000000000000651:d5a3faea  4000040           ECA_UNIT:QUEUE
9.12     0000000000000651:7c82afbc  4000020           ECA_UNIT:TLU
9.13     0000000000000651:18415778  4000080           ECA ACTCHN WBM
9.14     0000000000000651:d5a3faea  40000c0           ECA_UNIT:QUEUE
9.15     0000000000000651:5f3eaf43  4000300           wb_serdes_clk_gen
9.16     0000000000000651:10c05791  4010000           IO_CONTROL
9.17     0000000000000651:10041000  4080000           LM32-CB-Cluster
9.17.1   0000000000000651:10040086  4080000           Cluster-Info-ROM
9.17.2   ---
9.17.3   0000000000000651:54111351  40a0000           LM32-RAM-User
9.17.4   ---
9.18     0000000000000651:b77a5045  4000400           SERIAL-LCD-DISPLAY
9.19     ---
9.20     ---
9.21     ---
9.22     ---
9.23     ---
9.24     ---
9.25     ---
9.26     0000000000000651:22ffee84  4020000           INFO_VME
9.27     ---
9.28     ---
9.29     ---

R4L-3 adamczew > eb-info dev/wbm0
Project     : vetar2a
Platform    : vetar2a +vetar1db2a +wrex1
FPGA model  : Arria II GX (EP2AGX125EF29C5)
Source info : balloon-1319
Build type  : Balloon_release
Build date  : Fri Feb 24 04:51:36 CET 2017
Prepared by : Jenkins Nightly Build csco-tg@gsi.de
Prepared on : tsl002.acc.gsi.de
OS version  : Debian GNU/Linux 8.6 (jessie), kernel 3.16.0-4-amd64
Quartus     : Version 16.0.0 Build 211 04/27/2016 SJ Standard Edition

dec8dd0 ftm-ctl: added solid generic status function for FESA class
3dea586 ftm: bumped version numbers
e67587c Merge branch 'balloon' of github.com:GSI-CS-CO/bel_projects into balloon
3c7ad89 build:change the FPGA version in ArriaV devices
0c9fa34 saftlib: updated to v1.0.9


Kernel module:

R4L-3 adamczew > cat /sys/class/vetar/vetar0/codeversion
*** This is VETAR2 xpc/VME driver for CES RIO4 Linux, version 1.1.1 build on Mar 13 2017 at 09:51:22
module authors: Joern Adamczewski-Musch, Cesar Prados, GSI Darmstadt (www.gsi.de)
compiled settings:
VME windows mapped with new CesXpcBridge lib.
wb data and control windows use CesXpcBridge_MasterMap64.
VETAR Interrupts are enabled.

See the code at https://subversion.gsi.de/dabc/drivers/vetar/ (note: this software could become part of the official git csco release later).

The kernel module was developed following the submodule ip_core/fpga-config-space/vme-wb, as available at git://ohwr.org/hdl-core-lib/fpga-config-space.git. That upstream code has not changed since 2015.

Our vetar kernel module has worked without any problems with the VETAR firmware of the "asterisk" branch.


2) Problem description:

In principle, wb_read and wb_write work. The usual tools such as eb-ls and eb-info show no problems (see output above).

When the MBS DAQ user-space program is started, access to the FIFO of the GSI_TM_LATCH_V2 unit fails. Writing to Wishbone register 0x400015c (which corresponds to GSI_TM_LATCH_FIFO_POP in gsi_tm_latch.h) fails with two different symptoms, depending on the VME mapping in use:


A) CES vtrans mapping:

A write access to this register causes a kernel oops. In the following debug output, all calls of vetar_wb_write are dumped to dmesg (dmesg_r4l-3_vtrans.txt):

Mar 13 08:27:15 r4l-3 kernel: vetar driver init...
Mar 13 08:27:15 r4l-3 kernel: VETAR vme driver starts probe for index 0
Mar 13 08:27:15 r4l-3 kernel: Use parameters address 0x0, slot number 0x5, lun 0x0 vector 0x60
Mar 13 08:27:15 r4l-3 kernel: Found Vetar vendor ID: 0x00080031
Mar 13 08:27:15 r4l-3 kernel: vetar_probe_vme assigned irq vector=0x60 level=0x2
Mar 13 08:27:15 r4l-3 kernel: VETAR VME windows are mapped with CesXpcBridge lib.
Mar 13 08:27:16 r4l-3 kernel: wb data and control windows use VTRANS mapping
Mar 13 08:27:16 r4l-3 kernel: VETAR device: vetar0 has been added.
Mar 13 08:29:02 r4l-3 kernel: before RESET_SEM
Mar 13 08:29:02 r4l-3 kernel: after RESET_SEM
Mar 13 08:29:03 r4l-3 kernel: ** Vetar_WB: vetar_wb_write..
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: WRITE(0x3) => 0x4000158
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000110
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 08:29:03 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 08:29:05 r4l-3 kernel: Machine check in kernel mode.
Mar 13 08:29:05 r4l-3 kernel: Caused by (from SRR1=149030): Transfer error ack signal
Mar 13 08:29:05 r4l-3 kernel: Oops: Machine check, sig: 7 [#1]
Mar 13 08:29:05 r4l-3 kernel: PREEMPT CES RIO3/4
Mar 13 08:29:05 r4l-3 kernel: Modules linked in: vetar wishbone trigmod cesSemLib(P) cesDmaIoctl cesDma(P) xpc cesXpcLib(P) cesOsApi(P) cesBus(P) [last unloaded: wishbone]
Mar 13 08:29:05 r4l-3 kernel: NIP: e01384c4 LR: e01384b4 CTR: 00000000
Mar 13 08:29:05 r4l-3 kernel: REGS: de40dd20 TRAP: 0200 Tainted: P (2.6.33.7-shl-3.3-up)
Mar 13 08:29:05 r4l-3 kernel: MSR: 00149030 <EE,ME,IR,DR> CR: 24244488 XER: 00000000
Mar 13 08:29:05 r4l-3 kernel: TASK = de454cd0[1586] 'm_read_meb' THREAD: de40c000
Mar 13 08:29:05 r4l-3 kernel: GPR00: 0000015c de40ddd0 de454cd0 0000002c de454070 a43f491c 00000000 00000000
Mar 13 08:29:05 r4l-3 kernel: GPR08: 00000000 e0400000 24000482 10000000 ffffffff 100778b4 0000003c de50802f
Mar 13 08:29:05 r4l-3 kernel: GPR16: de50802e de50802d 00000014 00000020 de508000 0000000f e013905c 00000004
Mar 13 08:29:05 r4l-3 kernel: GPR24: de3f2e94 000000a0 00000001 0000000f 0400015c 0400015c de3f2e00 04000000
Mar 13 08:29:05 r4l-3 kernel: NIP [e01384c4] vetar_wb_write+0x78/0x94 [vetar]
Mar 13 08:29:05 r4l-3 kernel: LR [e01384b4] vetar_wb_write+0x68/0x94 [vetar]
Mar 13 08:29:05 r4l-3 kernel: Call Trace:
Mar 13 08:29:05 r4l-3 kernel: [de40ddd0] [e01384b4] vetar_wb_write+0x68/0x94 [vetar] (unreliable)
Mar 13 08:29:05 r4l-3 kernel: [de40ddf0] [e00bce20] char_master_aio_write+0x36c/0x4bc [wishbone]
Mar 13 08:29:05 r4l-3 kernel: [de40de50] [c008a6dc] do_sync_write+0xa8/0x11c
Mar 13 08:29:05 r4l-3 kernel: [de40def0] [c008b0e0] vfs_write+0xb4/0x158
Mar 13 08:29:05 r4l-3 kernel: [de40df10] [c008b73c] sys_write+0x4c/0x90
Mar 13 08:29:05 r4l-3 kernel: [de40df40] [c00136cc] ret_from_syscall+0x0/0x38
...

Evidently, writes to the other registers 0x4000158 (GSI_TM_LATCH_CH_SELECT), 0x4000104 (GSI_TM_LATCH_FIFO_CLEAR) and 0x4000110 (GSI_TM_LATCH_TRIG_ARMSET) do not fail, but the write to 0x400015c (GSI_TM_LATCH_FIFO_POP) crashes the bus.
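All four addresses fall inside the GSI_TM_LATCH_V2 block that eb-ls lists at base address 0x4000100 (bus path 9.9). The failing FIFO_POP write therefore targets offset 0x5c, directly after CH_SELECT at 0x58, while FIFO_CLEAR and TRIG_ARMSET sit at 0x04 and 0x10. The C sketch below only restates that arithmetic; the offsets are derived from the logged addresses and the eb-ls base rather than copied from gsi_tm_latch.h, and the table and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Base address of GSI_TM_LATCH_V2 as listed by eb-ls (bus path 9.9). */
#define TM_LATCH_BASE 0x4000100u

/* Absolute addresses observed in the dmesg traces, paired with the
 * register names given for them in this issue. */
struct tm_latch_reg {
    const char *name;
    uint32_t    addr;
};

static const struct tm_latch_reg observed_regs[] = {
    { "GSI_TM_LATCH_FIFO_CLEAR",  0x4000104u },
    { "GSI_TM_LATCH_TRIG_ARMSET", 0x4000110u },
    { "GSI_TM_LATCH_CH_SELECT",   0x4000158u },
    { "GSI_TM_LATCH_FIFO_POP",    0x400015cu },
};

/* Offset of a named register relative to the TM_LATCH base,
 * or -1 if the name is not in the table above. */
static long tm_latch_offset(const char *name)
{
    for (size_t i = 0; i < sizeof observed_regs / sizeof observed_regs[0]; i++) {
        if (strcmp(observed_regs[i].name, name) == 0) {
            /* every observed address must lie above the block base */
            assert(observed_regs[i].addr >= TM_LATCH_BASE);
            return (long)(observed_regs[i].addr - TM_LATCH_BASE);
        }
    }
    return -1;
}
```

For example, tm_latch_offset("GSI_TM_LATCH_FIFO_POP") yields 0x5c.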

The register accesses seen in dmesg reflect the order of the eb_cycle_write calls in the MBS user code f_user.c (in the functions f_wr_init() and f_user_readout()).
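To compare the write traces of the two VME mappings line by line, the value/address pairs can be extracted from the Vetar_WB debug output. A minimal sketch, assuming the "Vetar_WB: WRITE(0x...) => 0x..." line format shown in the dmesg excerpts; parse_vetar_write is a hypothetical helper and not part of the driver or of etherbone.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Extract value and address from one "Vetar_WB: WRITE(0x..) => 0x.." line.
 * Returns 1 on success, 0 if the line does not contain such a record. */
static int parse_vetar_write(const char *line, unsigned long *value, unsigned long *addr)
{
    assert(line != NULL && value != NULL && addr != NULL);
    const char *p = strstr(line, "Vetar_WB: WRITE(");
    if (p == NULL)
        return 0;
    /* %lx accepts an optional 0x prefix, matching the driver's output. */
    return sscanf(p, "Vetar_WB: WRITE(%lx) => %lx", value, addr) == 2;
}
```

Lines such as "Vetar_WB: vetar_wb_write.." carry no write record and return 0, so a whole dmesg capture can be filtered by passing each line through the function.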


The user output of the MBS application simply stops after the kernel crash:

mbslog_r4l-3_vtrans.txt

mbs> @startup
-R4L-3 :util :task m_util started
-R4L-3 :util :setup file setup.usf successfully loaded
-R4L-3 :util :trigger module set up as MASTER, crate nr: 0
-R4L-3 :util :disabled interrupt
-R4L-3 :dispatch :-> 'sleep 1 '
-R4L-3 :read_meb :Pipe type 2 or 4: virtual mapping
-R4L-3 :transport :task m_transport started
-R4L-3 :collector :Pipe type 2 or 4: virtual mapping
-R4L-3 :read_meb :task m_read_meb started
-R4L-3 :stream_serv:task m_stream_serv started
-R4L-3 :collector :task m_collector started
-R4L-3 :dispatch :-> 'sleep 1 ' finished
-R4L-3 :transport :starting server in inclusive mode
mbs>
-R4L-3 :util :start acquisition
-R4L-3 :transport :waiting for client (port 6000)
-R4L-3 :read_meb :found trig type 14 == start acquisition
-R4L-3 :read_meb :
-R4L-3 :read_meb :selected White Rabbit TLU FIFO channel number: 3
-R4L-3 :read_meb :size of White Rabbit TLU FIFO: 256
-R4L-3 :read_meb :
-R4L-3 :collector :acquisition running


B) Mapping with conventional CesXpcBridge_MasterMap64:

No kernel oops occurs, and according to dmesg the accesses to the registers mentioned above seem to work:

dmesg_r4l-3_xpclib.txt

Mar 13 09:51:44 r4l-3 kernel: vetar driver init...
Mar 13 09:51:44 r4l-3 kernel: VETAR vme driver starts probe for index 0
Mar 13 09:51:44 r4l-3 kernel: Use parameters address 0x0, slot number 0x5, lun 0x0 vector 0x60
Mar 13 09:51:44 r4l-3 kernel: Found Vetar vendor ID: 0x00080031
Mar 13 09:51:44 r4l-3 kernel: vetar_probe_vme assigned irq vector=0x60 level=0x2
Mar 13 09:51:44 r4l-3 kernel: VETAR VME windows are mapped with CesXpcBridge lib.
Mar 13 09:51:44 r4l-3 kernel: wb data and control windows use CesXpcBridge_MasterMap64.
Mar 13 09:51:44 r4l-3 kernel: VETAR device: vetar0 has been added.
Mar 13 09:52:16 r4l-3 kernel: before RESET_SEM
Mar 13 09:52:16 r4l-3 kernel: after RESET_SEM
Mar 13 09:52:17 r4l-3 kernel: ** Vetar_WB: vetar_wb_write..
Mar 13 09:52:17 r4l-3 kernel: Vetar_WB: WRITE(0x3) => 0x4000158
Mar 13 09:52:17 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:17 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:17 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:17 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000110
Mar 13 09:52:18 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:18 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 09:52:18 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:18 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:19 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:19 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 09:52:19 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:19 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:20 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:20 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 09:52:20 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:20 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:21 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:21 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 09:52:22 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:22 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:23 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:23 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 09:52:23 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:23 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:24 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:24 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c
Mar 13 09:52:24 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:24 r4l-3 kernel: Vetar_WB: WRITE(0xffffffff) => 0x4000104
Mar 13 09:52:25 r4l-3 kernel: Vetar_WB: vetar_wb_write..
Mar 13 09:52:25 r4l-3 kernel: Vetar_WB: WRITE(0xf) => 0x400015c

NOTE: the continuous writing to these initialization registers is caused by a reinit routine that the MBS application calls repeatedly after detecting an error (reset of the TLU FIFO etc.).

The MBS user output reports inconsistencies when reading back values via Wishbone:

mbslog_r4l-3_xpclib.txt

-R4L-3 :util :start acquisition
-R4L-3 :transport :waiting for client (port 6000)
-R4L-3 :read_meb :found trig type 14 == start acquisition
-R4L-3 :read_meb :
-R4L-3 :read_meb :selected White Rabbit TLU FIFO channel number: 3
-R4L-3 :read_meb :size of White Rabbit TLU FIFO: 256
-R4L-3 :read_meb :
-R4L-3 :read_meb :ERROR>> TLU fifo 3 is empty before time stamp read, stat: 0x200
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (1) (0xbad00bad)
-R4L-3 :read_meb :reset TLU fifo of white rabbit time stamp module vetar
-R4L-3 :collector :acquisition running
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (2) (0xbad00bad)
-R4L-3 :read_meb :
-R4L-3 :read_meb :ERROR>> TLU fifo 3 is empty before time stamp read, stat: 0x200
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (1) (0xbad00bad)
-R4L-3 :read_meb :reset TLU fifo of white rabbit time stamp module vetar
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (2) (0xbad00bad)
-R4L-3 :read_meb :
-R4L-3 :read_meb :ERROR>> TLU fifo 3 is empty before time stamp read, stat: 0x200
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (1) (0xbad00bad)
-R4L-3 :read_meb :reset TLU fifo of white rabbit time stamp module vetar
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (2) (0xbad00bad)
-R4L-3 :read_meb :
-R4L-3 :read_meb :ERROR>> TLU fifo 3 is empty before time stamp read, stat: 0x200
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (1) (0xbad00bad)
-R4L-3 :read_meb :reset TLU fifo of white rabbit time stamp module vetar
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (2) (0xbad00bad)
-R4L-3 :read_meb :
-R4L-3 :read_meb :ERROR>> TLU fifo 3 is empty before time stamp read, stat: 0x200
-R4L-3 :read_meb :white rabbit failure: invalidate current trigger/event (1) (0xbad00bad)
-R4L-3 :read_meb :reset TLU fifo of white rabbit time stamp module vetar
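Read together, these log lines suggest that for each trigger the readout checks whether the selected TLU FIFO holds an entry before popping a timestamp; if it is empty, the event is marked with 0xbad00bad and the FIFO is reset (in the case B dmesg trace this appears as the repeated write of 0xffffffff to 0x4000104, GSI_TM_LATCH_FIFO_CLEAR). The sketch below models only that control flow against a software FIFO. It is not the code in f_user.c; the structure and function names are hypothetical, and only the FIFO depth of 256 and the marker value come from the log.

```c
#include <assert.h>
#include <stdint.h>

#define TLU_FIFO_SIZE 256u        /* "size of White Rabbit TLU FIFO: 256" */
#define WR_BAD_STAMP  0xbad00badu /* marker printed by the MBS readout    */

/* Software stand-in for one TLU FIFO channel; not the real hardware. */
struct tlu_fifo_sim {
    uint32_t words[TLU_FIFO_SIZE];
    uint32_t head;     /* index of the oldest entry    */
    uint32_t count;    /* current fill level           */
    unsigned resets;   /* how often the FIFO was reset */
};

/* Hypothetical counterpart of the FIFO_CLEAR write in the reinit path. */
static void tlu_fifo_reset(struct tlu_fifo_sim *f)
{
    f->head = 0;
    f->count = 0;
    f->resets++;
}

/* Called once per DAQ trigger: check the fill level before popping,
 * and on an empty FIFO invalidate the event and reset the FIFO. */
static uint32_t read_trigger_stamp(struct tlu_fifo_sim *f)
{
    assert(f->count <= TLU_FIFO_SIZE);
    if (f->count == 0) {
        tlu_fifo_reset(f);
        return WR_BAD_STAMP;
    }
    uint32_t w = f->words[f->head];
    f->head = (f->head + 1u) % TLU_FIFO_SIZE;
    f->count--;
    return w;
}
```

The simulation returns the marker once per empty read; the real log additionally shows a second invalidation, tagged (2), after each reset, which this sketch does not try to model.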

Again, reading and writing via eb_cycle works in principle, but the FIFO of the GSI_TM_LATCH_V2 unit does not. This device is currently crucial for the DAQ systems of several NUSTAR experiments; it should work properly, and we ask that it continue to be maintained.

With the previous asterisk VETAR FPGA firmware, the above errors never occurred with the same setup of hardware, kernel module and software.

jamusch commented 7 years ago

Update on the VETAR VME kernel module: to cross-check the issue, our VME driver (see https://subversion.gsi.de/dabc/drivers/vetar/) has been adjusted to the latest modifications in git tag d5187fef4d33b62d77e1e087d8a0f148150bc5c8o (wishbone module) and in the branch vme-drv (tag 36fa80db7ca2e29767e81219330a36b09d33a652, not yet part of balloon?).

This corrects the fact that we previously did not provide the MSI functionality required by the ECA mechanism of saftlib. Our driver is now largely consistent with the other VME driver software distributed with the balloon release.

However, the MBS DAQ timestamp readout uses neither ECA/MSI nor interrupts, but the time latch unit GSI_TM_LATCH_V2. The driver therefore registers the interrupt handler (as it did before), but the handler is never invoked within our application (which was also the case with the asterisk firmware on the VETAR board). For latency reasons, our timestamp FIFO readout is always invoked after the DAQ trigger interrupt of the MBS trigger module and is not intended to use an additional interrupt from the VETAR board.

These improvements to our kernel module therefore do not change anything about the problem described above.

miree commented 7 years ago

The issue was caused by my flashing the balloon release of the normal vetar2a gateware instead of the vetar2a-ee-butis gateware (which has a different IO mapping). The vetar2a-ee-butis gateware was missing from the balloon release. It is now available in the branch balloon_vetar_ee.
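Since the root cause was the wrong gateware on the board, checking the Project field of the eb-info output (which read vetar2a in the report above) could catch such a mix-up before the DAQ starts. A minimal sketch operating on captured eb-info text; the function names are hypothetical, and the Project string that the vetar2a-ee-butis gateware reports is an assumption here, so the expected name should be taken from eb-info on a correctly flashed board.

```c
#include <assert.h>
#include <string.h>

/* Copy the value of the "Project" line from eb-info output into out.
 * Returns 1 if found and it fits into out, 0 otherwise. */
static int ebinfo_project(const char *ebinfo, char *out, size_t outlen)
{
    const char *p = strstr(ebinfo, "Project");
    if (p == NULL || outlen == 0)
        return 0;
    p = strchr(p, ':');
    if (p == NULL)
        return 0;
    p++;
    while (*p == ' ' || *p == '\t')   /* skip padding after the colon */
        p++;
    size_t n = strcspn(p, " \t\r\n"); /* value ends at whitespace */
    if (n == 0 || n >= outlen)
        return 0;
    memcpy(out, p, n);
    out[n] = '\0';
    return 1;
}

/* 1 if the flashed gateware's Project field equals the expected name. */
static int gateware_matches(const char *ebinfo, const char *expected)
{
    char project[64];
    assert(ebinfo != NULL && expected != NULL);
    return ebinfo_project(ebinfo, project, sizeof project)
           && strcmp(project, expected) == 0;
}
```

For the board in this report, gateware_matches(info, "vetar2a") is true, whereas a comparison against the assumed ee-butis project name is false.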