RNO-G / mattak

RNO-G dataformats
1 stars 1 forks source link

infinite trigger times in 2022 runs (again...) #3

Open shallmann opened 2 years ago

shallmann commented 2 years ago

It seems that the infinite trigger time issue is not solved completely. @jakobhenrichs has run on the inbox data and found multiple runs with infinite trigger times.

I think the issue is an uncatched division zero case here when the sysclk last / lastlast are identical: https://github.com/RNO-G/mattak/blob/a839d9941a48a612836aec915121e27876f5800c/src/Header.cc#L32

One example is station 22, run 1708. When I run my old python script that was generated to investigate the original issue (https://github.com/RNO-G/rnog-data-analysis-and-issues/tree/main/rno_g_data_tools/offline_fixes), one can see that last == last_last is causing the issue...

RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (2112347478) / last last: (2112347478), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661800121.1246378
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (2742564739) / last last: (2742564739), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661800385.3104992
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (102996050) / last last: (102996050), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661800874.41968
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (2328180394) / last last: (2328180394), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661801111.08667
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (853364404) / last last: (853364404), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661801311.0865848
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (1003729162) / last last: (1003729162), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661801742.270493
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (2403730612) / last last: (2403730612), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661801756.4116752
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (3889348762) / last last: (3889348762), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661802501.0990398
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (3709493287) / last last: (3709493287), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661802671.1181803
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (1524597371) / last last: (1524597371), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661802778.8017113
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (1929632720) / last last: (1929632720), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661802825.6478596
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (2029632763) / last last: (2029632763), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661802826.0374522
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (3234669413) / last last: (3234669413), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661802881.0861583
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (4190071151) / last last: (4190071151), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661803363.1105042
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (1530360376) / last last: (1530360376), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661803680.4282646
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (2095836821) / last last: (2095836821), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661804244.4326854
RNOG_ROOT_OFFLINE_FIX - ERROR - Same sysclk last: (1341157908) / last last: (1341157908), assuming 100MHz
RNOG_ROOT_OFFLINE_FIX - ERROR - Problematic time: inf -> 1661804623.1009407
shallmann commented 2 years ago

@cozzyd should we add something similar to the python code (which assumes 100 MHz as the first best guess if sysclk last is equal to lastlast)?

    if sysclk_diff_last == 0:
        logger.error("Same sysclk last: (%s) / last last: (%s), assuming 100MHz", sysclk_last_pps, sysclk_last_last_pps)
        sysclk_diff_last = 1e8
cozzyd commented 2 years ago

hmm so the PPS is missed sometimes? That's... strange. Is there a pattern to when this happens?

shallmann commented 2 years ago

... we didn't check. You mean if it is at the beginning of the run? Seems to happen way too often to happen by chance :D But yes, missing the PPS would be an explanation... I'll have a look

shallmann commented 2 years ago

@cozzyd so it seems there are two and a half failure modes,

  1. all three sysclks are zero for the entire run. From what I've seen these are short runs, and most except the station/run number are zero / nan. These runs seem to be garbage to me.
  2. the sysclk last and sysclk last last are identical. The sysclk seems ok. This is independent of trigger flag and spread evenly over large portions of a run (starting somewhere during the run, until the end of the run) 2.5: the sysclk last and sysclk last last are identical for a single event somewhere in the run.

I attach a full list of runs. I hope that helps in narrowing down the issue. broken_timing.txt

cozzyd commented 6 months ago

Here is an example of this problem, in run826 of station 13:

event: 1158, tinfo:0x0006, evstatus:0x0000, pps_num:2345, sysclk: 2592643122, sysclk_last_pps: 2571972937, sysclk_last_last_pps:2471972888, unrolled_sysclk: 11182577714, unrolled_sysclk_div100MHz: 111.825777, readout_time:1662864755.179806769
event: 1159, tinfo:0x0022, evstatus:0x0000, pps_num:2346, sysclk: 2685410046, sysclk_last_pps: 2671972990, sysclk_last_last_pps:2571972937, unrolled_sysclk: 11275344638, unrolled_sysclk_div100MHz: 112.753446, readout_time:1662864756.107543892
event: 1160, tinfo:0x0004, evstatus:0x0000, pps_num:2347, sysclk: 2843273972, sysclk_last_pps: 2671972990, sysclk_last_last_pps:2671972990, unrolled_sysclk: 11433208564, unrolled_sysclk_div100MHz: 114.332086, readout_time:1662864757.686054707
event: 1161, tinfo:0x0004, evstatus:0x0000, pps_num:2349, sysclk: 2977067567, sysclk_last_pps: 2771973030, sysclk_last_last_pps:2871973077, unrolled_sysclk: 11567002159, unrolled_sysclk_div100MHz: 115.670022, readout_time:1662864759.023987158
event: 1162, tinfo:0x0004, evstatus:0x0000, pps_num:2351, sysclk: 3173595549, sysclk_last_pps: 2971973137, sysclk_last_last_pps:3071973179, unrolled_sysclk: 11763530141, unrolled_sysclk_div100MHz: 117.635301, readout_time:1662864760.988979688
event: 1163, tinfo:0x0004, evstatus:0x0000, pps_num:2352, sysclk: 3281909695, sysclk_last_pps: 3171973238, sysclk_last_last_pps:3171973238, unrolled_sysclk: 11871844287, unrolled_sysclk_div100MHz: 118.718443, readout_time:1662864762.072229374
event: 1164, tinfo:0x0008, evstatus:0x0000, pps_num:2354, sysclk: 3566768491, sysclk_last_pps: 3271973321, sysclk_last_last_pps:3371973405, unrolled_sysclk: 12156703083, unrolled_sysclk_div100MHz: 121.567031, readout_time:1662864764.920911695
event: 1165, tinfo:0x0004, evstatus:0x0000, pps_num:2356, sysclk: 3696805044, sysclk_last_pps: 3471973490, sysclk_last_last_pps:3571973565, unrolled_sysclk: 12286739636, unrolled_sysclk_div100MHz: 122.867396, readout_time:1662864766.221361329
event: 1166, tinfo:0x0004, evstatus:0x0000, pps_num:2361, sysclk: 4266151311, sysclk_last_pps: 3671973667, sysclk_last_last_pps:4071973889, unrolled_sysclk: 12856085903, unrolled_sysclk_div100MHz: 128.560859, readout_time:1662864771.914667028
event: 1167, tinfo:0x0008, evstatus:0x0006, pps_num:2364, sysclk: 269261833, sysclk_last_pps: 4171973954, sysclk_last_last_pps:77006830, unrolled_sysclk: 13154163721, unrolled_sysclk_div100MHz: 131.541637, readout_time:1662864774.895423449
event: 1168, tinfo:0x0004, evstatus:0x0002, pps_num:2365, sysclk: 303064437, sysclk_last_pps: 177006937, sysclk_last_last_pps:177006937, unrolled_sysclk: 13187966325, unrolled_sysclk_div100MHz: 131.879663, readout_time:1662864775.233474359
event: 1169, tinfo:0x0004, evstatus:0x0002, pps_num:2366, sysclk: 408166875, sysclk_last_pps: 277007038, sysclk_last_last_pps:277007038, unrolled_sysclk: 13293068763, unrolled_sysclk_div100MHz: 132.930688, readout_time:1662864776.284457968
event: 1170, tinfo:0x0004, evstatus:0x0002, pps_num:2374, sysclk: 1264605290, sysclk_last_pps: 377007117, sysclk_last_last_pps:1077007641, unrolled_sysclk: 14149507178, unrolled_sysclk_div100MHz: 141.495072, readout_time:1662864784.848644048
cozzyd commented 6 months ago

You can see that at some point sysclk_last_last_pps overtakes sysclk. @barawn , this obviously should "never" happen, but maybe some synchronization issue?

barawn commented 6 months ago

There are FIFOs for each dword, so they can get out of sync. I just have no idea how it would happen for some and not others.

barawn commented 6 months ago

@cozzyd I should clarify that if there's a mechanism for the FIFOs to get out of sync they have to eventually underflow or overflow. They're only 16 deep.

You can directly test the overflow or underflow as is: if you read from the header FIFO address space (0x30100-0x301ff) before enabling the trigger it'll obviously underflow, and that status bit should be set once the event readout starts. (Because it's a benign underflow the data will be perfectly fine).

You can test overflow by just changing the event DMA descriptors to remove one of the header reads (so for example change the length to 7 instead of 8 or something) - after 16 events the 'unread' buffer will overflow and that bit will be set.

Both overflow/underflow are sticky so once they're set, they're set until the FIFO is reset (write 1 to 0x30000).

There are a couple of changes in the firmware I can make to help diagnose it, but I need to poke the PSU guys again since they never got back to me.

cozzyd commented 6 months ago

So if it underflowed/overflowed, you'd expect some high bits to be set in ev_status eventually?

at least based on: https://github.com/RNO-G/firmware-radiant/blob/0201484d0464142d9114a4572d57cd52f94f9ab1/hdl/radiant_event_ctrl.v#L123

But I don't see those set in the data? Or perhaps, the fifo can overflow/underflow without those bits set? (I see that https://docs.xilinx.com/v/u/en-US/fifo_generator_ds317 has overflow/underflow indicators that I don't think you're directly using?). I guess this is why you want me to test it by inducing underflow/overflow (which I haven't done yet...).

I think perhaps we should move this issue (and the related one) to firmware-radiant, so that maybe PSU can lay eyes on it (or even fix it!)

barawn commented 6 months ago

yeah, that's a good summary. And yeah, not using the internal under/overflow indicators would be part of the change, although the logic is identical to what I'm using - I just added those instead of recustomizing the core.

It probably makes sense to create a new issue at firmware-radiant referencing these though. Will do that.