RNO-G / mattak

RNO-G dataformats

Trigger times sometimes off by 1 second #46

Open · shallmann opened 3 months ago

shallmann commented 3 months ago

The current way the trigger time is constructed is insufficient!

The sub-second part is determined from the counter values (sysclk) of a ~100 MHz oscillator, latched at the trigger and at the last two PPS pulses. There is a long-standing issue which might be solvable in firmware, and an approximate solution which can be applied in mattak here: https://github.com/RNO-G/mattak/issues/3
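
In code, the sub-second reconstruction described above looks roughly like this sketch (field names follow the header dumps later in this thread; sysclk_last_pps / sysclk_last_last_pps are the counter values latched at the last two PPS pulses):

#include <cstdint>

// Sub-second part of the trigger time from the free-running sysclk counter
// and the counter values latched at the last two PPS pulses. Unsigned
// subtraction is modular, so the differences stay correct across a 32-bit
// counter rollover.
double subsecond(uint32_t sysclk, uint32_t sysclk_last_pps,
                 uint32_t sysclk_last_last_pps)
{
  uint32_t ticks_since_pps  = sysclk - sysclk_last_pps;
  uint32_t ticks_per_second = sysclk_last_pps - sysclk_last_last_pps; // ~1e8 at nominal 100 MHz
  return (double) ticks_since_pps / (double) ticks_per_second;
}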

Now here is another issue:

The second part is determined from the readout time (minus one if the sub-second part of the trigger time is greater than the sub-second part of the readout time) here: https://github.com/RNO-G/mattak/blob/0143ee2ee7fa01de3abbdcfaa6b84a5dd8ef4dcb/src/Header.cc#L36. The simple "minus 1" from comparing the two fractions is insufficient, because time passes between the trigger and the readout.
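
Schematically, the current assignment is something like the following (a simplified sketch, not the literal Header.cc code; trigger_frac is the sysclk-derived sub-second fraction):

#include <cmath>

// Take the integer second from the CPU readout time, and borrow one second
// if the trigger's sub-second fraction exceeds the readout's.
double readout_frac = readout_time - std::floor(readout_time);
long   trigger_sec  = (long) std::floor(readout_time)
                    - (trigger_frac > readout_frac ? 1 : 0);
double trigger_time = trigger_sec + trigger_frac;
// The borrow can only correct by one second, so this breaks whenever more
// than ~1 s elapses between trigger and readout (and, conversely, when CPU
// clock drift makes readout_time land "before" the trigger).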

Here is a demonstration, trigger times inconsistent with the events before/after are marked in red:

[screenshot]
shallmann commented 3 months ago

[screenshot]

for completeness... this is the current time difference between readout and trigger times stored for the same run

cozzyd commented 3 months ago

I think the problem is that the code might assume that readout_time is always later than the trigger time; but due to clock drift, even with NTP, the CPU clock may fluctuate enough that, despite the time the readout takes, the readout time appears "before" the trigger time.

barawn commented 3 months ago

@cozzyd BTW you can't trust that it's always going to be 100M. It's sysclk - it's always going to be 1/32nd the sampling frequency. So if you slow the sampling freq sysclk slows.

Really I feel like when a run is reprocessed the timing should just be rebuilt based on the extracted sysclks and sanity checks. The sysclk counter itself is free running, so really the PPS values are just there to help figure out the overall time. If you fed everything into a simple digital filter it'd be easy to deal with weird glitches and such.
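
A sketch of what such a filter could look like (the smoothing constant and outlier cut here are made up for illustration; headers is a hypothetical per-event container):

#include <cmath>
#include <cstdint>

// One-pole filter over the measured ticks-per-second, rejecting glitchy
// sysclk_last_pps - sysclk_last_last_pps values before they are used to
// convert sysclk ticks into seconds.
double filtered_tps = 100e6;       // seed at the nominal sysclk rate
const double alpha  = 0.1;         // smoothing constant (arbitrary choice)
for (const auto & h : headers) {
  double meas = (double)(uint32_t)(h.sysclk_last_pps - h.sysclk_last_last_pps);
  if (std::fabs(meas - filtered_tps) < 1e5)  // drop >~0.1% outliers (arbitrary cut)
    filtered_tps += alpha * (meas - filtered_tps);
}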

cozzyd commented 3 months ago

yes, we don't trust it's 100 M, we use sysclk_last_pps - sysclk_last_last_pps (which is off by a second, but close enough). Yes we should probably filter that distribution for post-processing.

For this particular problem, the right thing to do is probably just find the overall offset between pps_count and readout_time over the run.
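
As a sketch, that could look like the following (Header here is a hypothetical struct holding the raw header fields seen in the dumps below; the mode is robust against the occasional event that is read out late):

#include <algorithm>
#include <map>
#include <vector>

// Run-wide integer offset between the CPU clock and the PPS counter,
// taken as the mode of the per-event differences.
long estimate_offset(const std::vector<Header> & headers)
{
  std::map<long, int> counts;
  for (const auto & h : headers)
    counts[(long) h.readout_time_secs - (long) h.pps_count]++;
  return std::max_element(counts.begin(), counts.end(),
           [] (const auto & a, const auto & b) { return a.second < b.second; })->first;
}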

shallmann commented 3 months ago

Ok, this is mainly to document findings, and discuss how to proceed.

  1. Depending on the offset between the trigger and readout clocks, the fraction of a run where the stored trigger time is 1 second early ranges from close to 0 to close to 1... i.e., globally, a sizable fraction, O(up to ~50%), of the trigger times currently stored are off.

  2. I tried two quick, hacky offline fixes.

Although both of them make things better, neither is perfect. I will attach some examples for demonstration below.

shallmann commented 3 months ago

[image]

... this is the problematic case... unless the hardware reads out 20 seconds (!) after the trigger, it looks like PPS ticks are missed. These cases would need some further investigation, but in that case it might be that the sub-second part is also wrong, since we divide by (pps_last - pps_last_last)... These runs will be hard to fix correctly (it involves checking whether the sub-second part is computed correctly, judging by the pps_last/last_last difference, when that difference is too large for only 1 PPS).

shallmann commented 3 months ago

Just for demonstration, a second case of the "much delayed" distribution... which actually looks like the readout could in fact be delayed by a few seconds here: [images]

And this run definitely looks like PPSes are missed: [images]

cozzyd commented 3 months ago

It should never take so long to read out an event, unless somehow the CPU is swamped doing something different for a very long time? (running out of memory or something and waiting for the oom_killer to do something? but that "shouldn't" happen...I've never seen evidence that the oom_killer is doing anything).

In station 24, run 1132, I don't see any evidence of a missed pps in sysclk_last_pps - sysclk_last_last_pps, but I suppose it could happen in a second where there is no event in the next second.

[image]

But this is strange: [image]

And this indicates we are certainly jumping seconds at some point: [image]

cozzyd commented 3 months ago

also no obvious missed PPS problem here either (and in fact you can see the CPU clock jitter in here pretty well)

[image]

cozzyd commented 3 months ago

(Well, it's not really only the CPU clock jitter; it also has to do with when the thread servicing the force triggers is scheduled. Basically, a force trigger is sent if that thread notices it's been longer than X time since the last one; it's not real-time at all. A better way to see the clock jitter would be to look at the readout_time of a calpulser run.)

barawn commented 3 months ago

There are also the status flags in the header, which I'm not sure you ever check; but you'd be seeing weirdness in the event numbers if something happened there.

The roll flags will technically be more accurate than the check to see if the new clock is less than the last clock, but that would only matter if events are separated by a narrow time range - technically I guess I should just capture the entire 48-bit counter and compare that...

ooh I actually just realized I can do that for free if I'm clever, now I might need to write another module that does that

cozzyd commented 3 months ago

Do you have some debouncing on sysclk_last_pps / sysclk_last_last_pps that isn't applied to the pps_num?

barawn commented 3 months ago

No, they're all driven off of the same signal, which is why this doesn't make any sense.

cozzyd commented 3 months ago

Something funky is going on at the end of this run:

[image]

cozzyd commented 3 months ago

it's as if pps_num is jumping by several at a time, but somehow not detected by sysclk_last_pps - sysclk_last_last_pps (which you'd expect to be very small sometimes if it's jumping quickly). OK, we could miss a few since we don't have an event every second, but we shouldn't miss them all?

barawn commented 3 months ago

Wait, no, you do want to look at the status flags, I think: if the FIFO gets completely screwed up that might cause problems.

last sysclk/lastlast sysclk will always match because the data in the FIFO is always in sync

but then you'd be missing event counts and I'm guessing someone would've already checked that??

barawn commented 3 months ago

I'm probably flailing in trying to think through this because, like I said, the logic for the three is identical: [image] And I traced through the signals; there's no cross-clock issue, since these are all purely in sysclk.

cozzyd commented 3 months ago

The event number is monotonically increasing with no gaps. The flags don't currently make it to the ROOT trees, but let me look at the raw data (which just passes through the event status flags from the firmware header without doing anything with them).

root [30] while (rno_g_header_read(f,&h)) { printf("event: %d, tinfo:0x%04x, evstatus:0x%04x, pps_num:%u, sysclk: %u, readout_time:%d.%09d\n",  h.event_number, h.raw_tinfo, h.raw_evstatus, h.pps_count, h.sys_clk, h.readout_time_secs, h.readout_time_nsecs); }
event: 3723, tinfo:0x000a, evstatus:0x0002, pps_num:7128, sysclk: 4275666327, readout_time:1683537314.574915555
event: 3724, tinfo:0x0006, evstatus:0x0004, pps_num:7129, sysclk: 331665521, readout_time:1683537318.085250669
event: 3725, tinfo:0x0006, evstatus:0x0000, pps_num:7133, sysclk: 378448045, readout_time:1683537318.552851555
event: 3726, tinfo:0x0022, evstatus:0x0000, pps_num:7133, sysclk: 971960615, readout_time:1683537324.488429365
event: 3727, tinfo:0x0008, evstatus:0x0000, pps_num:7139, sysclk: 982953318, readout_time:1683537324.598189554
event: 3728, tinfo:0x0004, evstatus:0x0000, pps_num:7139, sysclk: 1959087146, readout_time:1683537334.360621811
event: 3729, tinfo:0x0004, evstatus:0x0000, pps_num:7149, sysclk: 1966213655, readout_time:1683537334.431881206
event: 3730, tinfo:0x0008, evstatus:0x0000, pps_num:7149, sysclk: 1980520428, readout_time:1683537334.574915009
event: 3731, tinfo:0x0004, evstatus:0x0000, pps_num:7149, sysclk: 2490499529, readout_time:1683537339.675179267
event: 3732, tinfo:0x0004, evstatus:0x0000, pps_num:7154, sysclk: 2705908710, readout_time:1683537341.829472178
event: 3733, tinfo:0x0008, evstatus:0x0000, pps_num:7156, sysclk: 2980430980, readout_time:1683537344.574917866
event: 3734, tinfo:0x0004, evstatus:0x0000, pps_num:7159, sysclk: 3289456901, readout_time:1683537347.665305019
event: 3735, tinfo:0x0004, evstatus:0x0000, pps_num:7162, sysclk: 3555758522, readout_time:1683537350.328814615
event: 3736, tinfo:0x0004, evstatus:0x0000, pps_num:7165, sysclk: 3980170864, readout_time:1683537354.572999926
event: 3737, tinfo:0x0008, evstatus:0x0000, pps_num:7169, sysclk: 3980338492, readout_time:1683537354.597746669
event: 3738, tinfo:0x0004, evstatus:0x0000, pps_num:7169, sysclk: 4103717946, readout_time:1683537355.808874554
event: 3739, tinfo:0x0004, evstatus:0x0000, pps_num:7170, sysclk: 4195725877, readout_time:1683537356.728974600
event: 3740, tinfo:0x0004, evstatus:0x0006, pps_num:7171, sysclk: 459851251, readout_time:1683537362.320218180
event: 3741, tinfo:0x0020, evstatus:0x0002, pps_num:7177, sysclk: 515406850, readout_time:1683537362.875822737
event: 3742, tinfo:0x000a, evstatus:0x0002, pps_num:7177, sysclk: 685299474, readout_time:1683537364.574862643

From the firmware:

// 5: Status flags. Bits 2/1/0 are event number/count/PPS rollover indicators.
// This is a bit pointless since you should *know* when that happens, but, hey.
// Makes most sense for count, but even then not really, since rolls only happen
// every 43 seconds. I might need to do something to indicate when the data's
// garbage (due to being triggered when dead), but I'll need to improve the DMA
// engine for that anyway, probably by adding dynamic request types to the engine

It seems like the problem is somehow when bit 2 is set? Those events seem to take a really long time to read out for some reason?

But then everything is self-consistent: a long readout, plus basing the second part of the trigger on the readout_time, explains everything that's wrong. The right thing to do is derive trigger_time from pps_num and sys_clk, with some logic to find the mode of the offset between readout_time and pps_num.
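
Putting those pieces together, the per-event reconstruction would be something like this sketch (offset from the mode-finding idea above; same hypothetical field names as earlier):

// Rebuild the trigger time from the PPS counter plus the run-wide offset,
// with the sub-second fraction from the sysclk counters; readout_time is
// then only used to anchor the offset, not per event.
double trigger_time = (double) (h.pps_count + offset)
                    + (double) (uint32_t) (h.sys_clk - h.sysclk_last_pps)
                      / (double) (uint32_t) (h.sysclk_last_pps - h.sysclk_last_last_pps);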

barawn commented 3 months ago

Yeah, that wording's wrong. Bit 2 is the sysclk counter roll indicator: as in, the last event's count had a different sysclk[32] epoch. It's the same thing as your check to see if the sysclk count dropped since the last event.

barawn commented 3 months ago

but uh, bit 1 should never be set, like, never...? What the hell?

Bit 0 is event count rollover; bit 1 is PPS count rollover, which should take 136 years.

barawn commented 3 months ago

Never mind, that's a dumbass bug: the roll sec indicator is checking to see if sysclk[32] matches pps[32]. Oops. Doesn't matter though.

barawn commented 3 months ago

FWIW that idiot bug means that bit[1] is actually the sysclk[32] epoch because the PPS bit [32] is always zero.

Not sure why you think those take longer to read out? The deltas between the sysclk/readout times don't seem any different. Like, the time difference between 3740's and 3741's readout/sysclk time differences is ~31 microseconds (assuming a perfect 100 MHz), which is consistent considering the readout clock drift looks like ~6 microseconds/second.

e.g.

event  readout_time          sysclk (unrolled)
3739   1683537356.728974600  4195725877
3740   1683537362.320218180  4754818547
3741   1683537362.875822737  4810374146

delta-T between 3739/3740 is 5591240000 ns in readout, 559092670 clocks = 10.0005603722188 ns/clk
delta-T between 3740/3741 is 555609920 ns in readout, 55555599 clocks = 10.000970739 ns/clk
etc. (drift of a few parts in 1E-6)

cozzyd commented 3 months ago

Yeah that was me miscounting the number of digits in the number after the rollover...

But the strange part is that pps_count is not incrementing properly, then, almost like it's the pps_count of the wrong event.

barawn commented 3 months ago

Are you reading the table right? It's aligned wrong on GitHub. Once I fixed it, it seemed OK to me.

cozzyd commented 3 months ago

I just fixed the formatting so it's easier to read. It looks like the pps_num is off by one event, right? Like the jump from 7171 to 7177 happens at the wrong time.

barawn commented 3 months ago

weird, I thought I realigned it and it looked right. But yeah, that's what it looks like?

Super don't understand how that could possibly happen. Technically the dword FIFOs are all separate (which they probably shouldn't be) so in this case it's like there was a read that didn't happen? But that's the point of the underflow indicator, so that should've freaked out.

It'd be good to find the point at which it slipped if that's the case...

cozzyd commented 3 months ago

Based on the earlier plot, seems to happen somewhere around here:

root [51] rno_g_init_handle(&f,"header/003135.hd.dat.gz","r")
(int) 0
root [52] while (rno_g_header_read(f,&h)) { uint64_t sysclk = h.sys_clk; if (sysclk < last_sysclk) { nrollover++; }  last_sysclk = sysclk; sysclk += nrollover * (1ull << 32); printf("event: %d, tinfo:0x%04x, evstatus:0x%04x, pps_num:%u, unrolled_sysclk: %lu, unrolled_sysclk_div100MHz: %f, readout_time:%d.%09d\n",  h.event_number, h.raw_tinfo, h.raw_evstatus, h.pps_count, sysclk, sysclk/100e6, h.readout_time_secs, h.readout_time_nsecs); }
event: 3135, tinfo:0x0009, evstatus:0x0002, pps_num:5989, unrolled_sysclk: 1955794502, unrolled_sysclk_div100MHz: 19.557945, readout_time:1683536174.574924846
event: 3136, tinfo:0x0005, evstatus:0x0002, pps_num:5989, unrolled_sysclk: 1990127969, unrolled_sysclk_div100MHz: 19.901280, readout_time:1683536174.918361731
event: 3137, tinfo:0x0005, evstatus:0x0002, pps_num:5993, unrolled_sysclk: 2392977398, unrolled_sysclk_div100MHz: 23.929774, readout_time:1683536178.948016658
event: 3138, tinfo:0x0005, evstatus:0x0002, pps_num:5994, unrolled_sysclk: 2422302436, unrolled_sysclk_div100MHz: 24.223024, readout_time:1683536179.240955944
event: 3139, tinfo:0x0005, evstatus:0x0002, pps_num:5996, unrolled_sysclk: 2651104911, unrolled_sysclk_div100MHz: 26.511049, readout_time:1683536181.533026621
event: 3140, tinfo:0x0009, evstatus:0x0002, pps_num:5999, unrolled_sysclk: 2955595561, unrolled_sysclk_div100MHz: 29.555956, readout_time:1683536184.574914472
event: 3141, tinfo:0x0021, evstatus:0x0002, pps_num:6006, unrolled_sysclk: 3681645118, unrolled_sysclk_div100MHz: 36.816451, readout_time:1683536191.836883402
event: 3142, tinfo:0x0004, evstatus:0x0002, pps_num:6009, unrolled_sysclk: 3941630526, unrolled_sysclk_div100MHz: 39.416305, readout_time:1683536194.437172251
event: 3143, tinfo:0x0008, evstatus:0x0002, pps_num:6009, unrolled_sysclk: 3958567644, unrolled_sysclk_div100MHz: 39.585676, readout_time:1683536194.606531077
event: 3144, tinfo:0x0004, evstatus:0x0002, pps_num:6011, unrolled_sysclk: 4110527859, unrolled_sysclk_div100MHz: 41.105279, readout_time:1683536196.126425332
event: 3145, tinfo:0x0004, evstatus:0x0002, pps_num:6012, unrolled_sysclk: 4277426003, unrolled_sysclk_div100MHz: 42.774260, readout_time:1683536197.795518833
event: 3146, tinfo:0x0004, evstatus:0x0004, pps_num:6014, unrolled_sysclk: 4493572600, unrolled_sysclk_div100MHz: 44.935726, readout_time:1683536199.957584771
event: 3147, tinfo:0x0004, evstatus:0x0000, pps_num:6016, unrolled_sysclk: 4613449630, unrolled_sysclk_div100MHz: 46.134496, readout_time:1683536201.156610570
event: 3148, tinfo:0x0004, evstatus:0x0000, pps_num:6016, unrolled_sysclk: 4690119944, unrolled_sysclk_div100MHz: 46.901199, readout_time:1683536201.923421239
event: 3149, tinfo:0x0004, evstatus:0x0000, pps_num:6017, unrolled_sysclk: 4778837367, unrolled_sysclk_div100MHz: 47.788374, readout_time:1683536202.810594250
event: 3150, tinfo:0x0004, evstatus:0x0000, pps_num:6017, unrolled_sysclk: 4787281817, unrolled_sysclk_div100MHz: 47.872818, readout_time:1683536202.895055225
event: 3151, tinfo:0x0004, evstatus:0x0000, pps_num:6018, unrolled_sysclk: 4813761567, unrolled_sysclk_div100MHz: 48.137616, readout_time:1683536203.159810744
event: 3152, tinfo:0x0004, evstatus:0x0000, pps_num:6018, unrolled_sysclk: 4868798734, unrolled_sysclk_div100MHz: 48.687987, readout_time:1683536203.710568249
event: 3153, tinfo:0x0004, evstatus:0x0000, pps_num:6018, unrolled_sysclk: 4884190810, unrolled_sysclk_div100MHz: 48.841908, readout_time:1683536203.864536880
event: 3154, tinfo:0x0004, evstatus:0x0000, pps_num:6018, unrolled_sysclk: 4890563662, unrolled_sysclk_div100MHz: 48.905637, readout_time:1683536203.928022873
event: 3155, tinfo:0x0004, evstatus:0x0000, pps_num:6019, unrolled_sysclk: 4899052251, unrolled_sysclk_div100MHz: 48.990523, readout_time:1683536204.013163250
event: 3156, tinfo:0x0008, evstatus:0x0000, pps_num:6019, unrolled_sysclk: 4955220084, unrolled_sysclk_div100MHz: 49.552201, readout_time:1683536204.574896525
event: 3157, tinfo:0x0004, evstatus:0x0000, pps_num:6020, unrolled_sysclk: 5036667485, unrolled_sysclk_div100MHz: 50.366675, readout_time:1683536205.389672181
event: 3158, tinfo:0x0004, evstatus:0x0000, pps_num:6022, unrolled_sysclk: 5240561547, unrolled_sysclk_div100MHz: 52.405615, readout_time:1683536207.428935411
event: 3159, tinfo:0x0004, evstatus:0x0000, pps_num:6026, unrolled_sysclk: 5615952911, unrolled_sysclk_div100MHz: 56.159529, readout_time:1683536211.183431464
event: 3160, tinfo:0x0004, evstatus:0x0000, pps_num:6028, unrolled_sysclk: 5875613061, unrolled_sysclk_div100MHz: 58.756131, readout_time:1683536213.780497714
event: 3161, tinfo:0x0008, evstatus:0x0000, pps_num:6029, unrolled_sysclk: 5955050368, unrolled_sysclk_div100MHz: 59.550504, readout_time:1683536214.575015643
event: 3162, tinfo:0x0004, evstatus:0x0000, pps_num:6031, unrolled_sysclk: 6173146108, unrolled_sysclk_div100MHz: 61.731461, readout_time:1683536216.756333067
event: 3163, tinfo:0x0020, evstatus:0x0000, pps_num:6032, unrolled_sysclk: 6200119495, unrolled_sysclk_div100MHz: 62.001195, readout_time:1683536217.026130920
event: 3164, tinfo:0x0006, evstatus:0x0000, pps_num:6033, unrolled_sysclk: 6318970355, unrolled_sysclk_div100MHz: 63.189704, readout_time:1683536218.214638870
event: 3165, tinfo:0x0006, evstatus:0x0000, pps_num:6033, unrolled_sysclk: 6376258554, unrolled_sysclk_div100MHz: 63.762586, readout_time:1683536218.787814999
event: 3166, tinfo:0x0006, evstatus:0x0000, pps_num:6033, unrolled_sysclk: 6469454716, unrolled_sysclk_div100MHz: 64.694547, readout_time:1683536219.719980862
event: 3167, tinfo:0x0006, evstatus:0x0000, pps_num:6034, unrolled_sysclk: 6519561246, unrolled_sysclk_div100MHz: 65.195612, readout_time:1683536220.220869928
event: 3168, tinfo:0x0006, evstatus:0x0000, pps_num:6035, unrolled_sysclk: 6600930294, unrolled_sysclk_div100MHz: 66.009303, readout_time:1683536221.034967955
event: 3169, tinfo:0x0006, evstatus:0x0000, pps_num:6036, unrolled_sysclk: 6689372783, unrolled_sysclk_div100MHz: 66.893728, readout_time:1683536221.919540816
event: 3170, tinfo:0x000a, evstatus:0x0000, pps_num:6036, unrolled_sysclk: 6954888686, unrolled_sysclk_div100MHz: 69.548887, readout_time:1683536224.575132671
event: 3171, tinfo:0x0022, evstatus:0x0000, pps_num:6039, unrolled_sysclk: 7860439903, unrolled_sysclk_div100MHz: 78.604399, readout_time:1683536233.632193536
root [53] rno_g_init_handle(&f,"header/003172.hd.dat.gz","r")
(int) 0
root [54] while (rno_g_header_read(f,&h)) { uint64_t sysclk = h.sys_clk; if (sysclk < last_sysclk) { nrollover++; }  last_sysclk = sysclk; sysclk += nrollover * (1ull << 32); printf("event: %d, tinfo:0x%04x, evstatus:0x%04x, pps_num:%u, unrolled_sysclk: %lu, unrolled_sysclk_div100MHz: %f, readout_time:%d.%09d\n",  h.event_number, h.raw_tinfo, h.raw_evstatus, h.pps_count, sysclk, sysclk/100e6, h.readout_time_secs, h.readout_time_nsecs); }
event: 3172, tinfo:0x0009, evstatus:0x0000, pps_num:6048, unrolled_sysclk: 7954703969, unrolled_sysclk_div100MHz: 79.547040, readout_time:1683536234.574950693
event: 3173, tinfo:0x0005, evstatus:0x0000, pps_num:6049, unrolled_sysclk: 8056833339, unrolled_sysclk_div100MHz: 80.568333, readout_time:1683536235.596454899
event: 3174, tinfo:0x0005, evstatus:0x0000, pps_num:6050, unrolled_sysclk: 8275274021, unrolled_sysclk_div100MHz: 82.752740, readout_time:1683536237.781103623
event: 3175, tinfo:0x0021, evstatus:0x0006, pps_num:6052, unrolled_sysclk: 8954165839, unrolled_sysclk_div100MHz: 89.541658, readout_time:1683536244.571223239
event: 3176, tinfo:0x000a, evstatus:0x0002, pps_num:6059, unrolled_sysclk: 8955985720, unrolled_sysclk_div100MHz: 89.559857, readout_time:1683536244.601327938
event: 3177, tinfo:0x0022, evstatus:0x0002, pps_num:6059, unrolled_sysclk: 8960229245, unrolled_sysclk_div100MHz: 89.602292, readout_time:1683536244.631681119
event: 3178, tinfo:0x0006, evstatus:0x0002, pps_num:6059, unrolled_sysclk: 9192678201, unrolled_sysclk_div100MHz: 91.926782, readout_time:1683536246.956853926
event: 3179, tinfo:0x0006, evstatus:0x0002, pps_num:6061, unrolled_sysclk: 9201089972, unrolled_sysclk_div100MHz: 92.010900, readout_time:1683536247.044989334
event: 3180, tinfo:0x0006, evstatus:0x0002, pps_num:6062, unrolled_sysclk: 9327389173, unrolled_sysclk_div100MHz: 93.273892, readout_time:1683536248.304079175
event: 3181, tinfo:0x0006, evstatus:0x0002, pps_num:6063, unrolled_sysclk: 9724925895, unrolled_sysclk_div100MHz: 97.249259, readout_time:1683536252.280074575
event: 3182, tinfo:0x0006, evstatus:0x0002, pps_num:6067, unrolled_sysclk: 9745096647, unrolled_sysclk_div100MHz: 97.450966, readout_time:1683536252.481555807
event: 3183, tinfo:0x0006, evstatus:0x0002, pps_num:6067, unrolled_sysclk: 9844348133, unrolled_sysclk_div100MHz: 98.443481, readout_time:1683536253.474450699
event: 3184, tinfo:0x000a, evstatus:0x0002, pps_num:6068, unrolled_sysclk: 9954376095, unrolled_sysclk_div100MHz: 99.543761, readout_time:1683536254.574917050
event: 3185, tinfo:0x0006, evstatus:0x0002, pps_num:6069, unrolled_sysclk: 10078319546, unrolled_sysclk_div100MHz: 100.783195, readout_time:1683536255.814539850
event: 3186, tinfo:0x0006, evstatus:0x0002, pps_num:6070, unrolled_sysclk: 10354690806, unrolled_sysclk_div100MHz: 103.546908, readout_time:1683536258.578663952
event: 3187, tinfo:0x0022, evstatus:0x0002, pps_num:6073, unrolled_sysclk: 10497664909, unrolled_sysclk_div100MHz: 104.976649, readout_time:1683536260.008625764
event: 3188, tinfo:0x0006, evstatus:0x0002, pps_num:6075, unrolled_sysclk: 10558835535, unrolled_sysclk_div100MHz: 105.588355, readout_time:1683536260.620468019
event: 3189, tinfo:0x0006, evstatus:0x0002, pps_num:6075, unrolled_sysclk: 10705017201, unrolled_sysclk_div100MHz: 107.050172, readout_time:1683536262.082474446
event: 3190, tinfo:0x0006, evstatus:0x0002, pps_num:6077, unrolled_sysclk: 10802007744, unrolled_sysclk_div100MHz: 108.020077, readout_time:1683536263.052537424
event: 3191, tinfo:0x000a, evstatus:0x0002, pps_num:6078, unrolled_sysclk: 10954224059, unrolled_sysclk_div100MHz: 109.542241, readout_time:1683536264.574914147
event: 3192, tinfo:0x0006, evstatus:0x0002, pps_num:6079, unrolled_sysclk: 11450827520, unrolled_sysclk_div100MHz: 114.508275, readout_time:1683536269.541741728
event: 3193, tinfo:0x0006, evstatus:0x0002, pps_num:6084, unrolled_sysclk: 11558971107, unrolled_sysclk_div100MHz: 115.589711, readout_time:1683536270.623314225
event: 3194, tinfo:0x000a, evstatus:0x0002, pps_num:6085, unrolled_sysclk: 11954086056, unrolled_sysclk_div100MHz: 119.540861, readout_time:1683536274.575030281
event: 3195, tinfo:0x0006, evstatus:0x0002, pps_num:6089, unrolled_sysclk: 12099253119, unrolled_sysclk_div100MHz: 120.992531, readout_time:1683536276.026931333
event: 3196, tinfo:0x0006, evstatus:0x0002, pps_num:6091, unrolled_sysclk: 12234631681, unrolled_sysclk_div100MHz: 122.346317, readout_time:1683536277.380979528
event: 3197, tinfo:0x0006, evstatus:0x0002, pps_num:6092, unrolled_sysclk: 12253577341, unrolled_sysclk_div100MHz: 122.535773, readout_time:1683536277.570424451
event: 3198, tinfo:0x0006, evstatus:0x0002, pps_num:6092, unrolled_sysclk: 12354526201, unrolled_sysclk_div100MHz: 123.545262, readout_time:1683536278.580061080
event: 3199, tinfo:0x0006, evstatus:0x0002, pps_num:6093, unrolled_sysclk: 12500412677, unrolled_sysclk_div100MHz: 125.004127, readout_time:1683536280.038846967
event: 3200, tinfo:0x000a, evstatus:0x0004, pps_num:6095, unrolled_sysclk: 12953928060, unrolled_sysclk_div100MHz: 129.539281, readout_time:1683536284.574919684
event: 3201, tinfo:0x0006, evstatus:0x0000, pps_num:6099, unrolled_sysclk: 13172693326, unrolled_sysclk_div100MHz: 131.726933, readout_time:1683536286.762883552
event: 3202, tinfo:0x0006, evstatus:0x0000, pps_num:6101, unrolled_sysclk: 13211945000, unrolled_sysclk_div100MHz: 132.119450, readout_time:1683536287.155478440
event: 3203, tinfo:0x0006, evstatus:0x0000, pps_num:6102, unrolled_sysclk: 13460597592, unrolled_sysclk_div100MHz: 134.605976, readout_time:1683536289.642362843
event: 3204, tinfo:0x0022, evstatus:0x0000, pps_num:6104, unrolled_sysclk: 13488427104, unrolled_sysclk_div100MHz: 134.884271, readout_time:1683536289.920453327
event: 3205, tinfo:0x0005, evstatus:0x0000, pps_num:6104, unrolled_sysclk: 13536209347, unrolled_sysclk_div100MHz: 135.362093, readout_time:1683536290.398582903
event: 3206, tinfo:0x0005, evstatus:0x0000, pps_num:6105, unrolled_sysclk: 13569360881, unrolled_sysclk_div100MHz: 135.693609, readout_time:1683536290.730112017
event: 3207, tinfo:0x0005, evstatus:0x0000, pps_num:6105, unrolled_sysclk: 13630082012, unrolled_sysclk_div100MHz: 136.300820, readout_time:1683536291.337434112
event: 3208, tinfo:0x0005, evstatus:0x0000, pps_num:6106, unrolled_sysclk: 13688468225, unrolled_sysclk_div100MHz: 136.884682, readout_time:1683536291.921201517
event: 3209, tinfo:0x0005, evstatus:0x0000, pps_num:6106, unrolled_sysclk: 13844279823, unrolled_sysclk_div100MHz: 138.442798, readout_time:1683536293.479695826

Nothing super obvious that would cause it...

barawn commented 3 months ago

Yeah, it looks like it's here.

event: 3163, tinfo:0x0020, evstatus:0x0000, pps_num:6032, unrolled_sysclk: 6200119495, unrolled_sysclk_div100MHz: 62.001195, readout_time:1683536217.026130920
event: 3164, tinfo:0x0006, evstatus:0x0000, pps_num:6033, unrolled_sysclk: 6318970355, unrolled_sysclk_div100MHz: 63.189704, readout_time:1683536218.214638870
event: 3165, tinfo:0x0006, evstatus:0x0000, pps_num:6033, unrolled_sysclk: 6376258554, unrolled_sysclk_div100MHz: 63.762586, readout_time:1683536218.787814999
event: 3166, tinfo:0x0006, evstatus:0x0000, pps_num:6033, unrolled_sysclk: 6469454716, unrolled_sysclk_div100MHz: 64.694547, readout_time:1683536219.719980862

really weird

cozzyd commented 3 months ago

So the pps_count thing is probably going to be tracked as a separate issue on firmware-radiant.

As for the original issue, we can implement fix #1 easily enough (pick the closest second to readout_time), but that may have reliability issues if the CPU readout time drifts too much, or if it's possible for readout to take longer than 0.5 seconds (it... shouldn't, but who knows, there might be some strange thing happening on the SBC where the acquisition thread can't be serviced for a long time under strange conditions).
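
For reference, fix #1 amounts to something like this sketch (trigger_frac again being the sysclk-derived sub-second fraction):

#include <cmath>

// Choose the integer second that puts the trigger time closest to
// readout_time; tolerates up to +/- 0.5 s of drift plus readout latency.
long   sec          = std::lround(readout_time - trigger_frac);
double trigger_time = (double) sec + trigger_frac;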

While pps_count is unreliable, we should check if sysclk is monotonically increasing, because if it is, we can use it to derive the timing, even if sysclk_last_pps and sysclk_last_last_pps aren't always correct for the event (they should at least correspond to SOME pps).
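
A monotonicity check could look something like this sketch (a backwards step in sysclk is either a genuine 32-bit wrap, which at ~100 MHz happens every ~43 s and therefore drops by a large amount, or a sign that the header words are out of sync; headers is the same hypothetical container as above):

#include <cstdint>
#include <cstdio>

uint32_t last_sysclk = 0;
bool     first       = true;
for (const auto & h : headers) {
  if (!first && h.sys_clk < last_sysclk) {
    uint32_t drop = last_sysclk - h.sys_clk;
    if (drop < 0x80000000u)  // too small to be a plausible 32-bit rollover
      printf("suspicious sysclk step at event %d\n", h.event_number);
  }
  last_sysclk = h.sys_clk;
  first = false;
}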

cozzyd commented 3 months ago

While the firmware stuff is being tracked elsewhere, we can keep this issue open to discuss post-facto fixes.

In the case of station 21, run 1120, we can see that readout_time - trigger_time starts to have values not near 0 or 1 at some point in the run:

[image]

By plotting the subsecond readout and trigger time separately, you can plainly see that the sysclk becomes desynchronized.

[image]

I actually expected to see identical sysclk for two events, but they are different, though only by about ~1-2 ms. Is it possible for two triggers to be separated by so little? (How long does reading out the LAB4D take?)

[image]

cozzyd commented 3 months ago

We also see pps_num get desynchronized here:

[image]

Yeah, OK, so I wonder if the problem here could be that "dead for waveforms" != "dead for trigger metadata"? If we receive a trigger at the "wrong time", we may record metadata but not waveforms. Then later, we get the waveforms and they get associated with the wrong metadata?

barawn commented 3 months ago

The LAB4D readout is O(ms). Depends on what the prescale in the readout is set to but I think it's set to read out at 25 MHz.

cozzyd commented 3 months ago

The other maybe obvious thing from the plot is that the event rate is very high when things go wrong: [image]

So... the FIFO should probably be overflowing at this time, right?

barawn commented 3 months ago

Yeah, which is why I'm wondering if the overflow/underflow bit is messed up somehow.