How to get dead time from monitoring information?

maxnoe commented 3 years ago

UCTS comes with a busy_counter that should tell us the exact dead time.

Once UCTS is reliable, we should use that instead of the estimation implemented in #605.

The data is in event.lst.tel[1].evt.ucts_busy_counter. I don't know yet how these values must be interpreted.

maxnoe commented 3 years ago

There seems to be a real need for this, since there is a non-zero amount of dead time that does not seem to come from the DAQ of cosmic events and would therefore not be covered by the estimation implemented in lstchain.

See this plot of ucts_busy_counter vs. ucts_timestamp after fixing the dtypes in ctapipe_io_lst:

[Plot: ucts_busy_counter vs. ucts_timestamp]

This is for the first 53k events of run 2610.

maxnoe commented 3 years ago

@mdpunch

Could you maybe confirm or correct that we have to compare the UCTS busy counter with the UCTS clock counter to get the correct dead time fraction?

DirkHoffmann commented 3 years ago

Hello @maxnoe, this plot deserves some investigation indeed. Please open a ticket in the LST-DAQ section for that.

IMHO, you are starting from wrong assumptions. This statement is not entirely correct:

UCTS comes with a busy_counter that should tell us the exact dead time.

The busy counter (if that is how you call the "event busy counter", according to the official documentation) does not tell the dead time, but the number of events triggered during busy time (hence not read out). For example, two triggers arriving in a "pure front-end dead-time" period, within the 7.8 µs dead time of a previously triggered event, each account for at most … 7.8 µs. But the same two triggers taken in a period with additional second-order dead-time (from network or storage congestion) would account for approximately the inverse of the average busy trigger rate during that period. This topic is too complex to be discussed on the shoulder of a github ticket. Feel free to call me.

maxnoe commented 3 years ago

The busy counter (if that is how you call the "event busy counter", according to the official documentation) does not tell the dead time, but the number of events triggered during busy time (hence not read out).

I don't know in which scenario the number of missed events would be needed. The actual busy time however, is needed for any physics analysis.

How (other than the flawed method of estimating it via Poisson statistics) are we supposed to retrieve this information?

moralejo commented 3 years ago

I don't know in which scenario the number of missed events would be needed. The actual busy time however, is needed for any physics analysis.

Isn't the former a proxy of the latter?

maxnoe commented 3 years ago

Isn't the former a proxy of the latter?

Not really, since any number of events can come in the same busy time window, including 0.

moralejo commented 3 years ago

"Statistical proxy" if you want. Then, since we know the interleaved rate, we know how many are physics events. Hence you can correct (again, statistically) for the lost events, which is equivalent to calculating effective time I think.

maxnoe commented 3 years ago

Then we are full circle back again to using a statistical estimate of some higher level data instead of a value provided by the monitoring stream.
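The two positions in the exchange above can be checked with a toy Monte Carlo. The sketch below assumes a purely Poissonian trigger and a fixed, non-paralyzable front-end dead time; the 15 kHz rate and the function name are illustrative choices, and only the 7.8 µs figure comes from this thread. It shows both that a single busy window can hold any number of busy triggers (including 0) and that, summed over a run, the fraction of busy triggers follows the front-end dead-time fraction.

```python
import numpy as np

def simulate_busy_counts(rate_hz, dead_time_s, duration_s, seed=0):
    """Return, for each recorded event, the number of triggers lost in its busy window."""
    rng = np.random.default_rng(seed)
    n_triggers = rng.poisson(rate_hz * duration_s)
    triggers = np.sort(rng.uniform(0.0, duration_s, n_triggers))
    busy_per_event = []
    busy_until = -np.inf
    for t in triggers:
        if t >= busy_until:
            # Detector is free: record the event and open a new busy window.
            busy_per_event.append(0)
            busy_until = t + dead_time_s
        else:
            # Detector is busy: this trigger only increments the busy count.
            busy_per_event[-1] += 1
    return np.array(busy_per_event)

dead_time_s, duration_s = 7.8e-6, 5.0
busy = simulate_busy_counts(rate_hz=15_000, dead_time_s=dead_time_s, duration_s=duration_s)

n_recorded = len(busy)
n_busy = busy.sum()
lost_fraction = n_busy / (n_recorded + n_busy)        # statistical proxy
dead_fraction = n_recorded * dead_time_s / duration_s  # actual front-end busy time

print(f"busy triggers per window: min={busy.min()}, max={busy.max()}")
print(f"lost fraction {lost_fraction:.4f} vs dead-time fraction {dead_fraction:.4f}")
```

The agreement holds only under these assumptions. Bursts of non-Poissonian triggers or second-order dead time, as discussed in this thread, are not modelled here.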

maxnoe commented 3 years ago

This topic is too complex to be discussed on the shoulder of a github ticket. Feel free to call me.

@DirkHoffmann While I appreciate the offer and might take you up on it to get started, I think the proper way of obtaining dead time should be documented somewhere, not just passed from one person to another in a phone call. This is a centrally important quantity for all analyses and that information must be available in the monitoring data stream.

So could you maybe at least sketch the proper way of obtaining dead time information here or link the corresponding documentation?

mdpunch commented 3 years ago

Hi @maxnoe , I just saw that you pinged me on this.

Could you maybe confirm or correct that we have to compare the UCTS busy counter with the UCTS clock counter to get the correct dead time fraction?

From your plot, it looks like the busy counter is incrementing in bursts, which is strange behaviour.

It should not increment by more than one at a time (and each time should have an accompanying time-stamp), so if it is incrementing by more than 1, then likely the TIB is sending the UCTS triggers at a rate higher than the agreed-upon minimum time (500 ns, so the TIB can also send us the TriggerType/StereoPattern info).

But it could also be some kind of real effect (flashes on the camera? Shooting stars?) giving a few hundred kHz false rate (false = non-Cherenkov), in which case if the camera is triggering like crazy, then for each non-busy trigger the TIB can fit in up to 14 busy events within the DRS read-out deadtime (7.2 µs deadtime / 500 ns minimum between events).
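The bound of 14 follows from integer division of the two numbers quoted in this comment. A tiny check, using only the 7.2 µs read-out dead time and the 500 ns minimum trigger spacing stated here (the variable names are purely illustrative):

```python
# Values quoted in the comment, expressed in nanoseconds to stay in integers.
drs_dead_time_ns = 7200        # DRS read-out dead time (7.2 µs)
min_trigger_spacing_ns = 500   # agreed minimum time between TIB triggers

# Whole 500 ns slots that fit inside one dead-time window (an upper bound).
max_busy_per_readout = drs_dead_time_ns // min_trigger_spacing_ns

# Trigger rate that would be needed to fill every slot.
max_trigger_rate_hz = 1e9 / min_trigger_spacing_ns

print(max_busy_per_readout, max_trigger_rate_hz)
```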

It should be relatively easy to check from the data which of these hypotheses is the correct one.

By coincidence, I have just written a note for NectarCAM (and LST, if they want it) on an MC evaluation of the deadtime. Based on it, I would disagree with your parenthetical remark "other than the flawed method of estimating it via Poisson statistics", since Poisson statistics seem to me to be the optimal method of estimating the dead-time.

But ideally, if you could provide somehow the event time-stamps and trigger types, even just for this run, I could take a look in more detail (even just a numpy array or whatever would be fine).

I would be happy to share this MC dead-time note once it has gone through a sanity-check in NectarCAM, and especially if I can apply it to real data or incorporate some lessons from real data in it!

mdpunch commented 3 years ago

I have checked with HESS, and for the dead-time calculation we indeed use the Poisson statistics. This is described nicely in Gerrit Spengler's PhD thesis (HU Berlin) in Appendix A, which is of course public, here: https://www.physik.hu-berlin.de/de/eephys/HESS/theses/pdfs/gerrit_finalPhD.pdf This is much more detailed than my note, and the calculation is used as one of the monitoring and run-quality criteria for HESS, and indeed is used in correcting fluxes for the dead-time.
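For readers who want the basic idea without opening the thesis, here is a minimal sketch of a Poisson-based estimate. It assumes a purely Poissonian trigger and a fixed, non-paralyzable dead time, so that the time between consecutive recorded events is the dead time plus an exponentially distributed waiting time. It is not the procedure from the thesis, which should be consulted for the actual method; the function name and the 10 kHz test rate are hypothetical.

```python
import numpy as np

def estimate_dead_time(timestamps_s):
    """Estimate (true rate, dead time, dead-time fraction) from recorded event times."""
    delta_t = np.diff(np.sort(timestamps_s))
    # With a fixed dead time, no two recorded events can be closer than it.
    dead_time_s = delta_t.min()
    # The remainder of each gap is exponential with mean 1 / true_rate.
    true_rate_hz = 1.0 / (delta_t.mean() - dead_time_s)
    recorded_rate_hz = 1.0 / delta_t.mean()
    dead_fraction = 1.0 - recorded_rate_hz / true_rate_hz
    return true_rate_hz, dead_time_s, dead_fraction

# Synthetic check: gaps between recorded events are dead time + exponential wait.
rng = np.random.default_rng(42)
true_rate, tau = 10_000.0, 7.8e-6
gaps = tau + rng.exponential(scale=1.0 / true_rate, size=200_000)
timestamps = np.cumsum(gaps)

rate_est, tau_est, dead_est = estimate_dead_time(timestamps)
print(f"rate {rate_est:.0f} Hz, dead time {tau_est * 1e6:.2f} µs, dead fraction {dead_est:.4f}")
```

Taking the minimum of the time differences is a crude stand-in for a fit to the full distribution and is sensitive to outliers in real data.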

maxnoe commented 3 years ago

Ok, that's a bit unfortunate I think, since the Poisson estimation really only works if you have pure Poisson statistics and no external noise sources (LIDARs, car flashes, satellites, ...).

In FACT, we have a busy time counter which is stored in the monitoring stream every five seconds.

moralejo commented 3 years ago

I also think the Poisson statistics method is perfectly ok for healthy data. A complication on our side is the presence of interleaved events, which alter the delta_t distribution. But for LST the rate is low enough (compared to cosmics') so that the effect is quite negligible - we have an MC test here: https://github.com/cta-observatory/cta-lstchain/blob/1c224a36a8048f3a90b84e854ed7355d2ffa9449/lstchain/reco/tests/test_utils.py#L91

The info on missed events will anyway be very useful to check the data are healthy.

DirkHoffmann commented 3 years ago

But ideally, if you could provide somehow the event time-stamps and trigger types, even just for this run, I could take a look in more detail (even just a numpy array or whatever would be fine).

@mdpunch, I filed your request in CTA::LST::iDAQ #44197. NB: The run number given above is incorrect. The run in question is r2610 (as correctly reported in CTA::LST::iDAQ #44124).

maxnoe commented 3 years ago

@DirkHoffmann @mdpunch I corrected the typo above and posted a link to the data in the Redmine.

mdpunch commented 3 years ago

Hi @maxnoe

Thanks for the file, I will take a look.

Also with non-Poisson dead-time introduced by external sources (including interleaved events as mentioned by @moralejo), my own investigation (if I believe my MCs) tells me that the Poisson stats are perfectly able to find the dead-time due to these:

I have checked for a 15kHz LST rate, adding 1.5kHz interleaved "Calibration events", either random or periodic, and even if these extra events somehow disappear and also the "busy-time-tagged" events disappear, I can still determine the dead-time to better than 0.2% per minute.
I have checked adding a periodic loss window of 0.1 seconds per second, with the same result of this introduced dead-time also being correctly estimated.
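The second check can be reproduced in spirit with a few lines. This is not the MC from the note mentioned above, only a simplified sketch under stated assumptions: Poissonian triggers at 15 kHz, a fixed non-paralyzable front-end dead time of 7.2 µs, and a loss window of 0.1 s at the start of every second. Interleaved events are left out, and a simple cut on large gaps (the hypothetical `max_gap_s` parameter) replaces a proper fit to the time-difference distribution.

```python
import numpy as np

def simulate_run(rate_hz, dead_time_s, duration_s, loss_period_s, loss_length_s, seed=0):
    """Recorded event times with front-end dead time plus a periodic loss window."""
    rng = np.random.default_rng(seed)
    n_triggers = rng.poisson(rate_hz * duration_s)
    triggers = np.sort(rng.uniform(0.0, duration_s, n_triggers))
    # Second-order dead time: every trigger inside the loss window disappears.
    triggers = triggers[(triggers % loss_period_s) >= loss_length_s]
    recorded = []
    busy_until = -np.inf
    for t in triggers:
        if t >= busy_until:
            recorded.append(t)
            busy_until = t + dead_time_s
    return np.array(recorded)

def poisson_dead_fraction(recorded_s, duration_s, max_gap_s):
    """Total dead-time fraction from the true rate inferred from short gaps."""
    delta_t = np.diff(recorded_s)
    core = delta_t[delta_t < max_gap_s]  # drop the long gaps left by loss windows
    true_rate_hz = 1.0 / (core.mean() - core.min())
    # Events expected without any losses: true_rate * duration.
    return 1.0 - len(recorded_s) / (true_rate_hz * duration_s)

recorded = simulate_run(15_000, 7.2e-6, 10.0, loss_period_s=1.0, loss_length_s=0.1)
estimated = poisson_dead_fraction(recorded, duration_s=10.0, max_gap_s=1e-3)
# Expectation: 90% of the time is outside the loss window, and of that a
# fraction 1 / (1 + rate * dead_time) is live.
expected = 1.0 - 0.9 / (1.0 + 15_000 * 7.2e-6)
print(f"estimated {estimated:.4f}, expected {expected:.4f}")
```

The estimate recovers both the front-end part and the injected loss window because it compares the recorded number of events with the number expected from the inferred true rate, rather than measuring busy time directly.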

At the 1st ACTL-Camera meeting in Heidelberg in 2016, the different options were discussed, and we concluded that counting the busy events was sufficient (including Jim Hinton, Felix Werner, David Berge, myself and several others). Note that there is no CTA requirement on sampling the camera busy time, but there is one on time-tagging the busy triggers (as much as possible).

A strobe which samples the "busy" time will only give one aspect of the dead-time, that which is related to the front end. The Poisson stats are better as far as I can see, as they give the dead-time of the whole chain. I don't think this is simply a matter of taste, like vi/emacs or python/c++ or whatever. If in FACT you have ns time-stamping as well as a busy strobe, then you could compare both approaches, which would be nice.

mdpunch commented 3 years ago

I see from the file what is probably at the origin of the mutual incomprehension.

There are no busy events in the file!

So, LST (and NectarCAM, I guess) doesn't conform to the requirement B-TEL-1265 Busy Triggers: "A Camera must generate and send trigger timestamps to the OES whenever possible in the Observing State, even if it cannot provide data for a given Event."

I suppose this has fallen between the cracks somehow, since the EVB only wants non-busy events, that it can build with the associated data, while the SWAT which could otherwise keep track of these busy time-stamps, doesn't yet exist in the wild (on-site).

@DirkHoffmann what is the best procedure to keep these events in future, as required?

DirkHoffmann commented 3 years ago

Haha, @mdpunch, I am just about to solve one diplomatic problem, and you create a new one already? I know, this is manipulative. You don't create the problem, you point it out. Messengers of bad news lead a dangerous life. But as what we write here is world-readable, let me make clear: I appreciate that you point it out.

@DirkHoffmann what is the best procedure to keep these events in future, as required?

This is a change request to the present 1-telescope implementation. You know what that means. :thinking: Anyway, the requirement is valid for the observatory implementation, not for a telescope implementation. And in my opinion we should try to benefit from the observatory implementation prototypes, before starting our own workaround which would a fortiori be temporary. May be a topic for tomorrow's and Friday's meeting.

mdpunch commented 3 years ago

Just to conclude on the run with "hiccups" or "hiccoughs", the results look like this:

Full fit of DeltaT for the run:

Estimates of the numbers and rates:

Estimate of the dead-time:

cta-observatory / cta-lstchain

How to get dead time from monitoring information? #619