wbenoit26 opened 4 hours ago
I'm compiling a list of bugs in our online deployment that I've found while looking at our MDC results. There have been only two so far, but I imagine others will be found.

1. The `check_refractory` function checks whether the *current* time is at least `refractory_period` seconds after the previous detection time, but it ought to compare the new detection time to the previous detection time. Otherwise, if event submission takes a while for some reason, we can have situations like the below, where we submit two events that are essentially at the same time (taken from `/home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T02:30:00.log`). A sketch of the fix follows at the end of this comment.
2. Somehow, the `reset_t0` function can reset to a time prior to the frame that failed. Coupled with this, we don't reset the snapshotter after missing a frame file. This means that it's possible for an event to be detected, a frame to be missed, `t0` reset to before the event, and the event detected again. For example, in `/home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T11:25:49.log`:

As a result of (at least) these two bugs, we have quite a few events in gracedb that are separated by less than the 8-second window that we set:
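For the first bug, here's a minimal sketch of the comparison `check_refractory` ought to make, using the new detection time rather than the current wall-clock time. The function name matches the thread, but the signature and surrounding code are assumptions, not the actual aframe implementation:

```python
def check_refractory(
    detection_time: float,
    last_detection_time: float,
    refractory_period: float = 8.0,
) -> bool:
    """Decide whether a new detection may be submitted.

    Compare the candidate's detection time, not the wall clock, against
    the previous detection time: if event submission stalls, wall-clock
    time can drift past the refractory window even though the two
    detections are essentially coincident.
    """
    return detection_time - last_detection_time >= refractory_period
```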
Great catches (and only discoverable thanks to the verbose logging). Curious about a couple of things:

- The scitoken validation seems to take quite a while in some cases. Is that expected?
- The event was at 1410705924, but we read 3 future frames before submitting. Can you explain this?

The scitoken validation is something that's done by gracedb, unfortunately. There seems to be some caching, as the subsequent validations take less time, but scitokens expire after an hour (and I'm generating a new one every half-hour to be safe), so the full validation needs to happen relatively often. In a non-MDC setting, I'd imagine it would need to happen every time.
Yeah, we need to wait `resampling_padding + whitening_padding + integration_length = 1 + 0.5 + 1` seconds after the event before it's potentially detectable, which works out to three 1-second frames (another one of the nice things about NGDD: the resampling padding can be just a 16th of a second).
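As a quick sanity check on that arithmetic (the 1-second frame length is an assumption, though it matches the three frames observed):

```python
import math

# Padding and length values quoted above, in seconds
resampling_padding = 1.0
whitening_padding = 0.5
integration_length = 1.0
frame_length = 1.0  # assumption: low-latency frame files are 1 s long

wait = resampling_padding + whitening_padding + integration_length
print(f"wait = {wait} s")                        # wait = 2.5 s
print(math.ceil(wait / frame_length), "frames")  # 3 frames
```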
Ahh got it, that adds up. It's great how easily we're able to diagnose these bottlenecks.
As I think about it some more, it's kind of strange that the scitoken validation should take so long. @mcoughlin, is this something that the low latency group is aware of? Maybe we're doing something wrong on our end? Adding 10 seconds of latency to perform authorization seems not great.
@deepchatterjeeligo ?
This 10 seconds obviously doesn't happen every time, considering we have many events with sub-10-second latency. Are there any hints about the conditions under which this added latency occurs?
Not that I can see right off the bat, but let me scrape the logs and see if there are any patterns to it.
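A minimal sketch of that scraping, assuming the log directory and line formats shown in this thread (the regexes are illustrative): it measures, per event, the gap between the "Submitting trigger" line and the first scitoken validation line.

```python
import re
from datetime import datetime
from pathlib import Path

TS = "%Y-%m-%d %H:%M:%S,%f"
submit = re.compile(r"^(\S+ \S+) - root - INFO - Submitting trigger")
validate = re.compile(r"^(\S+ \S+) - scitokens - INFO - Validating SciToken")

# Gap between deciding to submit and the first validation line, per event
gaps = []
for log in Path("/home/aframe/dev/o3_mdc/events/log").glob("deploy_*.log"):
    t_submit = None
    for line in log.read_text().splitlines():
        if m := submit.match(line):
            t_submit = datetime.strptime(m.group(1), TS)
        elif (m := validate.match(line)) and t_submit is not None:
            gaps.append((datetime.strptime(m.group(1), TS) - t_submit).total_seconds())
            t_submit = None

print(sorted(gaps)[-5:])  # the worst offenders
```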
Just as an example, for our lowest latency event, the validation steps seem to be the same, but happen much faster:
```
2024-09-13 20:32:43,121 - root - INFO - Detected event with detection statistic>=7.015
2024-09-13 20:32:43,340 - root - INFO - Event coalescence time found to be 1410319978.232 with FAR 2.240e-07 Hz
2024-09-13 20:32:43,343 - root - INFO - Submitting trigger to file event_1410319978.json
2024-09-13 20:32:43,347 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:43,709 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/ HTTP/1.1" 201 1267
2024-09-13 20:32:45,237 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:45,522 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2214279/log/ HTTP/1.1" 201 397
2024-09-13 20:32:45,700 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:45,884 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2214279/log/ HTTP/1.1" 201 414
2024-09-13 20:32:45,926 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:46,094 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2214279/log/ HTTP/1.1" 201 411
2024-09-13 20:32:46,095 - root - DEBUG - Reading frames from timestamp 1410319981
```
Ah, wait, at least some of the time here is coming from creating PE/p_astro. Those `scitokens` lines correspond, respectively, to event submission, corner plot submission, sky map submission, and p_astro submission. We need better logging in that area to know how much is coming from GDB and how much is on our end.
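One lightweight way to get that logging would be to time each GraceDB interaction on our side. This is a hypothetical wrapper, not existing aframe code; the calls in the usage comment assume `ligo-gracedb`'s `GraceDb` client, and the filenames are illustrative:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("gracedb-timing")

@contextmanager
def timed(step: str):
    """Log how long a single GraceDB interaction takes on our side."""
    start = time.monotonic()
    try:
        yield
    finally:
        logger.info("%s took %.2f s", step, time.monotonic() - start)

# Usage sketch: `client` would be a ligo.gracedb.rest.GraceDb instance.
#
# with timed("event submission"):
#     response = client.createEvent("Test", "aframe", "event_1410319978.json")
# with timed("p_astro submission"):
#     client.writeLog(graceid, "p_astro", filename="aframe.p_astro.json")
```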
Got it - probably worth doing this asynchronously now?
Definitely. Still, in the second example in the top comment, there's a 10-second gap between "Submitting event" and the first scitoken validation line for the first event, and that's not coming from us (all that occurs is here). So I think it's also worth looking at how much time authorization is costing us.
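On the asynchronous idea from the previous comment, a sketch of one way to do it: submit the event itself synchronously (it's the latency-critical step, and we need the graceid it returns), then push the corner plot, sky map, and p_astro uploads to a background worker. All names here are illustrative rather than existing aframe code:

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker keeps follow-up uploads ordered per event while the
# main loop goes straight back to reading frames.
executor = ThreadPoolExecutor(max_workers=1)

def submit_event(client, event_file, corner_plot, skymap, p_astro):
    # The event itself is latency-critical and yields the graceid,
    # so keep it synchronous.
    response = client.createEvent("Test", "aframe", event_file)
    graceid = response.json()["graceid"]

    # The PE/p_astro uploads are not latency-critical: run them off-thread.
    def followups():
        client.writeLog(graceid, "Corner plot", filename=corner_plot)
        client.writeLog(graceid, "Sky map", filename=skymap)
        client.writeLog(graceid, "p_astro", filename=p_astro)

    executor.submit(followups)
    return graceid
```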