ML4GW / aframe

Detecting binary black hole mergers in LIGO with neural networks
MIT License
6 stars 16 forks source link

Online deployment bugs #301

Open wbenoit26 opened 4 hours ago

wbenoit26 commented 4 hours ago

I'm compiling a list of bugs in our online deployment that I've found while looking at our MDC results. There's been only two so far, but I imagine others will be found.

  1. The check_refractory function checks whether the current time is at least refractory_period seconds after the previous detection time, but it ought to compare the new detection time to the previous detection time. Otherwise, if event submission takes a while for some reason, we can have situations like the below, where we submit two events that are essentially at the same time (taken from /home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T02:30:00.log)

    2024-09-17 20:30:28,736 - root - DEBUG - Reading frames from timestamp 1410665418
    2024-09-17 20:30:28,766 - root - DEBUG - Read successful
    2024-09-17 20:30:28,985 - root - INFO - Detected event with detection statistic>=3.342
    2024-09-17 20:30:29,970 - root - INFO - Event coalescence time found to be 1410665416.250 with FAR 9.457e-05 Hz
    2024-09-17 20:30:29,981 - root - INFO - Submitting trigger to file event_1410665416.json
    2024-09-17 20:30:29,984 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:30,273 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/ HTTP/1.1" 201 577
    2024-09-17 20:30:34,210 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:34,488 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2229021/log/ HTTP/1.1" 201 397
    2024-09-17 20:30:34,689 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:34,868 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2229021/log/ HTTP/1.1" 201 414
    2024-09-17 20:30:45,068 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:45,269 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2229021/log/ HTTP/1.1" 201 411
    2024-09-17 20:30:45,269 - root - DEBUG - Reading frames from timestamp 1410665419
    2024-09-17 20:30:45,291 - root - DEBUG - Read successful
    2024-09-17 20:30:45,524 - root - INFO - Detected event with detection statistic>=3.342
    2024-09-17 20:30:45,791 - root - INFO - Event coalescence time found to be 1410665416.252 with FAR 9.462e-05 Hz
    2024-09-17 20:30:45,792 - root - INFO - Submitting trigger to file event_1410665416.json
    2024-09-17 20:30:45,797 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:46,067 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/ HTTP/1.1" 201 584
    2024-09-17 20:30:50,285 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:50,570 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2229022/log/ HTTP/1.1" 201 397
    2024-09-17 20:30:50,806 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:30:50,979 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2229022/log/ HTTP/1.1" 201 414
    2024-09-17 20:31:02,815 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/328592bce2ee7ac5745c4fc89082da03?type=accessToken&ts=1726626606482&version=v2.0&lifetime=10800000
    2024-09-17 20:31:02,982 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2229022/log/ HTTP/1.1" 201 411
    2024-09-17 20:31:02,983 - root - DEBUG - Reading frames from timestamp 1410665420
    2024-09-17 20:31:03,006 - root - DEBUG - Read successful
    2024-09-17 20:31:03,167 - root - DEBUG - Reading frames from timestamp 1410665421
  2. Somehow, the reset_t0 function can reset to a time prior to the frame that failed. Coupled to this, we don't reset the snapshotter after missing a frame file. This means that it's possible for an event to be detected, a frame to be missed, t0 reset to before the event, and the event to be detected again. For example, in /home/aframe/dev/o3_mdc/events/log/deploy_2024-09-18T11:25:49.log:

    2024-09-18 07:45:23,248 - root - DEBUG - Reading frames from timestamp 1410705927
    2024-09-18 07:45:23,264 - root - DEBUG - Read successful
    2024-09-18 07:45:23,356 - root - INFO - Detected event with detection statistic>=3.081
    2024-09-18 07:45:24,997 - root - INFO - Event coalescence time found to be 1410705924.787 with FAR 1.381e-04 Hz
    2024-09-18 07:45:25,010 - root - INFO - Submitting trigger to file event_1410705924.json
    2024-09-18 07:45:35,069 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:45:35,070 - urllib3.connectionpool - DEBUG - Resetting dropped connection: gracedb-playground.ligo.org
    2024-09-18 07:45:35,512 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/ HTTP/1.1" 201 583
    2024-09-18 07:45:37,525 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:45:37,803 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2230630/log/ HTTP/1.1" 201 397
    2024-09-18 07:45:37,997 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:45:38,187 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2230630/log/ HTTP/1.1" 201 414
    2024-09-18 07:45:46,703 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:45:46,888 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2230630/log/ HTTP/1.1" 201 411
    2024-09-18 07:45:46,888 - root - DEBUG - Reading frames from timestamp 1410705928
    2024-09-18 07:45:46,908 - root - DEBUG - Read successful
    2024-09-18 07:45:46,985 - root - DEBUG - Reading frames from timestamp 1410705929
    2024-09-18 07:45:47,002 - root - DEBUG - Read successful
    2024-09-18 07:45:47,086 - root - DEBUG - Reading frames from timestamp 1410705930
    2024-09-18 07:45:57,096 - root - WARNING - Couldn't find frame file /dev/shm/kafka/L1_O3ReplayMDC/L-L1_O3ReplayMDC_llhoft-1410705930-1.gwf after 10s
    2024-09-18 07:45:57,097 - root - WARNING - Missing frame files after timestep 1410705930, resetting states
    2024-09-18 07:45:57,106 - root - INFO - Resetting timestamp to 1410705924
    2024-09-18 07:45:57,106 - root - DEBUG - Reading frames from timestamp 1410705924
    2024-09-18 07:45:57,125 - root - DEBUG - Read successful
    2024-09-18 07:45:57,125 - root - DEBUG - Reading frames from timestamp 1410705925
    2024-09-18 07:45:57,142 - root - DEBUG - Read successful
    2024-09-18 07:45:57,143 - root - DEBUG - Reading frames from timestamp 1410705926
    2024-09-18 07:45:57,160 - root - DEBUG - Read successful
    2024-09-18 07:45:57,253 - root - DEBUG - Reading frames from timestamp 1410705927
    2024-09-18 07:45:57,270 - root - DEBUG - Read successful
    2024-09-18 07:45:57,338 - root - INFO - Detected event with detection statistic>=3.170
    2024-09-18 07:45:58,263 - root - INFO - Event coalescence time found to be 1410705924.787 with FAR 1.215e-04 Hz
    2024-09-18 07:45:58,265 - root - INFO - Submitting trigger to file event_1410705924.json
    2024-09-18 07:45:58,268 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:45:58,610 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/ HTTP/1.1" 201 585
    2024-09-18 07:46:00,586 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:46:00,871 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2230631/log/ HTTP/1.1" 201 397
    2024-09-18 07:46:01,059 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:46:01,283 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2230631/log/ HTTP/1.1" 201 414
    2024-09-18 07:46:10,617 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/f3f4a63ecccd6b73b9d7410279ffa5?type=accessToken&ts=1726669802473&version=v2.0&lifetime=10800000
    2024-09-18 07:46:10,795 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2230631/log/ HTTP/1.1" 201 411
    2024-09-18 07:46:10,796 - root - DEBUG - Reading frames from timestamp 1410705928
wbenoit26 commented 4 hours ago

As a result of (at least) these two bugs, we have quite a few events in gracedb that are separated by less than the 8 second window that we set: image

EthanMarx commented 4 hours ago

Great catches (and only discoverable thanks to the verbose logging).

Curious about a couple of things:

  1. Are we doing that Sci-token validation every gdb upload, or is that something done by gracedb? seems to take a significant amount of time
  2. In that second output, it looks like we detect an event in frame 1410705924, but read 3 future frames before submitting. Can you explain this?
wbenoit26 commented 4 hours ago

The scitoken validation is something that's done by gracedb, unfortunately. There seems to be some caching, as the subsequent validations take less time, but scitokens expire after an hour (and I'm generating a new one every half-hour to be safe), so the full validation needs to happen relatively often. In a non-MDC setting, I'd imagine it would need to happen every time.

Yeah, we need to wait resampling_padding + whitening_padding + integration_length = 1 + 0.5 + 1 seconds after the event before it's potentially detectable (another one of the nice things about NGDD - the resampling padding can be just a 16th of a second).

EthanMarx commented 4 hours ago

Ahh got it, adds up. It's great how easy we are able to diagnose these bottlenecks

wbenoit26 commented 4 hours ago

As I think about it some more, it's kind of strange that the scitoken validation should take so long. @mcoughlin, is this something that the low latency group is aware of? Maybe we're doing something wrong on our end? Adding 10 seconds of latency to perform authorization seems not great.

mcoughlin commented 4 hours ago

@deepchatterjeeligo ?

EthanMarx commented 4 hours ago

This 10 seconds obviously doesnt happen every time considering we have many events with sub-10 latency.

Are there any hints under which conditions this added latency occurs?

wbenoit26 commented 3 hours ago

Not that I can see right off the bat, but let me scrape the logs and see if there are any patterns to it.

wbenoit26 commented 3 hours ago

Just as an example, for our lowest latency event, the validation steps seem to be the same, but happen much faster:

2024-09-13 20:32:43,121 - root - INFO - Detected event with detection statistic>=7.015
2024-09-13 20:32:43,340 - root - INFO - Event coalescence time found to be 1410319978.232 with FAR 2.240e-07 Hz
2024-09-13 20:32:43,343 - root - INFO - Submitting trigger to file event_1410319978.json
2024-09-13 20:32:43,347 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:43,709 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/ HTTP/1.1" 201 1267
2024-09-13 20:32:45,237 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:45,522 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2214279/log/ HTTP/1.1" 201 397
2024-09-13 20:32:45,700 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:45,884 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2214279/log/ HTTP/1.1" 201 414
2024-09-13 20:32:45,926 - scitokens - INFO - Validating SciToken with jti: https://cilogon.org/oauth2/3a4fa254fbbabd4069f11ae7d2f5d142?type=accessToken&ts=1726284601735&version=v2.0&lifetime=10800000
2024-09-13 20:32:46,094 - urllib3.connectionpool - DEBUG - https://gracedb-playground.ligo.org:443 "POST /api/events/G2214279/log/ HTTP/1.1" 201 411
2024-09-13 20:32:46,095 - root - DEBUG - Reading frames from timestamp 1410319981
wbenoit26 commented 3 hours ago

Ah, wait, at least some of the time here is coming from creating PE/p_astro. Those scitokens lines correspond, respectively, to event submission, corner plot submission, sky map submission, and p_astro submission. We need better logging in that area to know how much is coming from GDB and how much is on our end.

EthanMarx commented 3 hours ago

Got it - probably worth doing this asynchronously now?

wbenoit26 commented 3 hours ago

Definitely. Still, in the second example in the top comment, there's a 10-second gap between between "Submitting event" and the first scitoken validation line for the first event, and that's not coming from us (all that occurs is here). So I think also worth looking at how much time authorization is costing us.