ML4GW / aframe

Detecting binary black hole mergers in LIGO with neural networks

Comparison point for our offline throughput #442

Closed · alecgunny closed this 4 months ago

alecgunny commented 10 months ago

Eliu's paper from 2017 reports that they "process the entire month of August 2017 with our deep learning ensemble in just 7 min" using 64 V100s. We should figure out exactly how much active data that corresponds to and use it to estimate their throughput per GPU (in units of seconds of data per second of wall time), then compare this to our own throughput.

We can start by analyzing the server stats from our runs on 1-year and 2-month datasets (which have different client:GPU ratios) and see which one produces better throughput.
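As a quick sanity check, here's a back-of-envelope upper bound that treats all 31 days of August 2017 as active data (a sketch only; the real figure needs the actual science segments):

month = 31 * 24 * 3600            # seconds in August 2017, ~2.68e6 s
per_gpu = month / (7 * 60) / 64   # s of data per s of wall time, per GPU
print(f"~{per_gpu:.0f} s/s")      # ~100 s/s upper bound; duty cycle lowers this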

wbenoit26 commented 10 months ago

I get a throughput of about 56 s/s via the following:

from mldatafind.segments import query_segments

# H1/L1 open-data segments: GPS 1185580818 (Aug 1) to 1187733618 (Aug 25, 22:00 UTC)
segments = query_segments(["H1_DATA", "L1_DATA"], 1185580818, 1187733618)
# total live seconds / 7 minutes of runtime / 64 GPUs -> s of data per s, per GPU
throughput = sum(j - i for (i, j) in segments) / (7 * 60) / 64

Note that 1187733618 corresponds to 10 PM UTC on August 25th, because I think that's when O2 stopped. At least, it's when GWOSC stops having open data.
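For what it's worth, that GPS-to-UTC conversion is easy to double-check with gwpy (assuming it's available in the environment):

from gwpy.time import tconvert
print(tconvert(1187733618))  # 2017-08-25 22:00:00 UTC, i.e. the end of O2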

EthanMarx commented 10 months ago

Wow, orders of magnitude faster...

Also going to note here that it might be useful to report a total compute budget (and maybe a conversion to cost in $) for operating online, taking into account the amount of background we want to analyze on the fly, etc.
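A minimal sketch of what that budget could look like; every number below is a placeholder assumption, not a measured value:

throughput = 56          # s of data per s of wall time, per GPU (from above)
background_rate = 1000   # ASSUMED: seconds of background to analyze per second, online
gpu_cost_per_hr = 2.50   # ASSUMED: $/hr for a V100-class cloud GPU

n_gpus = background_rate / throughput                  # GPUs needed to keep up
monthly_cost = n_gpus * gpu_cost_per_hr * 24 * 30      # ~$32,000/month at these rates
print(f"{n_gpus:.1f} GPUs, ~${monthly_cost:,.0f}/month")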

High-level overview: