ML4GW / aframe-legacy

Detecting binary black hole mergers in LIGO with neural networks

Data versioning #316

Open alecgunny opened 1 year ago

alecgunny commented 1 year ago

Given how complicated our data picture is becoming, it's probably worth being more formal about how we track data alongside the versions of the repo that created it. As far as I can tell, we essentially have five groups of data artifacts.

There's an interesting tool out there called dvc which is built for exactly this sort of thing (and I'm sure there are several others). There's a basic tutorial here, and a Python API that we could integrate into our code.
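As a rough sketch of what integrating that Python API could look like, reading a tracked artifact pinned to a specific revision might go something like this (the file path, repo URL, revision, and dataset key below are all placeholders, not anything we actually have):

```python
import io

import h5py

import dvc.api

# Read an HDF5 artifact tracked by DVC, pinned to a specific git revision.
# The path, repo URL, revision, and dataset key are all placeholders.
data = dvc.api.read(
    "data/train/background.h5",
    repo="https://github.com/ML4GW/aframe-legacy",
    rev="some-tag-or-commit",
    mode="rb",
)

# h5py can open the downloaded bytes as an in-memory file
with h5py.File(io.BytesIO(data), "r") as f:
    background = f["background"][:]
```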

I'm still figuring out how these tools work, but I think a potential setup could look like this:

If something like this works, we can even think about tracking experiment artifacts with it as well, but that's obviously further down the line.

wbenoit26 commented 1 year ago

I like the idea of having a shared cache of data, and in general of being more conscious of how the PRs we merge affect our model's performance, maybe adopting a practice of doing a full pipeline run after any major change (however we define "major").

I think it makes sense to start out manually, figure out what our categories are and how we want to structure things, and then move to a more automated solution. After seeing our most recent results, I'm worried about changing things too quickly. Our datasets aren't yet so complicated that tracking them manually would be difficult. As far as I can recall, we have the following:

- Background:
- Glitches:
- Waveforms:
- Testing background:
- Testing foreground:

alecgunny commented 1 year ago

Oh yeah, this is not an immediate concern by any means, just something I was thinking about and wanted to jot down so I could close the relevant Chrome tabs. Your list sounds about right to me; that's actually a handy thing to have for the paper when we discuss how we arrived at the data we ended up using.

As a conceptual matter, I think something like this is pretty clean: right now you could in principle run the non-data-generating parts of the pipeline on outdated data thanks to our caching scheme, and there are obvious risks there. But at our current operating scale, that's something we can be responsible for keeping track of ourselves.
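To illustrate the kind of check that could guard against running on stale cached data, here's a minimal sketch of a staleness guard; the manifest layout and helper names are hypothetical, and a tool like dvc would handle this bookkeeping for us via the checksums it records:

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 checksum of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def validate_cache(manifest: Path) -> None:
    """Compare cached artifacts against checksums recorded in a JSON manifest."""
    expected = json.loads(manifest.read_text())
    for fname, checksum in expected.items():
        path = manifest.parent / fname
        if not path.exists() or file_md5(path) != checksum:
            raise RuntimeError(
                f"Cached artifact {fname} is missing or stale; "
                "regenerate it before running downstream stages"
            )
```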