Data versioning - Githubissues

alecgunny commented 1 year ago

Given how complicated our data picture is becoming, it's probably worth thinking about being more formal about how we track data alongside versions of the repo that created it. As far as I can tell, we essentially have 5 groups of data artifacts

Training background segment
Training glitches
Training waveforms (separate because we don't do SNR rejection)
Testing background segments
Testing waveforms

There's an interesting tool out there call dvc which is built for something like this (and I'm sure there are several others). There's an interesting basic tutorial here, and there's a Python API that we could integrate into our code.

I'm still figuring out how these tools work, but I think potential set up could look like:

On each cluster, created a shared data cache that the code points to, maybe via environment variables (ideally we would have one cache shared between all clusters but not sure how feasible this is)
The datagen project is moved out of sandbox and into its own pipeline that gets run as part of CI/CD, and that data gets pushed to the relevant caches. This is considered the "production" data. We can start seeding the data generation so that e.g. sampled waveforms won't change if the code for generating them hasn't changed.
At experiment run time, we perform some check to ensure that the data being loaded is consistent with the current version of the repo
Users can host their own local caches for experimenting with new data generation mechanisms

If something like this works, we can even think about tracking experiment artifacts using this as well, but that's obviously for farther down the line.

wbenoit26 commented 1 year ago

I like the idea of having a shared cache of data, and in general being more conscious of how the PRs we do update our model's performance, maybe adopting a practice of doing a full pipeline run after any major change (for however we define "major").

I think it makes sense to start out manually, figure out what our categories are and how we want to structure things, and then move to a more automated solution. After seeing our most recent results, I'm worried about changing things too quickly. Our datasets aren't so complicated yet that it would be difficult. As far as I can recall, we have the following:

Background:

Original background time segment
Current background time segment

Glitches:

Glitches from original background
Glitches from current background
Glitches from current background with timestamps

Waveforms:

Non-spin BBH prior
Incorrect end O3 rates and pops prior
Corrected end O3 rates and pops prior

Testing background:

Original test background
Current 3-day test background

Testing foreground:

Original background, non-spin BBH
Current background, non-spin BBH
Current background, incorrect end O3 rates and pops
Current background, incorrect end O3 rates and pops, with SNR rejection <- Not sure that this ever actually existed
Current background, corrected end O3 rates and pops, with SNR rejection

alecgunny commented 1 year ago

Oh yeah this is not an immediate concern by any means, just something I was thinking about and wanted to jot down so I could close the relevant Chrome tabs. Your list sounds about right to me, that's actually a handy thing to have for the paper when discussing how we arrived at the data we arrived at.

I think as a conceptual matter, something like this is kind of clean, because right now you could in principle run the non-data generating parts of the pipeline on outdated data thanks to our caching scheme, and there's obviously risks involved there. But given our current operating scale that's something we can be responsible for keeping track of ourselves.

ML4GW / aframe

Data versioning #316