c3-time-domain / SeeChange

A time-domain data reduction pipeline (e.g., for turning images into lightcurves) for surveys like DECam and LS4
BSD 3-Clause "New" or "Revised" License

Measurements and Cutouts #292

Closed guynir42 closed 3 weeks ago

guynir42 commented 1 month ago

Here's a summary of the debate around measurements going into the cutouts process/file:

Problem: we want to keep measurements around, even if they failed the cuts, so we can examine them and figure out if our cuts make sense. But that would mean bloating the measurements table, which is likely to be the largest table in the database.

Option 1: Cutouts will have a single file, single row in the cutouts table, and will be made in a separate process from measurements. We save all cutouts. Then we only save measurements that have passed the cuts. Advantage: we can make one cutouts file and many versions of the measurements. Disadvantage: we don't get to look at the measurements that failed the cuts.

Option 2: Cutouts will have a single file, single row in the cutouts table, but will be generated in the same process as the measurements (i.e., with the same provenance). We still save only the passing measurements to the DB, but will put a copy of the results in the cutouts file, such that all cutouts and all measurements get saved to disk, regardless of cuts. Advantage: we can look at the bad measurements inside the cutouts file. Disadvantage: we must make new cutouts files each time we change a parameter on the measurements. We would have to do that anyway when we change code or parameters upstream of cutouts (e.g., in subtraction).

I have thought about this a little and I have a third proposal that I think answers all our needs:

Option 3: Do option 1, but add a secondary threshold dictionary. That one decides which measurements are deleted, and can be set to very loose thresholds if you want to see "everything". The regular threshold will no longer delete measurements, but will instead mark them as "bad". You'll have to mark the associated Objects as bad, too. We can decide to keep all the "bad" objects separate from the good ones, which gives us the advantage that we can check whether we systematically have "bad" objects with multiple measurements over time (that would be suspicious).

Then, we have an easy way to look at the bad measurements and periodically clean them up from our database. This gives the sysadmin more control over the bloat of that table, and makes it easier to find the "bad" measurements and inspect a subsample to check that our thresholds are not too strict.

If, at some point, you wish to change this policy and go back to deleting all the bad measurements, you'd just have to adjust the deletion threshold to match the regular one (we can decide that if the deletion threshold is None, we use the regular threshold for both cuts).
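To make the two-tier cut concrete, here is a minimal sketch of how it could work, including the None fallback; the metric names, cut values, and function here are made up for illustration and are not SeeChange's actual API:

```python
# Hypothetical cut dictionaries: the regular cut marks measurements "bad",
# the (much looser) deletion cut decides what never gets saved at all.
REGULAR_THRESHOLDS = {"psf_width": 2.0, "negative_fraction": 0.3}
DELETION_THRESHOLDS = {"psf_width": 5.0, "negative_fraction": 0.8}


def apply_cuts(scores, regular, deletion=None):
    """Classify one measurement as 'good', 'bad' (kept but flagged), or 'delete'.

    `scores` maps each cut metric to its measured value (hypothetical names).
    If `deletion` is None, the regular thresholds are used for both cuts,
    recovering the old behavior of deleting everything that fails.
    """
    if deletion is None:
        deletion = regular
    if any(scores[k] > v for k, v in deletion.items()):
        return "delete"  # fails even the loose cut: never saved to the DB
    if any(scores[k] > v for k, v in regular.items()):
        return "bad"     # saved, but flagged (and its associated Object flagged too)
    return "good"


# e.g., a marginal detection that fails the regular cut but not the loose one:
print(apply_cuts({"psf_width": 3.1, "negative_fraction": 0.1},
                 REGULAR_THRESHOLDS, DELETION_THRESHOLDS))  # -> "bad"
```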

We can even make the deletion threshold a non-critical parameter, so changing it would not alter the provenance of the "good" measurements, which would continue to be saved regardless of the save status of the "bad" ones.
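To illustrate why that works: if the provenance hash is computed only over the critical parameters, the deletion threshold can change (or be disabled) without touching the provenance of the good measurements. A toy sketch, with made-up names rather than SeeChange's actual Provenance code:

```python
import hashlib
import json


def provenance_hash(parameters, critical_keys):
    """Hash only the critical parameters, so changing a non-critical one
    (like the deletion threshold) does not create a new provenance."""
    critical = {k: v for k, v in parameters.items() if k in critical_keys}
    blob = json.dumps(critical, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()


params = {"threshold": {"psf_width": 2.0}, "deletion_threshold": {"psf_width": 5.0}}
critical = {"threshold"}  # deletion_threshold deliberately left non-critical

h1 = provenance_hash(params, critical)
params["deletion_threshold"] = None  # loosen/disable the deletion cut
h2 = provenance_hash(params, critical)
assert h1 == h2  # same provenance: the "good" measurements are unaffected
```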

whohensee commented 1 month ago

Nice, this is a great summary. I will look into the relevant part of the codebase to prepare and get an idea of what implementing it will involve.