c3-time-domain / SeeChange

A time-domain data reduction pipeline (e.g., for handling images->lightcurves) for surveys like DECam and LS4
BSD 3-Clause "New" or "Revised" License

Don't save cutouts whose measurements don't pass #272

Closed rknop closed 4 weeks ago

rknop commented 1 month ago

Right now, we're saving all cutouts to the database, but only saving measurements that pass various analytic cuts.

We should only save cutouts when the measurements pass.

We might want to unify the measurements and cutouts tables, because there's a 1:1 relationship between the two. Perhaps we could add the CutoutsFile table (a FileOnDiskMixin) that we've talked about, which may or may not include cutouts that aren't in the measurements table.

guynir42 commented 1 month ago

I can see the advantages in making this change, especially in terms of simplifying the pipeline. I do see some advantages to the way we do things now, though. We may want to try lots of different measuring algorithms, and if we decide to keep them, then with a unified table we would need a copy of the cutouts file for each version. If we keep them separate, we can have one file and lots of database rows for measurements with different provenances. Also, keeping all cutouts on file while deleting the database entries for those that didn't pass the quality cuts is not so simple: we'd have to make sure all the database information is saved in the file in case we want to recreate them. Finally, the cutouts table doesn't have a lot of columns, so I'm not sure it will be such a big deal to leave all cutouts in the DB.

That said, we may just split it a bit differently. Have one provenance step for the Cutouts (or CutoutsFile) that saves all the cutouts and takes up one row in the DB as a FileOnDiskMixin. Then have another step for Measurements (or call it Cutouts?) that saves only the ones that pass, but includes what was saved in individual cutouts before as part of the measurements row (e.g., the coordinates in the image, and lazy-loaded small images for ref/new/sub). This would not be very different from what we have now, but maybe makes more sense with the CutoutsFile.

rknop commented 1 month ago

I like the latter solution. A CutoutsFile table in the database that points to the HDF5 file, and the HDF5 file has all the cutouts (with the same indexes as in the source_list), plus all the metadata (at least x and y; not sure if it needs much else).

Then, a Measurements table with rows only for things that pass the preliminary cuts. Includes a pointer to the relevant CutoutsFile and to the index needed to find the actual cutout (just like Cutouts does right now). And, yeah, include all the information that's currently in the Cutouts table.
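To make the proposed split concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not SeeChange's actual SQLAlchemy model code: the table names, column names, the file name, the provenance strings, and the `score` field used as a stand-in for the preliminary cuts are all hypothetical. It only illustrates the idea of one file row per set of cutouts, plus measurement rows only for detections that pass, each pointing back to the file and to its index in the source list.

```python
import sqlite3

# Hypothetical schema: one row per cutouts file, and measurement rows
# only for detections that pass the cuts.
SCHEMA = """
CREATE TABLE cutouts_file (
    id INTEGER PRIMARY KEY,
    filepath TEXT NOT NULL UNIQUE,
    provenance TEXT NOT NULL
);
CREATE TABLE measurements (
    id INTEGER PRIMARY KEY,
    cutouts_file_id INTEGER NOT NULL REFERENCES cutouts_file(id),
    index_in_sources INTEGER NOT NULL,
    x REAL NOT NULL,
    y REAL NOT NULL,
    provenance TEXT NOT NULL,
    UNIQUE (cutouts_file_id, index_in_sources, provenance)
);
"""


def add_cutouts_file(conn, filepath, provenance):
    """Record the single file that holds every cutout; return its row id."""
    cur = conn.execute(
        "INSERT INTO cutouts_file (filepath, provenance) VALUES (?, ?)",
        (filepath, provenance),
    )
    return cur.lastrowid


def add_measurements(conn, file_id, detections, passes_cuts, provenance):
    """Insert rows only for detections that pass; return how many were saved."""
    rows = [
        (file_id, i, d["x"], d["y"], provenance)
        for i, d in enumerate(detections)
        if passes_cuts(d)
    ]
    conn.executemany(
        "INSERT INTO measurements "
        "(cutouts_file_id, index_in_sources, x, y, provenance) "
        "VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Made-up detections; "score" stands in for whatever the real cuts use.
detections = [
    {"x": 10.0, "y": 20.0, "score": 0.9},
    {"x": 5.0, "y": 7.5, "score": 0.1},
    {"x": 100.0, "y": 3.0, "score": 0.7},
]

file_id = add_cutouts_file(conn, "cutouts_chip01.h5", "cutouts_v1")
# Two measurement provenances share the same cutouts file.
n_a = add_measurements(conn, file_id, detections, lambda d: d["score"] > 0.5, "meas_v1")
n_b = add_measurements(conn, file_id, detections, lambda d: d["score"] > 0.8, "meas_v2")
```

In this sketch, trying a second measuring algorithm adds measurement rows but no second file, which is the property raised earlier in the thread about keeping the file and the measurements separate.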

rknop commented 1 month ago

Another note: in the test exposure I'm working on for the stress test, I have 46 (out of 60) chips successfully completing a subtraction at the moment. (I still have to dig to figure out why there are failures.) From the zogy subtraction and detection, there are 3794 sources, 89 of which pass the preliminary cuts. Even though the cutouts table may not have a lot of columns, it will have a lot of rows. The number of measurements that pass the preliminary cuts is going to be a couple of orders of magnitude more than the number of exposures; the ones that don't pass the preliminary cuts are more than an order of magnitude more than that. Not saving all of that to the database will, as our number of exposures mounts, help performance a lot.
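As a quick check of the numbers quoted above, the arithmetic can be worked through in a few lines of Python. The only inputs are the two counts from this one test exposure (3794 detected sources, 89 passing); the variable names are just for illustration, and other exposures may behave differently.

```python
# Counts quoted in the comment above, for a single test exposure.
n_detected = 3794
n_passed = 89

# Rows that would no longer be written to the database.
n_rejected = n_detected - n_passed

# Fraction of detections that survive the preliminary cuts.
pass_fraction = n_passed / n_detected

# How many rejected detections there are for each one that passes.
rejected_per_passed = n_rejected / n_passed

print(f"rejected: {n_rejected}")
print(f"pass fraction: {pass_fraction:.1%}")
print(f"rejected per passed: {rejected_per_passed:.1f}")
```

For this exposure, only a few percent of detections pass, so the rejected detections outnumber the passing ones by a factor of about 40, consistent with the "more than an order of magnitude" estimate.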

guynir42 commented 1 month ago

Related to #217