c3-time-domain / SeeChange

A time-domain data reduction pipeline (e.g., for handling images->lightcurves) for surveys like DECam and LS4
BSD 3-Clause "New" or "Revised" License

Reference sets #287

Closed by guynir42 4 months ago

guynir42 commented 6 months ago

BACKGROUND: I'm working on the Reports PR, and to do that we need to have the provenances of the entire pipeline run, calculated up front. I think this is a good idea for many reasons.

PROBLEM: to get the provenance of the subtraction we need to know the provenance of the reference (more specifically, the provenances of the reference image and all its products, but you can get one from the other). But many references could match the exposure/image you are working on, potentially with different provenances. You can choose the one with the most overlap (and within the validity range), but that can still give you different provenances depending on exactly which image and references are available.

SOLUTION: I think the thing that we should go by is this: policy cannot be determined by data. That means the provenances of a pipeline run should only be determined by the parameters in the config (and the code version) and not be dependent on availability of data.
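To make the "config and code version only" rule concrete, here is a minimal sketch. It assumes a provenance ID is a hash of the process name, code version, parameters, and upstream IDs; the helper names (`provenance_id`, `pipeline_provenances`), the config keys, and the ID strings are all hypothetical and not taken from the SeeChange code.

```python
import hashlib
import json


def provenance_id(process, code_version, parameters, upstream_ids=()):
    """Hypothetical helper: derive a deterministic ID from policy inputs only."""
    payload = {
        "process": process,
        "code_version": code_version,
        "parameters": parameters,
        "upstreams": sorted(upstream_ids),
    }
    # sort_keys makes the serialization independent of dict ordering
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:20]


def pipeline_provenances(config, code_version, reference_prov_id):
    """Compute every step's provenance up front, without touching any data."""
    ids = {}
    ids["preprocessing"] = provenance_id(
        "preprocessing", code_version, config["preprocessing"]
    )
    ids["extraction"] = provenance_id(
        "extraction", code_version, config["extraction"], [ids["preprocessing"]]
    )
    # The subtraction depends on the reference provenance, which must be
    # known from the config alone (not from whichever reference is found).
    ids["subtraction"] = provenance_id(
        "subtraction",
        code_version,
        config["subtraction"],
        [ids["extraction"], reference_prov_id],
    )
    return ids


# Illustrative config values only
config = {
    "preprocessing": {"steps": ["overscan", "flat"]},
    "extraction": {"threshold": 3.0},
    "subtraction": {"refset": "standard"},
}
provs = pipeline_provenances(config, "v0.1.0", "ref-prov-standard")
```

Nothing in this calculation looks at images or the database contents, so two runs with the same config and code version always get the same provenances.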

SPECIFICS: to get this to work, I suggest we add a RefSet table that has a name, a validity start/end time, and a single reference provenance ID. You'll have something like "standard", "commissioning", and maybe "aofno1inf1i3if3n" if you make a custom reference with a randomly generated name.
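A rough sketch of what such a table and its lookup could look like, using plain dataclasses rather than a real database model. The field names (`start_time`, `end_time`, `provenance_id`), the `find_refset` helper, and the example IDs are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RefSet:
    """Hypothetical row of the proposed RefSet table."""
    name: str
    start_time: Optional[datetime]  # None means open-ended
    end_time: Optional[datetime]    # None means open-ended
    provenance_id: str              # the single reference provenance


def find_refset(refsets, name, obs_time):
    """Return the one RefSet with this name whose validity covers obs_time."""
    matches = [
        r for r in refsets
        if r.name == name
        and (r.start_time is None or r.start_time <= obs_time)
        and (r.end_time is None or obs_time < r.end_time)
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one RefSet {name!r} at {obs_time}, found {len(matches)}"
        )
    return matches[0]


# Illustrative entries: the same name can appear with different validity ranges
refsets = [
    RefSet("commissioning", datetime(2023, 1, 1), datetime(2024, 1, 1), "prov-comm"),
    RefSet("standard", datetime(2024, 1, 1), datetime(2025, 1, 1), "prov-std-2024"),
    RefSet("standard", datetime(2025, 1, 1), None, "prov-std-2025"),
]
chosen = find_refset(refsets, "standard", datetime(2024, 6, 1))
```

The lookup only needs the RefSet name from the config and the observation time, so the reference provenance is known before any reference image is searched for.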

When we make the first set of references they will all have the same parameters for the preprocessing, extraction, etc. They will all have the same instrument so the exposure provenances will be the same. This information is encoded in the upstreams of the reference provenance. We'll tag this provenance ID as the "commissioning" RefSet. Later, once we have a good number of images at each place on the sky, we will create deeper, "official" references and tag them with a new RefSet called "standard" or something.

It makes sense that the provenance parameters of the Reference will include some information about the policy of how we made the RefSet. At minimum they will contain the name of the RefSet. That way the reference provenances will be different for commissioning and for standard. We can also have multiple RefSet entries with the same name and different validity ranges, so we can change policy, add more images, or whatever we want, e.g., for each year of the survey.
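As a small illustration of why putting the RefSet name into the parameters is enough to separate the provenances, here is a sketch assuming the provenance ID is a hash of its inputs. The function `reference_provenance_id` and the policy keys are hypothetical.

```python
import hashlib
import json


def reference_provenance_id(refset_name, code_version, policy):
    """Hypothetical: hash the reference-making policy plus the RefSet name."""
    parameters = dict(policy, refset_name=refset_name)
    payload = {
        "process": "referencing",
        "code_version": code_version,
        "parameters": parameters,
    }
    blob = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:20]


# Same policy values, different RefSet names (illustrative values only)
policy = {"min_images": 5, "stacking": "mean"}
commissioning_id = reference_provenance_id("commissioning", "v0.1.0", policy)
standard_id = reference_provenance_id("standard", "v0.1.0", policy)
```

Even with identical policy values the two IDs differ, because the name itself is part of the hashed parameters.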

The downside is that if we are halfway through making references and decide to change some parameter or the code version in a way that changes the provenance, we will have to mark the RefSet as invalid (e.g., by setting some boolean or just by giving it an impossible validity range), and recreate the entire reference set and all the data that used it.
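The two ways of invalidating a RefSet mentioned above could be sketched like this. The `is_bad` flag, the `covers` check, and the two helper functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass(frozen=True)
class RefSet:
    """Hypothetical RefSet row with an explicit invalidation flag."""
    name: str
    start_time: datetime
    end_time: datetime
    provenance_id: str
    is_bad: bool = False


def covers(refset, obs_time):
    """True if the RefSet is usable for an observation at obs_time."""
    return (not refset.is_bad) and refset.start_time <= obs_time < refset.end_time


def invalidate_by_flag(refset):
    """Option 1: set a boolean marking the RefSet as invalid."""
    return replace(refset, is_bad=True)


def invalidate_by_range(refset):
    """Option 2: collapse the validity range so that no time can match."""
    return replace(refset, end_time=refset.start_time)


standard = RefSet("standard", datetime(2024, 1, 1), datetime(2025, 1, 1), "prov-std")
obs_time = datetime(2024, 6, 1)
```

The flag keeps the original validity range on record, which may be useful when working out which downstream data need to be recreated; the empty range needs no extra column.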

This is a bit alarming, but it is what we want: to have a policy for making references that doesn't change and that all images taken within some time range (e.g., a year) will all use exactly the same references.

NEXT STEPS: as of right now I am using the latest reference provenance for everything, but this is a placeholder until we get this sorted out. I don't want to add this to the current Reports PR. This is still open for discussion so let me know what you think.