c3-time-domain / SeeChange

A time-domain data reduction pipeline (e.g., for handling images->lightcurves) for surveys like DECam and LS4
BSD 3-Clause "New" or "Revised" License

Worries about exposure provenances #310

Open rknop opened 2 weeks ago

rknop commented 2 weeks ago

Some thought is required about exposure provenances.

First, in Instrument.find_origin_exposures, there's a "skip_known_exposures" keyword. Right now, that skips any known exposures, but we ought to be able to specify the provenances that we want to search on. (This requires the provenance tagging system we've talked about to be in place, so that we have an easy, text-based way of specifying things like "current" or "dr1" or whatever provenance.)
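A rough sketch of what provenance-aware skipping could look like. This is not the existing find_origin_exposures signature: `KnownExposure`, `filter_new_exposures`, the `tags` mapping, and the file names are all hypothetical, made up for illustration. The idea is that a tag such as "current" resolves to a set of provenance IDs, and only exposures already known under one of those provenances get skipped:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownExposure:
    """Hypothetical record of an exposure that is already in the database."""
    origin_identifier: str  # name of the exposure on the remote server
    provenance_id: str      # provenance the exposure was loaded under


def filter_new_exposures(origin_identifiers, known, tags, skip_tag=None):
    """Return the origin identifiers that would still need to be downloaded.

    skip_tag=None skips every known exposure regardless of provenance
    (the behavior described for skip_known_exposures right now); otherwise
    only exposures known under a provenance carrying that tag are skipped.
    """
    if skip_tag is None:
        skip = {k.origin_identifier for k in known}
    else:
        wanted = tags.get(skip_tag, set())
        skip = {k.origin_identifier for k in known if k.provenance_id in wanted}
    return [ident for ident in origin_identifiers if ident not in skip]


# Made-up tags, provenance IDs, and file names.
tags = {"current": {"prov_b"}, "dr1": {"prov_a"}}
known = [
    KnownExposure("exp_0001.fits", "prov_a"),
    KnownExposure("exp_0002.fits", "prov_b"),
]
remote = ["exp_0001.fits", "exp_0002.fits", "exp_0003.fits"]

skip_all = filter_new_exposures(remote, known, tags)
skip_current = filter_new_exposures(remote, known, tags, skip_tag="current")
print(skip_all)      # ['exp_0003.fits']
print(skip_current)  # ['exp_0001.fits', 'exp_0003.fits']
```

Note that the second call would re-download exp_0001.fits just because it is stored under a different provenance, which is exactly the cost discussed next.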

However, with exposures, there are further issues. Getting exposures over the net is a relatively expensive operation, so we don't want to do it willy-nilly. Exposures are also big files, so we don't want to replicate the files in the database. Realistically, the raw exposures will hardly ever (or never) change on the remote server, so practically we'll never need to get them again. However, if we change the code_version, then all of the existing exposures in the database will no longer be for the current provenance. This means one of a few things, depending on how we implement stuff. It might mean that code we run to redo things will re-get all of the exposures from the origin source, which we should avoid. If we don't do that, but stick with the ones in the database, then new images created will not have all the same provenances, because there will be exposures in there with different provenances.
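To make the problem concrete, here is a toy illustration. It assumes a provenance ID is a hash over the process name, the code_version, and the parameters; the `provenance_id` function, the "download" process name, and the version strings are hypothetical stand-ins, not the pipeline's actual implementation:

```python
import hashlib
import json


def provenance_id(process, code_version, parameters):
    """Hypothetical provenance ID: a hash of process, code version, and parameters."""
    payload = json.dumps(
        {"process": process, "code_version": code_version, "parameters": parameters},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:20]


params = {"instrument": "DECam"}
old = provenance_id("download", "0.1.0", params)
new = provenance_id("download", "0.2.0", params)

# The raw file on the remote server is byte-for-byte the same, but the two
# IDs differ, so the stored exposure is no longer "for the current provenance".
print(old != new)
```

Under that assumption, nothing about the exposure itself has to change for it to fall out of the current provenance; the version bump alone is enough.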

Right now, the only way to avoid a mess is to never bump the code_version, but that's not really a good option, because the code_version is there for a reason. But we need to do something with exposures so that we can flag that these really are the same exposures when the code version changes without requiring us to re-download big exposure files, bloat the archive with redundant copies, and add new entries to the database.

guynir42 commented 2 weeks ago

What if we make a special code version only for exposures, and say we update it only when the telescope firmware is changed dramatically (i.e., never)?

We use the same code version because it is easy, but there's no rule against using separate code versions for different parts of the pipeline. In fact, the same problem you are raising here can happen when you change the code of the measurements and don't want that to cause a recalculation of all the images (and other products upstream of measurements). One way to do that is to call code versions by their process+version, e.g., measuring_v1.0.0.
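A minimal sketch of the per-process idea, assuming each provenance ID hashes only its own process's code version plus the IDs of its upstream provenances. Everything here is hypothetical (the `CODE_VERSIONS` table, the process names other than the measuring_v1.0.0 naming pattern, and the `provenance_id` and `pipeline_ids` helpers):

```python
import hashlib
import json

# Hypothetical table of code versions, one per process, named process+version.
CODE_VERSIONS = {
    "exposure": "exposure_v1.0.0",
    "preprocessing": "preprocessing_v1.0.0",
    "measuring": "measuring_v1.0.0",
}


def provenance_id(process, code_versions, parameters, upstream_ids=()):
    """Hash only this process's code version, its parameters, and its upstreams."""
    payload = json.dumps(
        {
            "process": process,
            "code_version": code_versions[process],
            "parameters": parameters,
            "upstreams": sorted(upstream_ids),
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:20]


def pipeline_ids(code_versions):
    """Build the chain exposure -> preprocessing -> measuring."""
    exposure = provenance_id("exposure", code_versions, {})
    image = provenance_id("preprocessing", code_versions, {}, [exposure])
    measuring = provenance_id("measuring", code_versions, {}, [image])
    return {"exposure": exposure, "preprocessing": image, "measuring": measuring}


before = pipeline_ids(CODE_VERSIONS)
after = pipeline_ids({**CODE_VERSIONS, "measuring": "measuring_v1.1.0"})

# Bumping only the measuring version leaves everything upstream untouched.
print(before["exposure"] == after["exposure"])
print(before["preprocessing"] == after["preprocessing"])
print(before["measuring"] != after["measuring"])
```

With this layout, exposures only get a new provenance when the exposure version itself is bumped, so no re-download or duplicate database entries would be needed for changes further down the pipeline.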

rknop commented 2 weeks ago

Yeah, I was musing about something along those lines. That would be a good idea -- at least for exposures. But maybe for all steps. The danger of having lots of different code versions is that it increases the burden on the user to know what to bump and when, but we already have some burden of manual bumping, so that's probably not a real problem.