rknop closed this issue 12 months ago.
I'm not sure about this. In general, we want a one-to-one correspondence between a row in the database and a file on disk. "In general" we also don't change the file on disk, because that would bring a whole mess of problems (provenance problems, among others). As you say, we can't control somebody else making changes to files, but we should generally not let that happen, and assume that a file is created when the database object is created and that's it.
Of course, we have this annoying exception of the image headers. I think this is really the only place where we want to modify the files after they are created. And it only happens in the specific places in the pipeline that we agreed on, namely the "first look" header data. After that is done, the headers are never altered, and any modifications to, e.g., the seeing FWHM are going to go into the new database objects like PSF.
So my first hunch would be that we need a method that is allowed to modify only the header of an existing image file, and that save_and_commit() should not, in general, have to re-write any files that already exist. Then, only once you finish the "first look" astrometry/photometry (etc.), do you explicitly modify the header of the image alone (I don't know if that requires a re-write of the entire file; maybe it does for the archive).
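For concreteness, here's a minimal sketch of what such a header-only update method might look like, assuming astropy FITS files; `update_image_header_on_disk`, `Image.get_fullpath()`, and the in-memory `image.header` are all placeholder names, not the real pipeline API.

```python
# Hypothetical sketch: rewrite only the header of an image file that already
# exists on local disk, leaving the pixel data alone.  Assumes astropy and an
# Image object with .get_fullpath() and .header; both are placeholders.
from astropy.io import fits

def update_image_header_on_disk(image):
    path = image.get_fullpath()                  # assumed accessor for the local FITS file
    with fits.open(path, mode='update') as hdul:
        hdul[0].header.update(image.header)      # push the "first look" WCS/zeropoint keywords
        # leaving the context flushes the change to disk without touching the data
```

One caveat that matches the question above: if the header grows past its allocated padding, astropy has to rewrite the whole file anyway, and the archive copy would still need a full re-upload.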
I would also argue that having that modified header in the archive is not critical for what we are doing. We can always fill in that info from the database. So having it written into the archive image's header at the end of photometric calibration, or somewhere else, is an added bonus: if that step fails, or if you don't reach that stage for some reason, you don't lose any critical data. Then we can live with one instance of re-sending an image (without the additional files) over to the archive at one point in the pipeline, just for the convenience of having an updated header.
What do you think?
I think there's a conflict between the design idea of the pipeline and how we need it to work in reality.
Here's how I've dealt with the issue in previous pipelines: I didn't save things to the archive until they were "done". Images did get read and written multiple times, as they were preprocessed, then astrometrically calibrated, then photometrically calibrated. But they were only written to local disk, and only after all of that was done would they get written to the archive. The lensgrinder pipeline, in fact, doesn't write anything for an image to the archive until the entire process (i.e. extract image from exposure, process image, calibrate image, subtract from reference, find and save detections on the subtraction) is done. I/O is slower than processing, and is often a bottleneck for the kinds of things we do. I/O to the archive is that much slower than local disk I/O. So, it was worth avoiding.
When we talked with Peter a couple of weeks ago, there was a strong sense that we did want the stored images to have an updated WCS and zeropoint in them. If we want that, then we need a way of telling the DataStore when it should and shouldn't re-save things, copy them to the archive, and update their md5sum.
We've got three choices:
(1) Don't view an image as a data product until it's all the way through photometric calibration. Then we can do what we do now, and only save it once. This would mean rearchitecting a bit, as we won't be able to view preprocessing, astrometric calibration, and photometric calibration as separate steps the way we do right now. This is closest to how the pipelines I've used in the past have worked.
(2) The DataStore gets a way of marking things "dirty" so it knows what it needs to save and what it doesn't. In particular, when an image header is updated due to astrometric calibration or photometric calibration, it should not re-save the weights and flags files, and it should not re-save any source lists that it has.
(3) We view the image data product as the result of preprocessing, and once it's written to the archive with an md5sum in the database, it does not change again. This means that if people get the images from somewhere, they won't have the right WCS and zeropoint in the header. We can provide a service that updates the headers and delivers the image with those things inserted (a rough sketch of such a service is just below). This is probably the cleanest solution that fits best with our design, but it does violate various standard assumptions people will make, including that the processed images they can pull out of HPSS are the images that result from all of the pipeline's processing. It's also the least simple solution, in that just getting the data products requires infrastructure beyond access to the data, and requires training people to use that infrastructure instead of just finding the files on NERSC (which is how people usually operate).
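As an illustration only, a delivery service for option (3) could look roughly like this; `get_wcs_header_for_image` and `get_zeropoint_for_image` are made-up placeholders for whatever database lookups we would actually use.

```python
# Hypothetical sketch of option (3): the archived file never changes; a small
# service splices the current WCS and zeropoint from the database into the
# header on delivery.  The two lookup helpers are placeholders, not real code.
from astropy.io import fits

def deliver_image_with_current_header(archive_path, image_id, session):
    wcs_header = get_wcs_header_for_image(image_id, session)   # assumed DB lookup
    zeropoint = get_zeropoint_for_image(image_id, session)     # assumed DB lookup

    with fits.open(archive_path) as hdul:
        header = hdul[0].header.copy()
        header.update(wcs_header)                               # replace stale WCS keywords
        header['ZP'] = (zeropoint, 'zeropoint pulled from the database')
        return fits.PrimaryHDU(data=hdul[0].data.copy(), header=header)
```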
Two things I do not want to consider at all, because they are both just too inefficient:
First, just writing the image, weights, and flags over and over again, and copying them to the archive over and over again, each time we call a "save_and_commit" on a data store that's read in images, weights, flags, and/or sources. (Indeed, solution #2 already does too much of this as it is, but if we want to save the image to the archive every time its line in the database is created or updated and we write it to the local store, then we have no choice.) This is what we're doing right now.
Second, viewing the "preprocessed image", "astrometrically calibrated image", and "photometrically calibrated image" as three different data products. That will expand how much we have to store and keep track of, and it's worse because the flags and weight images, which do not change with these processes, will get copied each time, since the image together with its weight and flags is packaged as a single data product from the database's point of view. It will mean that the archive storage is a factor of 3 bigger than it needs to be.
If we are going to view this from a practical point of view, rather than an elegance-of-design point of view, the results of the preprocessing are usually not worth saving. Preprocessing is a fast enough step that, if you already have the exposure, extracting the image and running the preprocessing may well be faster than pulling the image from the archive (if it's not on local disk). The only place where saving these images would add meaningful efficiency is if for some reason you want to start with the preprocessed image of just one or a few chips; then, the inefficiency of pulling over the entire exposure kills you. (If you want to do the entire detector, then no matter what you're going to have to pull over an exposure-sized block of things. And, indeed, the exposure is smaller than the image plus flags plus weight, because it doesn't have the latter two images, and because it's probably stored compressed in one or two different ways (bscaling, plus gzipping or fzipping).)
The photometric step is also pretty fast, once you have the calibration catalog local. The astrometric step will tend to be the slowest, because the pattern-matching of stars to line things up can take some time.
These considerations would suggest that #1 is the best solution above. But, that would mean that we'd have to redesign the pipeline a bit. We wouldn't just have lazy-loading, but lazy-calculating.
I'm suggesting we sort of adopt #2, in that we decide that the image gets saved after preprocessing (without the updated header) and then gets saved again after all the calibration is done (in that sense, when calibration is done, you can imagine it is marked as "dirty").
But that's the only place I can see that we need to do that, so I don't think we really need infrastructure for it; we just need a special command that does "update image header", and we call it once. save_and_commit() should be built in a way that it doesn't go and re-save the image at every step. If that's what is happening right now, it needs to be changed, obviously.
That way you get a pipeline that can be stopped and started at each step in the process (which is not a must-have, but it is nice), and we only have to upload the image file twice (not the additional flags and weights, if they are saved separately).
The downside that I can see (besides that one re-save) is that we might sometimes fail to finish the calibration, and then we'd have files of pre-processed images that don't have the updated header. I don't think that's too bad.
That's basically what I'm implementing right now in the astrometry branch.
DataStore.save_and_commit() was always rewriting things; I'm putting in a check that if the md5sum is there, we don't write. But then I'm adding the ability to have an image-header exception, which I will use in the astrometry (and, perhaps later, photometric calibration) code.
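To make the behavior concrete, the guard being described is roughly the following (a sketch, not the actual code on the branch; `self.products()`, `prod.md5sum`, `prod.save()`, and the `update_image_header` flag are all stand-in names):

```python
# Sketch of the md5sum guard plus the image-header exception, with invented names.
def save_and_commit(self, session, update_image_header=False):
    for prod in self.products():                       # image, weight, flags, sources, ...
        already_archived = prod.md5sum is not None
        is_header_exception = update_image_header and prod is self.image
        if already_archived and not is_header_exception:
            continue                                   # already on disk and in the archive: skip
        prod.save()                                    # write locally and push to the archive
        session.merge(prod)                            # create or update the database row
    session.commit()
```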
Good. I thought it was already implemented such that when the md5sum is not None you don't need to re-write. If it wasn't in there yet, I'm happy that you put it in!
Closing this... it was resolved by putting in the special-case code for rewriting the image header.
Right now, DataStore.save_and_commit() will do some gratuitous I/O, and gratuitous I/O is something we really want to avoid.
An example shows up in test_astro_cal.py (on my branch as of this writing, "soon" to be in a PR). It starts with a DataStore that already has the image, weight, flags, and sources on disk. It solves for the WCS. It needs to save over the image (since the header changed), but DataStore.save_and_commit() will write all four of the files.
What I'd propose we do is add additional properties (perhaps a dictionary?) to the data store that mark various data products as "dirty". If the data store reads a data product from the database (in the get_* functions), it will consider it clean. If, on the other hand, the data store's properties are set manually (e.g. in the @prop.setter functions), then it will mark them as dirty. save_and_commit() will only actually try to save the dirty ones.
There is one additional issue with this, though: if an external function modifies a data product (again, the example being the image header changing when the astrometric solution is done), it will be the responsibility of the external function to mark it dirty, since there's no way the DataStore could know about the change (without implementing a whole bunch of additional, overly-complicated accessors for the data products of a DataStore). This makes the whole thing a little more fiddly and dangerous. But I think it will be worth it in order to minimize the number of gratuitous disk writes that happen.
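A rough sketch of the bookkeeping being proposed, with illustrative names only (the real DataStore has many more products and a different loading path; `load_image` is a hypothetical helper):

```python
# Sketch of the "dirty" dictionary idea: setters mark products dirty, get_* marks
# them clean, and external code that edits a product in place (e.g. the astrometric
# solver touching the image header) must call mark_dirty() itself.
class DataStore:
    def __init__(self):
        self._image = None
        self._dirty = {}                        # product name -> needs saving?

    @property
    def image(self):
        return self._image

    @image.setter
    def image(self, value):
        self._image = value
        self._dirty['image'] = True             # set by hand, so assume it changed

    def get_image(self, session):
        if self._image is None:
            self._image = load_image(session)   # hypothetical DB/disk loader
            self._dirty['image'] = False        # freshly read from the DB: clean
        return self._image

    def mark_dirty(self, name):
        self._dirty[name] = True                # the external function's responsibility

    def save_and_commit(self, session):
        for name, dirty in self._dirty.items():
            if dirty:
                getattr(self, name).save()      # only re-write what actually changed
                self._dirty[name] = False
        session.commit()
```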