asdf-format / asdf

ASDF (Advanced Scientific Data Format) is a next generation interchange format for scientific data
http://asdf.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Unnecessary duplicate arrays when using asdf embedded in FITS files #1232

Closed perrygreenfield closed 2 years ago

perrygreenfield commented 2 years ago

JP-2900 reports a failure of calwebb_spec3 when it tries to create source-based products containing slit cutouts from multiple exposures. In one case the ASDF extension grew to larger than 4 GB, exceeding the size that astropy.io.fits can handle for a binary table entry. This revealed two problems. First, the pipeline was generating cutouts that made no sense (with vertical sizes in the hundreds of pixels), pointing to a problem with the segmentation algorithm and the lack of a sensible filter to reject cases where sources are badly confused; this led to trying to stuff many large cutouts into one file. Second, and the problem that concerns ASDF, the same cutout array appeared both in a FITS HDU and in the ASDF content contained in the ASDF HDU, meaning that the data was unnecessarily duplicated in the file.
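A quick way to gauge the duplication in a given file (this is just a sketch, not part of the report; the file name is a placeholder) is to look at the size of the ASDF extension: when the tree only references data in other HDUs, the ASDF BinTableHDU stays at a few hundred bytes of YAML, while each duplicated array adds roughly its full binary footprint.

import astropy.io.fits

# 'product.fits' is a placeholder for any pipeline product with an ASDF extension
with astropy.io.fits.open('product.fits') as hdulist:
    print(f"ASDF extension payload: {hdulist['ASDF'].size} bytes")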

braingram commented 2 years ago

Is there a minimal case for the asdf portion of this issue?

I'm looking at the jwst.datamodels save code and it appears to be using the asdf validation step to create the hdu list based on the model schema. This step mentions the MultiExposureModel which includes some custom schema modification code.

Unfortunately, if I generate a MultiExposureModel, add an Image as an exposure, and save it, I do not see duplication of the data.

perrygreenfield commented 2 years ago

Read the custom schema modification code and tell me what you think it actually does.

What was the actual type of that image?

braingram commented 2 years ago

Here's the test I ran: https://gist.github.com/braingram/1e7505fd98f66fc6e0c5b8852bdda2eb

It looks like it's taking the core schema metadata, stripping all fits_keyword and fits_hdu key/value pairs, then copying it (overlaying it) into exposures.meta of the multiexposure schema. This roughly matches the first sentence of the preceding comment, but I don't follow the second sentence.

        # Create a new core.meta that will co-locate
        # with each exposure entry. This is done
        # by saving the meta information in a separate
        # FITS HDU.
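As a rough sketch (mine, not the actual jwst code), stripping those keys might look something like this: recursively drop fits_keyword and fits_hdu entries so the copied meta no longer maps back to FITS keywords or HDUs.

def strip_fits_keys(subschema):
    # recursively remove fits_keyword / fits_hdu entries from a schema fragment
    if isinstance(subschema, dict):
        return {
            key: strip_fits_keys(value)
            for key, value in subschema.items()
            if key not in ('fits_keyword', 'fits_hdu')
        }
    if isinstance(subschema, list):
        return [strip_fits_keys(item) for item in subschema]
    return subschema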

Is the duplicated data under the meta key of the exposures or somewhere else in the file?

braingram commented 2 years ago

Here's a minimal case without jwst

import asdf
import asdf.fits_embed
import astropy.io.fits
import numpy

arr = numpy.zeros((1000, 100), dtype='f4')
# make hdu list with array from tree
hdulist = astropy.io.fits.HDUList()
hdulist.append(astropy.io.fits.PrimaryHDU())
hdu = astropy.io.fits.ImageHDU(name='SCI')
# assign data after creation to avoid a copy
hdu.data = arr
hdulist.append(hdu)

# save tree and hdu list
aif = asdf.fits_embed.AsdfInFits(hdulist, {'arr': arr})
aif.write_to('test1.fits', overwrite=True)

# read in asdf in fits
aif2 = asdf.open('test1.fits', mode='r')
aif2.write_to('test2.fits', overwrite=True)

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size   = {ff2[-1].size}")

This produces the following output:

original size = 571
resave size   = 400616

Presumably we do not want to lose links between arrays in the tree and HDU items (is this the right term?). This could be made into a test to add to test_fits_embed.
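A sketch of how that could look as a test in test_fits_embed (the test name and size threshold are my own, using pytest's tmp_path fixture):

import asdf
import asdf.fits_embed
import astropy.io.fits
import numpy

def test_resave_does_not_duplicate_hdu_arrays(tmp_path):
    arr = numpy.zeros((1000, 100), dtype='f4')
    hdulist = astropy.io.fits.HDUList([astropy.io.fits.PrimaryHDU()])
    hdu = astropy.io.fits.ImageHDU(name='SCI')
    hdu.data = arr  # assign after creation to avoid a copy, as above
    hdulist.append(hdu)

    fn1 = str(tmp_path / 'test1.fits')
    fn2 = str(tmp_path / 'test2.fits')
    asdf.fits_embed.AsdfInFits(hdulist, {'arr': arr}).write_to(fn1)
    with asdf.open(fn1, mode='r') as aif:
        aif.write_to(fn2)

    with astropy.io.fits.open(fn1) as ff1, astropy.io.fits.open(fn2) as ff2:
        # the resaved ASDF extension should not grow by the array's ~400 kB footprint
        assert ff2[-1].size < 2 * ff1[-1].size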

Per some emails with @perrygreenfield, this issue appears to be related to linking HDUs and arrays, and can be improved by changing find_or_create_block_for_array in _EmbeddedBlockManager in fits_embed and util.get_array_base, neither of which handles NDArrayType.

diff --git a/asdf/fits_embed.py b/asdf/fits_embed.py
index b436281..093b416 100644
--- a/asdf/fits_embed.py
+++ b/asdf/fits_embed.py
@@ -92,7 +92,9 @@ class _EmbeddedBlockManager(block.BlockManager):
     def find_or_create_block_for_array(self, arr, ctx):
         from .tags.core import ndarray

-        if not isinstance(arr, ndarray.NDArrayType):
+        #import pdb; pdb.set_trace()
+        #if not isinstance(arr, ndarray.NDArrayType):
+        if True:
             base = util.get_array_base(arr)
             for hdu in self._hdulist:
                 if hdu.data is None:
diff --git a/asdf/util.py b/asdf/util.py
index a45c863..a615da4 100644
--- a/asdf/util.py
+++ b/asdf/util.py
@@ -85,8 +85,9 @@ def get_array_base(arr):
     For a given Numpy array, finds the base array that "owns" the
     actual data.
     """
+    from .tags.core import ndarray
     base = arr
-    while isinstance(base.base, np.ndarray):
+    while isinstance(base.base, (np.ndarray, ndarray.NDArrayType)):
         base = base.base
     return base

My original thought of using numpy.shares_memory instead of comparing bases isn't compatible with the test_array_view_different_layout test added in PR https://github.com/asdf-format/asdf/pull/930 to avoid writing invalid data in FITS. I haven't replicated that failure, but the diff above (which doesn't use shares_memory) doesn't cause the test to fail.
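To make the distinction concrete, here is a rough illustration (my sketch, not asdf code) of the two matching strategies being compared:

import numpy as np

def owning_base(arr):
    # walk the .base chain to the array that owns the buffer,
    # mirroring what asdf.util.get_array_base does
    base = arr
    while isinstance(base.base, np.ndarray):
        base = base.base
    return base

def matches_by_base(tree_arr, hdu_data):
    # matches only when both arrays resolve to the same owning array object
    return owning_base(tree_arr) is owning_base(hdu_data)

def matches_by_memory(tree_arr, hdu_data):
    # also matches overlapping or re-laid-out views; that broader match is what
    # conflicted with test_array_view_different_layout, since such a view shares
    # memory with the HDU but cannot be stored as a plain reference to it
    return np.shares_memory(tree_arr, hdu_data)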

braingram commented 2 years ago

Duplication can also occur if the data attribute of an HDU is set to a non-ndarray object. This results in a copy (for example, assigning an NDArrayType to hdu.data triggers one), and the data will then be duplicated because the link between the item in the hdulist and the array in the tree is broken.
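A small demonstration of that copy (my sketch, using a plain list instead of an NDArrayType, since astropy coerces any non-ndarray input through np.array and copies it):

import astropy.io.fits
import numpy as np

arr = np.zeros((100, 10), dtype='f4')

hdu = astropy.io.fits.ImageHDU(name='SCI')
hdu.data = arr
print(np.shares_memory(hdu.data, arr))   # True: the HDU still points at arr's buffer

hdu2 = astropy.io.fits.ImageHDU(name='SCI2')
hdu2.data = arr.tolist()                 # non-ndarray input is coerced and copied
print(np.shares_memory(hdu2.data, arr))  # False: the link to the tree array is broken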

jdavies-st commented 2 years ago

Here's another minimal case indicating that it seems (I think) to be coming from asdf, independent of anything in jwst.datamodels or stdatamodels:

from jwst import datamodels
from asdf.fits_embed import AsdfInFits
from astropy.io import fits

dm = datamodels.ImageModel((2048, 2048))
dm.save("test.fits")
with fits.open("test.fits") as hdulist:
    print(hdulist.info())

af = AsdfInFits.open("test.fits")
af.write_to("test2.fits")
with fits.open("test2.fits") as hdulist:
    print(hdulist.info())

and one gets as output

Filename: test.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       7   ()      
  1  SCI           1 ImageHDU         9   (2048, 2048)   float32   
  2  ASDF          1 BinTableHDU     11   1R x 1C   [683B]   

Filename: test2.fits
No.    Name      Ver    Type      Cards   Dimensions   Format
  0  PRIMARY       1 PrimaryHDU       7   ()      
  1  SCI           1 ImageHDU         9   (2048, 2048)   float32   
  2  ASDF          1 BinTableHDU     11   1R x 1C   [16777944B]   

Note that initializing the AsdfInFits object with a FITS file bypasses anything that jwst.datamodels and stdatamodels might be doing in terms of validation or shuffling. I think.

EDIT: oops, I realized that above there's already a minimal case with jwst. 😅

braingram commented 2 years ago

Thanks! Confirmation is always good :)

Here's another minimal case using stdatamodels.

import asdf
import astropy.io.fits
import stdatamodels
import numpy

# --- this bit here replicates test infrastructure already in stdatamodels ---
cfg = asdf.get_config()
path_to_stdatamodels_repo = FILL_THIS_IN_FOR_YOUR_CHECKOUT
schema_dir = f"{path_to_stdatamodels_repo}/tests/schemas"
cfg.add_resource_mapping(asdf.resource.DirectoryResourceMapping(
    schema_dir, "http://example.com/schemas"))

class FitsModel(stdatamodels.DataModel):
    schema_url = "http://example.com/schemas/fits_model"
# --- end of replication --

arr = numpy.zeros((1000, 100), dtype='f4')
m = FitsModel(arr)
m.save('test1.fits')

m2 = FitsModel.from_fits('test1.fits')
m2.save('test2.fits')

ff1 = astropy.io.fits.open('test1.fits')
ff2 = astropy.io.fits.open('test2.fits')
print(f"original size = {ff1[-1].size}")
print(f"resave size   = {ff2[-1].size}")

I believe the fix will have to occur in ASDF (discussion is underway on PR https://github.com/asdf-format/asdf/pull/1233), but adding these cases as tests to stdatamodels and perhaps also jwst would help to catch any regressions. I will open PRs with tests for jwst and stdatamodels (with tests that currently fail and will be marked as skip). Once we have a fix we can unskip them (or wait to accept the PRs until the fix is available).
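A hedged sketch of what such a skipped regression test could look like (the test name and threshold are my own, and it assumes the round trip through datamodels.open shows the same growth as the examples above):

import astropy.io.fits
import pytest
from jwst import datamodels

@pytest.mark.skip(reason="asdf duplicates HDU arrays on resave; see asdf-format/asdf#1232")
def test_resave_does_not_grow_asdf_extension(tmp_path):
    fn1 = str(tmp_path / 'test1.fits')
    fn2 = str(tmp_path / 'test2.fits')
    datamodels.ImageModel((2048, 2048)).save(fn1)
    with datamodels.open(fn1) as dm:
        dm.save(fn2)
    with astropy.io.fits.open(fn1) as ff1, astropy.io.fits.open(fn2) as ff2:
        # the ASDF extension should stay small rather than gaining the full 16 MB array
        assert ff2['ASDF'].size < 2 * ff1['ASDF'].size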

perrygreenfield commented 2 years ago

I'm trying to think of a case where an inadvertent reference to an HDU array would happen. I'm asking whether any JWST products are ever generated with identical DQ or ERR arrays (e.g., with constant values, or identical defaults). That may affect whether this is an issue in the only context this machinery is likely to be used for.

perrygreenfield commented 2 years ago

Even if this leads to multiple ASDF attributes pointing to the same HDU (with other HDUs holding the same array contents that really should have been matched), and some of those others not actually being used, I wonder if that is really a problem. The point of the FITS files is to provide users with something they are comfortable with and know how to interpret; in most cases they would ignore the ASDF content. And code that does use the ASDF content shouldn't care where in the FITS file the array values came from. The main worry on the ASDF side is whether different components end up sharing the same array when they shouldn't. In most (and hopefully all) cases I think tasks make copies of the arrays, but I'm not completely sure of that. Another question for Howard.

perrygreenfield commented 2 years ago

I asked Howard the following questions:

1. We are looking more at this issue with duplicated arrays. Do products ever get generated with identical default DQ and/or ERR arrays within the same file? Or are they generally written out with populated results when such arrays appear?
2. A related question: are all data read from FITS files copied before modification? (Some of the solutions being considered may end up with ASDF elements sharing the data from the same HDU when the file is read.) Would that cause a problem?

And Howard's responses:

As to the first question, the DQ arrays get populated with something non-default in like the very first step of the pipeline and proceed with more updates from there. So I doubt there are ever any products where multiple DQ arrays would have all default (e.g. zero) values. ERR arrays don't get populated until the ramp_fit step, which is the end of Stage 1 processing (rate products). So it would only be intermediate products saved before that stage (e.g. our 4D "ramp" product) that might still have all default ERR array values. But each product would only have 1 ERR array. It isn't until midway through Stage 2 processing that we ever get products containing multiple DQ or ERR arrays, which occurs for spectroscopic modes that have multiple sources split out from the original single image. But by that time the DQ and ERR arrays should never still be at default values.

Most of the time we have steps make a copy of the input data before making changes. Originally we had things make corrections in-place, just to save memory (i.e. no need to have 2 copies of the data in memory all the time). That's fine when running a default pipeline end-to-end from the command line, because you get the correct final results in the output files, but when running interactively from within python, it's not always desirable. Users may not want to have their inputs tampered with, especially if they want to run a step multiple times to tweak params or whatever.