The FITS file format provides a way of compressing multi-dimensional images to reduce their size on disk. This functionality works by splitting the image up into "tiles" compressing them using one of a selection of compression algorithms and then saving the compressed bytes into a cell in a FITS binary table.
In Astropy this functionality is exposed by CompImageHDU. When loading the data associated with a compressed image, the whole HDU is passed through to a function implemented in C called decompress_hdu (the same happens on compress when writing). This HDU object is then used to pass the whole binary table (and it's heap) to the cfitsio library which returns a complete decompressed array of the image. (Again the same happens for compress but a binary table is returned).
This functionality is the only reason astropy uses cfitsio, yes all 168,643 lines of it. Implementing this functionality this way also means that it is impossible to load and decompress only part of the image, the whole image is always fully loaded.
Description of this PR
This pull request is a big step towards implementing all the FITS tiled image compression in Python. All the functionality of decompress_hdu and compress_hdu currently provided in compressionmodule.c is replaced by equivalent Python code. The compression and decompression of the individual tiles are implemented in the style of numcodecs Codec classes (more on this below). The implementation of the compression and decompression algorithms for RICE_1, PLIO_1 and HCompress are still used from the cfitsio code, but only a small number of files from the cfitsio library are now needed.
While in this PR we do not expose any functionality for (de)compressing individual tiles, the primitives for this work now exist, as it is possible to use these Codec classes to (de)compress the bytes of an individual tile, and the C wrappers for the cfitsio functions only handle (de)compressing a single tile at a time.
Testing
To validate the implementation in this PR we needed to be able to test against FITS files not (de)compressed with Astropy. We approached this in two ways.
1) Some canonical test files downloaded from the FITS compression pages have been added to the repo. These are small, but provide a useful "known good" file for each compression algorithm.
2) A new heavily parametrised test_fitsio.py file has been added. This file uses the fitsio package to write or read files read or written by Astropy and then the data is compared. Running this test file (obviously) needs the fitsio package to be installed, which requires a system install of cfitsio to compile the sdist against. This test file is automatically skipped if fitsio can not be imported, but a tox factor has been added to force it to be run and to require fitsio to ensure it is always run on the CI.
2a) For extra credit and to hunt segfaults :gun:, all these tests have been run with valgrind and no memory leaks in our code (just fitsio) were found.
New Functionality or Bug Fixes in this PR
This PR aimed to change nothing as far as users of the library are concerned. Unless someone was importing directly from astropy.io.fits.compression (which I really hope they weren't) then this change should go unnoticed. However, in the process of rewriting the compression functionality we found some things which were missing from Astropy or the implementations in cfitsio these things are:
Compilation time for astropy has now been reduced by 40% with this PR - on a Linux desktop we got down from 66.70 seconds to 41.24 seconds :zap:
>3D data should now be supported by Astropy. This was not supported before as cfitsio can not read or write these data.
Compressing a floating point tile containing NaN will no longer cause the whole tile to be read as NaN - astropy/astropy#11212 (see also esheldon/fitsio#356).
Various existing segfaults with PLIO_1 are fixed. (I noticed segfaults with int32 as described in astropy/astropy#9715), but also there's a pathological case where the compressed bytes can be longer than the uncompressed bytes, which triggers a segfault in cfitsio, which is fixed in our new wrapper.
Next Steps
This PR could have removed all of cextern/cfitsio but we decided that we shouldn't for the initial review (the diff is obnoxious). We can choose to either do this as the last thing before merging this PR, or in a follow up PR. We did have to already remove some code from imcompress.c here in order to get things to work.
Related Issues
fixe astropy/astropy#11212
fixe astropy/astropy#9715
Will fix astropy/astropy#541 and astropy/astropy#9625 when cfitsio removed
This pull request addresses the first checklist item in astropy/astropy#3895 and opens the door to some (or all) of the others.
Checklist for package maintainer(s)
This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.
[ ] Do the proposed changes actually accomplish desired goals?
[ ] Is rebase and/or squash necessary? If so, please provide the author with appropriate instructions. Also see "When to rebase and squash commits".
[ ] Did the CI pass? If no, are the failures related? If you need to run daily and weekly cron jobs as part of the PR, please apply the Extra CI label. Codestyle issues can be fixed by the bot.
[ ] Is a change log needed? If yes, did the change log check pass? If no, add the no-changelog-entry-needed label. If this is a manual backport, use the skip-changelog-checks label unless special changelog handling is necessary.
[ ] Is this a big PR that makes a "What's new?" entry worthwhile and if so, is (1) a "what's new" entry included in this PR and (2) the "whatsnew-needed" label applied?
[ ] Is a milestone set? Milestone must be set but astropy-bot check might be missing; do not let the green checkmark fool you.
[ ] At the time of adding the milestone, if the milestone set requires a backport to release branch(es), apply the appropriate backport-X.Y.x label(s) before merge.
👋 Thank you for your draft pull request! Do you know that you can use [ci skip] or [skip ci] in your commit messages to skip running continuous integration tests until you are ready?
This pull request contains phase one of the work proposed in the cycle 3 funding call.
Background
The FITS file format provides a way of compressing multi-dimensional images to reduce their size on disk. This functionality works by splitting the image up into "tiles" compressing them using one of a selection of compression algorithms and then saving the compressed bytes into a cell in a FITS binary table.
In Astropy this functionality is exposed by
CompImageHDU
. When loading the data associated with a compressed image, the whole HDU is passed through to a function implemented in C called decompress_hdu (the same happens on compress when writing). This HDU object is then used to pass the whole binary table (and it's heap) to the cfitsio library which returns a complete decompressed array of the image. (Again the same happens for compress but a binary table is returned).This functionality is the only reason astropy uses cfitsio, yes all 168,643 lines of it. Implementing this functionality this way also means that it is impossible to load and decompress only part of the image, the whole image is always fully loaded.
Description of this PR
This pull request is a big step towards implementing all the FITS tiled image compression in Python. All the functionality of
decompress_hdu
andcompress_hdu
currently provided in compressionmodule.c is replaced by equivalent Python code. The compression and decompression of the individual tiles are implemented in the style of numcodecsCodec
classes (more on this below). The implementation of the compression and decompression algorithms forRICE_1
,PLIO_1
andHCompress
are still used from the cfitsio code, but only a small number of files from the cfitsio library are now needed.While in this PR we do not expose any functionality for (de)compressing individual tiles, the primitives for this work now exist, as it is possible to use these
Codec
classes to (de)compress the bytes of an individual tile, and the C wrappers for the cfitsio functions only handle (de)compressing a single tile at a time.Testing
To validate the implementation in this PR we needed to be able to test against FITS files not (de)compressed with Astropy. We approached this in two ways.
1) Some canonical test files downloaded from the FITS compression pages have been added to the repo. These are small, but provide a useful "known good" file for each compression algorithm. 2) A new heavily parametrised
test_fitsio.py
file has been added. This file uses thefitsio
package to write or read files read or written by Astropy and then the data is compared. Running this test file (obviously) needs thefitsio
package to be installed, which requires a system install ofcfitsio
to compile the sdist against. This test file is automatically skipped iffitsio
can not be imported, but a tox factor has been added to force it to be run and to requirefitsio
to ensure it is always run on the CI. 2a) For extra credit and to hunt segfaults :gun:, all these tests have been run with valgrind and no memory leaks in our code (justfitsio
) were found.New Functionality or Bug Fixes in this PR
This PR aimed to change nothing as far as users of the library are concerned. Unless someone was importing directly from
astropy.io.fits.compression
(which I really hope they weren't) then this change should go unnoticed. However, in the process of rewriting the compression functionality we found some things which were missing from Astropy or the implementations in cfitsio these things are:NaN
will no longer cause the whole tile to be read asNaN
- astropy/astropy#11212 (see also esheldon/fitsio#356).int32
as described in astropy/astropy#9715), but also there's a pathological case where the compressed bytes can be longer than the uncompressed bytes, which triggers a segfault in cfitsio, which is fixed in our new wrapper.Next Steps
This PR could have removed all of
cextern/cfitsio
but we decided that we shouldn't for the initial review (the diff is obnoxious). We can choose to either do this as the last thing before merging this PR, or in a follow up PR. We did have to already remove some code from imcompress.c here in order to get things to work.Related Issues
fixe astropy/astropy#11212 fixe astropy/astropy#9715 Will fix astropy/astropy#541 and astropy/astropy#9625 when cfitsio removed This pull request addresses the first checklist item in astropy/astropy#3895 and opens the door to some (or all) of the others.
Checklist for package maintainer(s)
This checklist is meant to remind the package maintainer(s) who will review this pull request of some common things to look for. This list is not exhaustive.
Extra CI
label. Codestyle issues can be fixed by the bot.no-changelog-entry-needed
label. If this is a manual backport, use theskip-changelog-checks
label unless special changelog handling is necessary.astropy-bot
check might be missing; do not let the green checkmark fool you.backport-X.Y.x
label(s) before merge.