geoschem / integrated_methane_inversion

Integrated Methane Inversion workflow repository.
https://imi.readthedocs.org
MIT License

Data packing to reduce archived data volume #164

Closed eastjames closed 4 months ago

eastjames commented 11 months ago

Name and Institution (Required)

Name: James East
Institution: Harvard ACMG

New IMI feature or discussion

It might be possible to reduce data storage needs by using lower precision data types to store XCH4 and Jacobians.

Packing data by storing it with the 16-bit NC_SHORT data type instead of the 32-bit NC_FLOAT data type could reduce storage needs by ~50%. Some precision is lost in the packing, which could affect the Jacobian, but the differences are orders of magnitude smaller than the data values. The packed data can be opened and manipulated like normal.

I ran a short test on one SpeciesConc file from a 2x2.5, 72-level CH4 specialty simulation; code to reproduce and results are below.

In the shell:

ncpdq -M flt_sht GEOSChem.SpeciesConc.20190819_0000z.nc4 test.nc

In python:

import xarray as xr
import matplotlib.pyplot as plt

# Original file: surface level, time mean, converted from mol/mol to ppb
with xr.open_dataset('GEOSChem.SpeciesConc.20190819_0000z.nc4') as inf:
    ds = inf.isel(lev=0).mean('time')*1e9

# Packed file, same processing
with xr.open_dataset('test.nc') as inf:
    ds2 = inf.isel(lev=0).mean('time')*1e9

# Difference introduced by packing
diff = ds['SpeciesConcVV_CH4'] - ds2['SpeciesConcVV_CH4']
diff.plot()
plt.title('$\\Delta$XCH$_4$ [ppb]\nOriginal (110 MB) minus packed (50 MB)\n'
          f'mean difference at surface = {diff.mean().values:0.6f} ppb')

The original file is 110 MB; the packed file is 50 MB. The difference after reopening the packed file, plotted for CH4 at the surface level, is shown below in ppb. Differences appear random except at the poles, and the largest surface differences are ~0.015 ppb. Packing could thus cut storage needs roughly in half, on top of the reductions from not storing vertical profiles.

In this test I did the data packing with NCO's ncpdq (https://nco.sourceforge.net/nco.html#ncpdq), but other tools can likely do the same thing.

[Figure: map of the surface XCH4 difference, original (110 MB) minus packed (50 MB), in ppb]

eastjames commented 10 months ago

Reopening packed data, concatenating, and then saving to disk can introduce a subtle pitfall: xarray remembers each file's packing in .encoding, and on save it can re-apply one file's scale_factor and add_offset to the whole concatenated dataset, corrupting values if the files were packed differently. Described here: https://github.com/pydata/xarray/issues/5739
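One way to sidestep this, as a minimal sketch (file names are hypothetical): drop the stale packing keys from each variable's encoding before re-saving, so xarray writes plain floats instead of re-packing with a single file's scale_factor and add_offset.

import xarray as xr

# Open and concatenate packed files (hypothetical names); xarray unpacks
# to float on read but remembers the packing parameters in .encoding
ds = xr.open_mfdataset(['packed_day1.nc', 'packed_day2.nc'], combine='by_coords')

# Remove the remembered packing so the concatenated data are written as floats
for var in ds.data_vars:
    for key in ('dtype', 'scale_factor', 'add_offset'):
        ds[var].encoding.pop(key, None)

ds.to_netcdf('concatenated.nc')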

@laestrada @jimmielin @nicholasbalasus

Just making this note since we talked about it.

jimmielin commented 10 months ago

Thanks @eastjames.

As we discussed during methane subgroup, I wonder if we can output at lower precision in GEOS-Chem. Right now most output is in either real4 or real8 (this is defined in the registry of the state variables and then used by history_netcdf_mod.F90).

But I also had another question. NC_SHORT is an integer data type according to the netCDF docs, so it shouldn't hold decimal values, though I might be reading the docs wrong. Did you convert the data to another unit like ppb before converting the values to NC_SHORT?

I think we should stick with ncpdq: while it's not Python-native, it's much higher performance than equivalent post-processing in Python, since Python would end up calling similar compiled routines anyway but with more overhead. If desired for post-processing, it should be easy to use Python to call the shell command ncpdq in a loop to compress all files in a folder, assuming we can avoid the pitfalls mentioned above.
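For example, a minimal sketch of that loop (the directory and file pattern are assumptions):

import glob
import subprocess

# Pack every SpeciesConc file in the output directory with ncpdq,
# using the same float-to-short packing map as the test above
for path in sorted(glob.glob('OutputDir/GEOSChem.SpeciesConc.*.nc4')):
    packed = path.replace('.nc4', '.packed.nc4')
    subprocess.run(['ncpdq', '-M', 'flt_sht', path, packed], check=True)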

eastjames commented 10 months ago

Thanks @jimmielin. You're right about NC_SHORT. ncpdq is smart enough to do all the conversions, so no, I didn't convert manually. My (basic) understanding of data packing is that you first compute a scale_factor and add_offset for the data. Then the original data are converted to the stored values with

packed_value = round((unpacked_value - add_offset) / scale_factor)

Finally, packed_value is stored as an integer, and add_offset and scale_factor are saved as attributes of the variable. When the netCDF file is reopened, the data are "unpacked" back to NC_FLOAT as unpacked_value = packed_value * scale_factor + add_offset.

The netCDF Best Practices guide describes how to calculate add_offset and scale_factor; ncpdq does this calculation automatically.
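As an illustration only (not ncpdq's actual code), a NumPy sketch of the Best Practices calculation for 16-bit packing:

import numpy as np

def pack_16bit(data):
    # Reserve one of the 2**16 integers for a fill value, per Best Practices
    n = 16
    dmin, dmax = float(data.min()), float(data.max())
    scale_factor = (dmax - dmin) / (2**n - 2)
    add_offset = (dmax + dmin) / 2
    packed = np.round((data - add_offset) / scale_factor).astype(np.int16)
    return packed, scale_factor, add_offset

def unpack(packed, scale_factor, add_offset):
    # Recover approximate floats from the packed integers
    return packed * scale_factor + add_offset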

The danger is that if files are saved with different add_offset and scale_factor values, users can run into trouble when opening, manipulating, and saving the data, so it might be risky to do this by default in GEOS-Chem. But it could still be useful in the IMI: if GEOS-Chem wrote packed data to disk, in the IMI context only, with a consistent add_offset and scale_factor, then the IMI could read and properly unpack them, potentially halving the disk space required for creating Jacobians even before archiving the data. A sketch of what that could look like is below.
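As a sketch only (not existing IMI code): with xarray, consistent packing could be imposed at write time by fixing the encoding once for the whole ensemble. The variable name is from the test above; the numeric values are assumptions chosen to span roughly 0 to 2.5e-6 mol/mol of CH4.

import xarray as xr

# Fixed packing parameters, chosen once and reused for every file
encoding = {
    'SpeciesConcVV_CH4': {
        'dtype': 'int16',
        'scale_factor': 4.0e-11,  # ~0.04 ppb precision (assumed value)
        'add_offset': 1.25e-6,    # center of the assumed CH4 range
        '_FillValue': -32767,
    }
}

ds = xr.open_dataset('GEOSChem.SpeciesConc.20190819_0000z.nc4')
ds.to_netcdf('packed_consistent.nc4', encoding=encoding)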

github-actions[bot] commented 4 months ago

This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent this issue from being closed.

msulprizio commented 4 months ago

@eastjames Is this still something we should consider adding? If so, would you be able to create a pull request?

eastjames commented 4 months ago

@msulprizio With the changes to IMI practices, including the cleanup scripts by @laestrada and the new disk space, I don't think this is a priority. What do you think?

msulprizio commented 4 months ago

@eastjames I was thinking the same but wanted to check with you. I will close out this issue for now but we can always consider adding this feature in the future if needed.