Closed eastjames closed 4 months ago
Reopening packed data, concatenating it, and then saving back to disk can introduce a subtle error, described here: https://github.com/pydata/xarray/issues/5739
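To make the pitfall concrete, here is a minimal pure-Python sketch (not using xarray itself; the data values and encodings are made up for illustration). It emulates what happens when two files packed with different `scale_factor`/`add_offset` are decoded, concatenated, and then re-saved with only the first file's encoding: values outside the first encoding's representable range are silently clipped.

```python
# Illustration of the xarray issue #5739 pitfall with toy numbers.
SHORT_MIN, SHORT_MAX = -32768, 32767

def pack(values, scale_factor, add_offset):
    # emulate NC_SHORT packing, clamping to the int16 range
    packed = []
    for v in values:
        p = int(round((v - add_offset) / scale_factor))
        packed.append(max(SHORT_MIN, min(SHORT_MAX, p)))
    return packed

def unpack(packed, scale_factor, add_offset):
    return [p * scale_factor + add_offset for p in packed]

# two files packed with different encodings (data, scale_factor, add_offset)
file_a = ([1800.0, 1900.0], 0.05, 1850.0)
file_b = ([1850.0, 4000.0], 0.10, 2900.0)

# decode both files and concatenate: all values survive intact
decoded = []
for data, scale, offset in (file_a, file_b):
    decoded += unpack(pack(data, scale, offset), scale, offset)

# re-saving everything with only file A's encoding clips 4000.0,
# because (4000 - 1850) / 0.05 = 43000 overflows the int16 range
resaved = unpack(pack(decoded, 0.05, 1850.0), 0.05, 1850.0)
```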
@laestrada @jimmielin @nicholasbalasus
Just making this note since we talked about it.
Thanks @eastjames.
As we discussed during the methane subgroup meeting, I wonder if we can output at lower precision in GEOS-Chem. Right now most output is in either real4 or real8 (this is defined in the registry of the state variables and then used by history_netcdf_mod.F90).
But I also had another question. NC_SHORT is an integer data type according to the netCDF docs, so it shouldn't have decimal values, but I might be reading the docs wrong. Did you convert the data to another unit like ppb before converting the values to NC_SHORT?
I think we should stick with ncpdq because, while it's not Python-native, it's much higher performance than whatever post-processing is done in Python, since Python will end up calling similar C++/Fortran routines anyway but with more overhead. It should be easy to use Python to call the shell command ncpdq ... to compress all files in a folder in a loop, if desired for post-processing, assuming we can avoid the pitfalls mentioned above.
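A loop like that could look as follows. This is a sketch, not IMI code: the folder layout, file pattern, and output naming are assumptions, and it requires NCO's ncpdq on the PATH. The `-P all_new` packing policy tells ncpdq to compute a fresh scale_factor/add_offset for every packable variable.

```python
import subprocess
from pathlib import Path

def ncpdq_command(src, dst, policy="all_new"):
    # -O overwrites the output file if it exists;
    # -P all_new computes new scale_factor/add_offset per variable
    return ["ncpdq", "-O", "-P", policy, str(src), str(dst)]

def pack_folder(folder, pattern="*.nc4"):
    """Pack every matching netCDF file in a folder (requires NCO's ncpdq)."""
    for src in sorted(Path(folder).glob(pattern)):
        dst = src.with_name(src.stem + ".packed.nc4")
        subprocess.run(ncpdq_command(src, dst), check=True)
```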
Thanks @jimmielin. You're right about NC_SHORT. ncpdq is smart enough to do all the conversions, so no, I didn't convert manually. My (basic) understanding of data packing is that you first compute a scale_factor and add_offset for the data. Then the original data are converted to the values to be stored with

packed_value = floor((unpacked_value - add_offset) / scale_factor)

Finally, packed_value is stored as an integer, and add_offset and scale_factor are encoded with the variable. When the netCDF file is reopened, the data are "unpacked" back to NC_FLOAT using add_offset and scale_factor.
The netCDF Best Practices guide describes how to calculate add_offset and scale_factor. ncpdq does this calculation automatically.
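For concreteness, the Best Practices calculation can be sketched in plain Python. This is a simplified illustration, not what ncpdq literally does: real tools also reserve a _FillValue and handle degenerate ranges, and this sketch packs with round-to-nearest (some implementations floor instead). The sample values are hypothetical.

```python
def packing_params(data, nbits=16):
    # netCDF Best Practices: reserve one of the 2**nbits integer values
    # for _FillValue, hence the (2**nbits - 2) denominator
    dmin, dmax = min(data), max(data)
    scale_factor = (dmax - dmin) / (2**nbits - 2)
    add_offset = (dmax + dmin) / 2.0
    return scale_factor, add_offset

def roundtrip(data, nbits=16):
    # pack to integers, then unpack the way a netCDF reader would
    scale, offset = packing_params(data, nbits)
    packed = [int(round((v - offset) / scale)) for v in data]
    return [p * scale + offset for p in packed]

xch4_ppb = [1750.0, 1825.5, 1910.2, 1999.9]   # hypothetical values in ppb
recovered = roundtrip(xch4_ppb)
# with round-to-nearest, the absolute error is at most scale_factor / 2
```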
The danger is that if files are saved with different add_offset and scale_factor values, users can run into trouble when opening/manipulating/saving the data. So it might be dangerous to do this by default in GEOS-Chem. But it could still be useful in the IMI. For example, if GEOS-Chem wrote packed data to disk only in the IMI context, with consistent add_offset and scale_factor, then the IMI could read these and properly unpack them, potentially halving the disk space required for creating Jacobians, even before archiving the data.
This issue has been automatically marked as stale because it has not had recent activity. If there are no updates within 7 days it will be closed. You can add the "never stale" tag to prevent this issue from being closed.
@eastjames Is this still something we should consider adding? If so, would you be able to create a pull request?
@msulprizio with the changes to IMI practices including cleanup scripts by @laestrada and new disk space, I don't think this is a priority. What do you think?
@eastjames I was thinking the same but wanted to check with you. I will close out this issue for now but we can always consider adding this feature in the future if needed.
Name and Institution (Required)
Name: James East Institution: Harvard ACMG
New IMI feature or discussion
It might be possible to reduce data storage needs by using lower precision data types to store XCH4 and Jacobians.
Packing data by storing it with the 16-bit NC_SHORT data type instead of the 32-bit NC_FLOAT data type could reduce storage needs by ~50%. Some precision is lost in the packing, which could affect the Jacobian, but the differences are orders of magnitude smaller than the data themselves. The packed data can be opened and manipulated like normal.
I did a short test on 1 SpeciesConc file from a 2x2.5 72-level CH4 specialty simulation with code to reproduce and results below:
In the shell:
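(The original command was not preserved in this thread; a representative ncpdq invocation, with an illustrative filename, would be:)

```shell
# pack floating-point variables to NC_SHORT, computing a new
# scale_factor/add_offset for each variable (filename is illustrative)
ncpdq -O -P all_new GEOSChem.SpeciesConc.20190101_0000z.nc4 \
      GEOSChem.SpeciesConc.20190101_0000z.packed.nc4
```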
In python:
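(The original Python snippet was also not preserved; below is a hedged reconstruction of the comparison step. The filenames and variable name are assumptions, and xarray applies scale_factor/add_offset automatically when opening a packed file. If xarray or the files are unavailable, the sketch falls back to estimating the worst-case 16-bit packing error instead.)

```python
# Reconstruction of the packed-vs-original comparison (illustrative names).
try:
    import xarray as xr

    orig = xr.open_dataset("GEOSChem.SpeciesConc.20190101_0000z.nc4")
    packed = xr.open_dataset("GEOSChem.SpeciesConc.20190101_0000z.packed.nc4")
    # surface-level CH4 difference, converted from mol/mol to ppb
    diff_ppb = (packed["SpeciesConc_CH4"] - orig["SpeciesConc_CH4"]).isel(
        time=0, lev=0
    ) * 1.0e9
    max_err_ppb = float(abs(diff_ppb).max())
except (ImportError, OSError, ValueError):
    # xarray or the files are unavailable here: estimate the worst-case
    # 16-bit packing error for a field spanning roughly 1600-2000 ppb
    scale_factor = (2000.0 - 1600.0) / (2**16 - 2)
    max_err_ppb = scale_factor / 2   # on the order of 0.003 ppb
print(max_err_ppb)
```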
The original file size is 110 MB; the packed file size is 50 MB. The difference after reopening the file and plotting surface-level CH4 is shown below, in units of ppb. Differences appear random except at the poles. The largest differences at the surface are ~0.015 ppb. This could reduce data storage needs by roughly another half, on top of reductions from not storing vertical profiles.
In this test I did the data packing with the NCO ncpdq operator (https://nco.sourceforge.net/nco.html#ncpdq), but there are probably other tools that do the same thing.