GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0

Add quantize support in pFIO #1779

Closed mathomp4 closed 1 year ago

mathomp4 commented 1 year ago

Soon we'll be moving to a Baselibs with the latest netCDF which has support for the quantize updates from @czender, @edwardhartnett, and colleagues. So, we should add support to that here in MAPL/pFIO.

Once we have it in, we will need to do testing to compare it to our current bit-shaving routine (which I think is roughly equivalent to bit grooming with all zeroes?)
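For concreteness, "bit shaving with all zeroes" can be sketched as simply zeroing low-order mantissa bits of each float32. This is a hypothetical illustration, not MAPL's actual routine (the `shave_bits` helper and its bit-count convention are assumptions):

```python
import struct

def shave_bits(x: float, nbits: int) -> float:
    """Zero all but the top `nbits` explicit mantissa bits of a float32.

    Hypothetical sketch of classic bit shaving (not MAPL's actual
    routine): low-order bits are truncated to zero with no rounding,
    so positive values are always shaved downward.
    """
    i = struct.unpack(">I", struct.pack(">f", x))[0]   # raw float32 bit pattern
    mask = (0xFFFFFFFF << (23 - nbits)) & 0xFFFFFFFF   # keep sign + exponent + nbits
    return struct.unpack(">f", struct.pack(">I", i & mask))[0]

# The long runs of trailing zero bits are what deflate then compresses well.
print(shave_bits(214.6440429688, 10))  # → 214.625
print(shave_bits(3.14159, 4))          # → 3.125
```

Because shaving truncates instead of rounding, it introduces a systematic bias toward zero, which is one of the things the new quantize modes improve on.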

mathomp4 commented 1 year ago

Relevant docs:

https://docs.unidata.ucar.edu/netcdf-c/current/group__variables.html#ga669e4b23b9b7fb321b995982972cdd93

mathomp4 commented 1 year ago

Here is a table of some results of runs at C720.

GBR = Granular BitRound, BR = BitRound; All runs save "nodeflate" are using a deflate level of 1 (so zlib).

I decided to add the value of T in Kelvin at a single point, extracted from each file with:

```
ncks -v T -d lev,47 -d lon,1000 -d lat,1000 -s "%16.10f\n" -H -C $file
```

The compression column is relative to the deflate run with no bit shaving, since that is our "default" output setup from History.


| Run | Shaving Level | Size of geosgcm_prog (B) | Compression | Value of T at 1000-1000-47 | Difference | % Diff |
| --- | --- | --- | --- | --- | --- | --- |
| Stock | nodeflate | 8018095264 | 200% | 214.6440429688 | 0.0000000000 | 0.0000% |
| Stock | full | 4017618214 | 100% | 214.6440429688 | 0.0000000000 | 0.0000% |
| Stock | nbits=10 | 2752834049 | 69% | 214.6441497803 | -0.0001068115 | 0.0000% |
| GBR | nsd=2 | 699334501 | 17% | 216.0000000000 | -1.3559570312 | -0.6317% |
| GBR | nsd=3 | 1279336525 | 32% | 215.0000000000 | -0.3559570312 | -0.1658% |
| GBR | nsd=4 | 1818289524 | 45% | 214.6250000000 | 0.0190429688 | 0.0089% |
| GBR | nsd=5 | 2515539825 | 63% | 214.6406250000 | 0.0034179688 | 0.0016% |
| GBR | nsd=6 | 3233172067 | 80% | 214.6445312500 | -0.0004882812 | -0.0002% |
| BR | nsb=7 | 936043985 | 23% | 215.0000000000 | -0.3559570312 | -0.1658% |
| BR | nsb=10 | 1497581922 | 37% | 214.6250000000 | 0.0190429688 | 0.0089% |
| BR | nsb=13 | 1538078618 | 38% | 214.6406250000 | 0.0034179688 | 0.0016% |
| BR | nsb=16 | 1897122117 | 47% | 214.6445312500 | -0.0004882812 | -0.0002% |
mathomp4 commented 1 year ago

I might need to invoke the name of @czender. From my "value of T" it looks like I managed to find (roughly?) equivalent values of GBR-nsd and BR-nsb, e.g., GBR@nsd=5 looks to correspond to BR@nsb=13. But I am surprised the compression ratios are so different! 63% vs 38%!

I am trying to read through the NCO docs and the GMD paper of @rkouznetsov but, well, lots of information there! 😄

Perhaps one shouldn't focus on a single point out of 199203840 (2880 × 1441 × 48). Obviously we need to figure out a more global metric of difference!

czender commented 1 year ago

Hi Matt, BitRound zeroes the same number of bits for all values in the field. Granular BitRound works at a more granular level: the number of bits zeroed depends on each value. The number of bits stored/zeroed is selected to retain the specified NSD. Thus the comparison you have done, of the compression or error resulting from BR or GBR for a single value, needs to be rethought so that it characterizes the (absolute) mean error for the entire field. I assume the compression numbers already reflect the entire field; if so, you would only need to redo the error characterization.
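A field-wide error summary along these lines might look like the sketch below (plain Python for clarity; `field_errors` is a hypothetical helper, and in practice one would vectorize this with NumPy over the full 2880 × 1441 × 48 array):

```python
def field_errors(orig, quant):
    """Summarize quantization error over a whole field rather than a
    single grid point: max absolute, mean absolute, and max relative error."""
    abs_err = [abs(o - q) for o, q in zip(orig, quant)]
    # Skip exact zeros to avoid dividing by zero in the relative error.
    rel_err = [abs(o - q) / abs(o) for o, q in zip(orig, quant) if o != 0.0]
    return max(abs_err), sum(abs_err) / len(abs_err), max(rel_err)

# A toy 3-point "field" and its quantized counterpart:
max_abs, mean_abs, max_rel = field_errors([1.0, 2.0, 4.0], [1.0, 2.5, 3.5])
```

Max relative error is the natural metric for BitRound-style quantization, since the bound it guarantees is relative, not absolute.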

rkouznetsov commented 1 year ago

Hi Matt,

Every lossy compression does irreversible damage to a dataset, so one has to really know what one is doing.

The basic workflow requires you to specify acceptable margins (both metrics and values) for the introduced error, and the properties of the dataset that should be kept. Then you maximize the compression ratio within these constraints. The latter stage is technical and straightforward. I would recommend implementing it inside your model, so as not to reduce the model's portability by tightening version dependencies. It is simple, and you'll have full control over what it does.

Deciding on the acceptable error is the tricky part. There are no ready-made solutions, and no simple, universal hard science behind it. It really depends on the nature of the variable and the intended application. Moreover, there is a lot of ambiguous terminology (https://github.com/Unidata/netcdf-c/discussions/2406). Even if you specify the acceptable error, you should always check that the underlying software does what you think it should. E.g. the same NSD value can mean a tenfold difference in relative error even between two minor-version revisions of the same software (https://github.com/nco/nco/issues/256), and this is considered a feature rather than a bug.

The metrics I prefer are maximum absolute and/or relative error. These metrics are simple, and invariant with respect to sub-setting, striding, and scaling (absolute error scales with the data). If you are after the absolute error, none of the algorithms currently implemented in netCDF quantize would work; my paper has a codelet for that. If you are after the way your data would look when printed with a given number of decimals (NSD in Zender 2015 terms), then Granular BitRound is the right thing. If you are after minimizing the relative error, BitRound (in NCO terms), aka "relative precision trimming", would be your choice. Implementing the latter is also quite straightforward (see the paper or the SILAM code).
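To illustrate how straightforward the BitRound idea is, here is a sketch in Python (assuming 1 ≤ nsb ≤ 22; the `bitround` helper is an illustration of the technique, not the netCDF, NCO, or SILAM implementation, which differ in details such as IEEE special-value handling and vectorization):

```python
import struct

def bitround(x: float, nsb: int) -> float:
    """Round a float32 to `nsb` explicit mantissa bits, to nearest.

    Sketch of the BitRound idea: add half of the last kept bit, then
    zero the dropped bits. A mantissa carry correctly bumps the exponent.
    """
    i = struct.unpack(">I", struct.pack(">f", x))[0]   # raw float32 bit pattern
    shift = 23 - nsb
    i = ((i + (1 << (shift - 1))) >> shift) << shift   # round-to-nearest, zero tail
    return struct.unpack(">f", struct.pack(">I", i & 0xFFFFFFFF))[0]

# Round-to-nearest bounds the relative error by 2**-(nsb + 1).
x = 214.6440429688
for nsb in (7, 10, 13, 16):
    assert abs(bitround(x, nsb) - x) / x <= 2.0 ** -(nsb + 1)
```

For the single T value in the table above, this sketch reproduces the BR rows exactly: nsb=7 → 215.0, nsb=10 → 214.625, nsb=13 → 214.640625, nsb=16 → 214.64453125.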

I hope this helps.