GEOS-ESM / MAPL

MAPL is a foundation layer of the GEOS architecture, whose original purpose is to supplement the Earth System Modeling Framework (ESMF)
https://geos-esm.github.io/MAPL/
Apache License 2.0
26 stars 17 forks source link

MAPL History quantization metadata should echo NCO (aka CF!) #2926

Closed mathomp4 closed 1 month ago

mathomp4 commented 3 months ago

@czender recently released NCO 5.2.7. From the announcement:

A. Version 5.2.7 implements the final (we think) CF quantization
metadata convention. The main change is that the keywords and
container variable are now "quantization" instead of
"lossy_compression". 
ncks -7 -v ps,ts --qnt_alg=btr --qnt default=9 --qnt ps=13 --cmp='shf|zst' in.nc out.nc
ncks -m -C -v ps,ts,quantization_info out.nc
netcdf out {
...
   float ps(time,lat,lon) ;
      ps:standard_name = "surface_air_pressure" ;
      ps:units = "Pa" ;
      ps:quantization = "quantization_info" ;
      ps:quantization_nsb = 13 ;
      ps:quantization_maximum_relative_error = 6.103516e-05f ;

    char quantization_info ;
      quantization_info:algorithm = "bitround" ;
      quantization_info:implementation = "NCO version 5.2.7" ;

    float ts(time) ;
      ts:standard_name = "surface_temperature" ;
      ts:units = "K" ;
      ts:quantization = "quantization_info" ;
      ts:quantization_nsb = 9 ;
      ts:quantization_maximum_relative_error = 0.0009765625f ;
} // group /
http://nco.sf.net/nco.html#qnt_alg
http://nco.sf.net/nco.html#qnt

A quick look around the CF site leads to:

as being the discussions about this. Reading time!

Since something like this will be in the CF Conventions "soon" (on the scale of CF Convention changes), MAPL should work toward it as well.

Now, at the moment, if we set, say:

  geosgcm_prog.quantize_level: 4,
  geosgcm_prog.quantize_algorithm: 'GranularBR',

in History, we get out what I think are the "what comes from netCDF" metadata:

        float T(time, lev, lat, lon) ;
                T:_QuantizeGranularBitRoundNumberOfSignificantDigits = 4 ;
...

So not quite the same. I mean, some info is there but we should issue our own metadata writes to match the CF Draft including quantization_info...container? (Not sure.)

Some questions:

  1. What do we do about our home-grown bit-shaving? It is sort of like BitGroom, but I think the implementation is slightly different. Does CF have a prescribed way of specifying a new algorithm?
  2. At the moment, MAPL can only apply quantization on a per-collection basis, i.e., if like above, everything in geosgcm_prog gets GranularBR at 4. This is often not ideal (some fields need more precision than others). Eventually this should probably be at a per-field basis but something like that will probably need to wait for the yaml-based MAPL3/History3G. But, once we can do per-field quantization, the nsd is per-variable metadata, but if a collection uses different types of quantization[^1], I assume one just makes quantization_info1, quantization_info2, etc? We'll need to read the draft conventions to figure it out.

[^1]: This might be moot as perhaps using different quantization algorithms per file might just be something we just don't want to support? It does seem odd one might use BitGroom and GranularBR in the same file for different variables.

czender commented 3 months ago

Hi @mathomp4 You raise good questions and I would be excited to help answer what I can once the dust on CF settles (ping me then). For the time-being...

Does CF have a prescribed way of specifying a new algorithm?

No, not yet. But you can propose a new algortithm name/definition to CF. If your BitShave simply sets NSB trailing bits to 0 (rather than IEEE-rounding them to zero like BitRound) then I agree BitShave is the most sensible name for that algorithm, and I suggest simply following the draft conventions and using "BitShave" as the algorithm name in your output. If your BitShave actually sets enough trailing bits to 0 (rather than IEEE-rounding them to zero like BitRound) to achieve NSD digits of precision then I suggest that DigitShave might be a better name for the algorithm.

If a collection uses different types of quantization, I assume one just makes quantization_info1, quantization_info2, etc?

Yes, exactly.

Charlie

mathomp4 commented 3 months ago

@czender Thanks! I'll try my best to get things partly there. If nothing else, I can get the netCDF quantize types in and looking right. And if we get close now, as the PR evolves, we can make changes more easily.

And talking with @tclune it looks like our bit shaving is best described as "LevelBitShave". Bit more complex than basic bit-shaving as it does look at each level separately. Maybe "LevelMeanBitShave"? Probably need a little discussion internally and once happy, work with that. (At the moment our bit-shaving is controlled in a different way than the netCDF quantize. Maybe in MAPL3 we can join the two.)

Oh, and have you figured out a nice formula for maximum_relative_error in the BR cases? I saw what you have for bitgroom (which makes nice math sense), but not sure about BR. (Maybe 10^(-nsd) 🤷🏼 )

czender commented 3 months ago

BitRound has a maximum relative error (MRE) of 2^(-(NSB+1)). The base-10 quantization methods like Granular BitGroom do not have an MRE, AFAICT. They do have a maximum absolute error (MAE). The MAE formula is shown in Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9), 4099-4113, doi:10.5194/gmd-12-4099-2019.

mathomp4 commented 3 months ago

Oooh. Okay. Well, I fooled around with Claude (took a bit of prodding as it kept doing things oddly) and, well, I can duplicate the MAE column in Table 3 in that paper as a function of nsd. It's a weird formula...but it does work. So good on Claude I guess.

That was sort of fun. Probably not the right formula, but, interesting.

mathomp4 commented 3 months ago

@czender Just for fun expressed in Fortran:

   function calculate_mae(nsd) result(mae)
      implicit none
      integer, intent(in) :: nsd
      real(kind=REAL32) :: mae
      real(kind=REAL32) :: mae_base
      integer :: correction

      mae_base = 4.0 * (1.0/16.0)**floor(real(nsd)/2.0) * (1.0/8.0)**ceiling(real(nsd)/2.0)

      if (nsd > 2 .and. mod(nsd, 2) == 0) then
         correction = 2
      else if (nsd == 7) then
         correction = 2
      else
         correction = 1
      end if

      mae = mae_base * correction
   end function calculate_mae

Run this from nsd=1 to nsd=7 and:

nsd: 1, Calculated mae (r4):    0.5000000000
nsd: 2, Calculated mae (r4):    0.0312500000
nsd: 3, Calculated mae (r4):    0.0039062500
nsd: 4, Calculated mae (r4):    0.0004882812
nsd: 5, Calculated mae (r4):    0.0000305176
nsd: 6, Calculated mae (r4):    0.0000038147
nsd: 7, Calculated mae (r4):    0.0000004768

Yay! Table 3! No idea if this actually means anything in a real data set, but it was fun to duplicate it.

It had such a weird nice pattern with the even-odd, but then 7 just sort of messed it up 😄