Closed mathomp4 closed 1 month ago
Hi @mathomp4 You raise good questions and I would be excited to help answer what I can once the dust on CF settles (ping me then). For the time-being...
Does CF have a prescribed way of specifying a new algorithm?
No, not yet. But you can propose a new algortithm name/definition to CF. If your BitShave simply sets NSB trailing bits to 0 (rather than IEEE-rounding them to zero like BitRound) then I agree BitShave is the most sensible name for that algorithm, and I suggest simply following the draft conventions and using "BitShave" as the algorithm name in your output. If your BitShave actually sets enough trailing bits to 0 (rather than IEEE-rounding them to zero like BitRound) to achieve NSD digits of precision then I suggest that DigitShave might be a better name for the algorithm.
If a collection uses different types of quantization, I assume one just makes quantization_info1, quantization_info2, etc?
Yes, exactly.
Charlie
@czender Thanks! I'll try my best to get things partly there. If nothing else, I can get the netCDF quantize types in and looking right. And if we get close now, as the PR evolves, we can make changes more easily.
And talking with @tclune it looks like our bit shaving is best described as "LevelBitShave". Bit more complex than basic bit-shaving as it does look at each level separately. Maybe "LevelMeanBitShave"? Probably need a little discussion internally and once happy, work with that. (At the moment our bit-shaving is controlled in a different way than the netCDF quantize. Maybe in MAPL3 we can join the two.)
Oh, and have you figured out a nice formula for maximum_relative_error
in the BR cases? I saw what you have for bitgroom (which makes nice math sense), but not sure about BR. (Maybe 10^(-nsd) 🤷🏼 )
BitRound has a maximum relative error (MRE) of 2^(-(NSB+1)). The base-10 quantization methods like Granular BitGroom do not have an MRE, AFAICT. They do have a maximum absolute error (MAE). The MAE formula is shown in Delaunay, X., A. Courtois, and F. Gouillon (2019), Evaluation of lossless and lossy algorithms for the compression of scientific datasets in netCDF-4 or HDF5 files, Geosci. Model Dev., 12(9), 4099-4113, doi:10.5194/gmd-12-4099-2019.
Oooh. Okay. Well, I fooled around with Claude (took a bit of prodding as it kept doing things oddly) and, well, I can duplicate the MAE column in Table 3 in that paper as a function of nsd. It's a weird formula...but it does work. So good on Claude I guess.
That was sort of fun. Probably not the right formula, but, interesting.
@czender Just for fun expressed in Fortran:
function calculate_mae(nsd) result(mae)
implicit none
integer, intent(in) :: nsd
real(kind=REAL32) :: mae
real(kind=REAL32) :: mae_base
integer :: correction
mae_base = 4.0 * (1.0/16.0)**floor(real(nsd)/2.0) * (1.0/8.0)**ceiling(real(nsd)/2.0)
if (nsd > 2 .and. mod(nsd, 2) == 0) then
correction = 2
else if (nsd == 7) then
correction = 2
else
correction = 1
end if
mae = mae_base * correction
end function calculate_mae
Run this from nsd=1
to nsd=7
and:
nsd: 1, Calculated mae (r4): 0.5000000000
nsd: 2, Calculated mae (r4): 0.0312500000
nsd: 3, Calculated mae (r4): 0.0039062500
nsd: 4, Calculated mae (r4): 0.0004882812
nsd: 5, Calculated mae (r4): 0.0000305176
nsd: 6, Calculated mae (r4): 0.0000038147
nsd: 7, Calculated mae (r4): 0.0000004768
Yay! Table 3! No idea if this actually means anything in a real data set, but it was fun to duplicate it.
It had such a weird nice pattern with the even-odd, but then 7 just sort of messed it up 😄
@czender recently released NCO 5.2.7. From the announcement:
A quick look around the CF site leads to:
as being the discussions about this. Reading time!
Since something like this will be in the CF Conventions "soon" (on the scale of CF Convention changes), MAPL should work toward it as well.
Now, at the moment, if we set, say:
in History, we get out what I think are the "what comes from netCDF" metadata:
So not quite the same. I mean, some info is there but we should issue our own metadata writes to match the CF Draft including
quantization_info
...container? (Not sure.)Some questions:
nsd
is per-variable metadata, but if a collection uses different types of quantization[^1], I assume one just makesquantization_info1
,quantization_info2
, etc? We'll need to read the draft conventions to figure it out.[^1]: This might be moot as perhaps using different quantization algorithms per file might just be something we just don't want to support? It does seem odd one might use BitGroom and GranularBR in the same file for different variables.