ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0
407 stars 77 forks

Maintain raw precision instead of downcasting to float32 #395

Closed: juntyr closed this 1 day ago

juntyr commented 4 months ago

This is a fix by @dmey to avoid losing precision when using cfgrib. David developed it while working on the field compression project, which I have since taken up.

FussyDuck commented 4 months ago

CLA assistant check
All committers have signed the CLA.

juntyr commented 4 months ago

@dmey I hope you’re ok with me upstreaming your fix; it has greatly benefited me over the past year. By the way, do you need to sign the CLA as well?

dmey commented 4 months ago

Hi @juntyr, sure, no problem. Give it a go and let me know if I need to.

juntyr commented 4 months ago

> Hi @juntyr, sure, no problem. Give it a go and let me know if I need to.

I signed it but the action still shows it as unsigned, so perhaps you need to sign it too?

dmey commented 4 months ago

It should be fine now.

iainrussell commented 3 months ago

Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).

Potential API:

ds = xr.open_dataset('data.grib', engine='cfgrib', backend_kwargs={'values_dtype': 'float64'})

Do you think you could make this modification to the PR, or would you rather leave it as a feature request issue?
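
For illustration, a minimal usage sketch of the option proposed above; the values_dtype keyword and the variable name 't' are placeholders for a suggestion that is not yet implemented in cfgrib:

import xarray as xr

# Proposed (not yet implemented) option: request float64 values instead
# of the current float32 default.
ds64 = xr.open_dataset(
    'data.grib',
    engine='cfgrib',
    backend_kwargs={'values_dtype': 'float64'},
)

# Default behaviour stays float32, so memory use is roughly halved.
ds32 = xr.open_dataset('data.grib', engine='cfgrib')

# The float64 variant needs about twice the memory per field.
print(ds32['t'].dtype, ds32['t'].nbytes)
print(ds64['t'].dtype, ds64['t'].nbytes)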

juntyr commented 3 months ago

> Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).

Good point! What about the new version, where the dtype is inferred from the GRIB messages, so that cfgrib just exposes whatever dtype the GRIB file uses?

iainrussell commented 3 months ago

Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways, including pretty much any number of bits per value (e.g. 3, 11, 18, 24), and these cannot be expressed as numpy dtypes.

When we read the array of values from the GRIB file, we call an ecCodes function that decodes the packed data and always returns an array of float64, regardless of the encoding in the original GRIB message; we then convert it to float32. There is also now an ecCodes function that returns the values as a float32 array, which is in fact what cfgrib should be calling in the first place! So, since there is usually no numpy way of expressing the encoding in the GRIB file, the only choice we have is to ask ecCodes for float32 or float64 (it does the dirty work of the conversion); we can then always convert to another numpy representation if needed. (The current behaviour is inefficient because we ask ecCodes for a float64 array and then cast it to float32.)
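
As a rough sketch of the decode path described above (the file name is a placeholder, and this uses the plain ecCodes Python bindings rather than cfgrib):

import numpy as np
import eccodes

with open('data.grib', 'rb') as f:
    # Read the first message; assumes the file contains at least one.
    gid = eccodes.codes_grib_new_from_file(f)
    try:
        # ecCodes unpacks the values and hands back float64 here,
        # whatever the on-disk packing was.
        values = eccodes.codes_get_values(gid)
        print(values.dtype)  # float64
        # cfgrib currently downcasts this result to float32 afterwards.
        values32 = values.astype(np.float32)
    finally:
        eccodes.codes_release(gid)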

juntyr commented 3 months ago

> Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways, including pretty much any number of bits per value (e.g. 3, 11, 18, 24), and these cannot be expressed as numpy dtypes. [...] The only choice we have is to ask ecCodes for float32 or float64 (it does the dirty work of the conversion); we can then always convert to another numpy representation if needed.

Would there be a way to ask ecCodes to return the values in the closest matching numpy dtype (for now just float32 and float64), or to have cfgrib query the number of bits per value and choose the closest dtype to request from ecCodes based on that?

iainrussell commented 3 months ago

Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.

juntyr commented 3 months ago

> Not so easily, as GRIB files can also encode with things such as JPEG and PNG compression, and there are other more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array so that they can control the precision and memory usage of the resulting xarray.

Thanks for your explanation! In that case, I agree that your proposed API would work best, and I can implement it.

In my use case, evaluating different compression strategies, I do want to know what the native precision of the GRIB file is. Is there a better way than loading the data in float64 precision, then casting to float32 and checking whether any precision was lost?
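
For what it’s worth, a minimal sketch of the round-trip check described above, assuming the values have already been read as float64 (the helper name is made up):

import numpy as np

def float32_roundtrip_is_lossless(values64):
    # Cast down to float32 and back up; if nothing changed, float32 was
    # enough to hold the decoded values exactly. equal_nan=True tolerates
    # NaN-encoded missing values.
    roundtrip = values64.astype(np.float32).astype(np.float64)
    return bool(np.array_equal(values64, roundtrip, equal_nan=True))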

iainrussell commented 3 months ago

Well, you must have ecCodes installed on your system if you have cfgrib there, so the following command can help, if you're feeling strong enough to handle the gory details:

grib_dump -O gfs.t00z.pgrb2.0p25.f000

Look at Section 5 of the output and you will see something like this:

======================   SECTION_5 ( length=49, padding=0 )    ======================
1-4       section5Length = 49
5         numberOfSection = 5
6-9       numberOfValues = 1038240
10-11     dataRepresentationTemplateNumber = 3 [Grid point data - complex packing and spatial differencing (grib2/tables/2/5.0.table) ]
12-15     referenceValue = 2147.02
16-17     binaryScaleFactor = 0
18-19     decimalScaleFactor = 1
20        bitsPerValue = 9
21        typeOfOriginalFieldValues = 0 [Floating point (grib2/tables/2/5.1.table) ]
22        groupSplittingMethodUsed = 1 [General group splitting (grib2/tables/2/5.4.table) ]
23        missingValueManagementUsed = 0 [No explicit missing values included within data values (grib2/tables/2/5.5.table) ]
24-27     primaryMissingValueSubstitute = 1649987994
28-31     secondaryMissingValueSubstitute = 4294967295
32-35     numberOfGroupsOfDataValues = 32875
36        referenceForGroupWidths = 0
37        numberOfBitsUsedForTheGroupWidths = 4
38-41     referenceForGroupLengths = 1
42        lengthIncrementForTheGroupLengths = 1
43-46     trueLengthOfLastGroup = 41
47        numberOfBitsForScaledGroupLengths = 7
48        orderOfSpatialDifferencing = 2 [Second-order spatial differencing (grib2/tables/2/5.6.table) ]
49        numberOfOctetsExtraDescriptors = 2

This is a file downloaded from the source you pointed me to. You can see that it uses 9 bits per value, but on top of that it has complex packing with spatial differencing applied. You're probably sorry you asked now :)
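
The same information can also be read programmatically with the ecCodes Python bindings instead of grib_dump; a sketch, where bitsPerValue and packingType are standard GRIB keys and the file name matches the example above:

import eccodes

with open('gfs.t00z.pgrb2.0p25.f000', 'rb') as f:
    while True:
        gid = eccodes.codes_grib_new_from_file(f)
        if gid is None:  # no more messages
            break
        try:
            name = eccodes.codes_get(gid, 'shortName')
            bits = eccodes.codes_get(gid, 'bitsPerValue')
            packing = eccodes.codes_get(gid, 'packingType')
            print(f'{name}: {bits} bits per value, packingType={packing}')
        finally:
            eccodes.codes_release(gid)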

juntyr commented 1 day ago

I had to move to a different fork and have reopened this PR at #407