@dmey I hope you’re ok with me wanting to upstream your fix, as it’s greatly benefited me over the past year. Btw, do you need to sign the CLA as well?
Hi @juntyr, sure no problem--give it a go and let me know if I need to.
I signed it but the action still shows it as unsigned, so perhaps you need to sign it too?
it should be fine now
Hi @juntyr and @dmey, thanks for this change! However, I think this is something we would prefer to be an option since it will double the memory requirement (and we do have GRIB fields with over 1 billion values).
Potential API:
ds = xr.open_dataset('data.grib', engine='cfgrib', backend_kwargs={'values_dtype': 'float64'})
Do you think you could make this modification to the PR, or would you rather leave it as a request issue?
Good point! What about the new version, where the dtype is inferred from the GRIB messages, so that cfgrib just exposes whatever dtype the GRIB file uses?
Ah, nice idea, but it does not really work. GRIB files can encode their values in a number of ways, including pretty much any number of bits per value (e.g. 3, 11, 18, 24), and these cannot be expressed as numpy arrays.

When we get the array of values from the GRIB file via the ecCodes library, we call an ecCodes function to do so, and this function decodes the file and always returns an array of float64, regardless of what the encoding in the original GRIB message was; then we convert to float32. There is now also a function that returns the values as a float32 array, which in fact cfgrib really should be calling in the first place!

So since there is usually no numpy way of expressing the encoding in the GRIB file, the only choice we have is to ask ecCodes for float32 or float64 (it does the dirty work of the conversion); then we can always convert to another numpy representation if needed. (The current behaviour is inefficient because we ask ecCodes for a float64 array and then cast it to a float32 array.)
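Concretely, the current round trip looks roughly like this (a minimal sketch using the ecCodes Python bindings; the file name is just a placeholder):

import numpy as np
import eccodes

with open("data.grib", "rb") as f:
    gid = eccodes.codes_grib_new_from_file(f)
    try:
        # codes_get_values always decodes the packed data into a float64 array...
        values64 = eccodes.codes_get_values(gid)
        # ...which is then cast down to float32, costing an extra copy
        values32 = values64.astype(np.float32)
    finally:
        eccodes.codes_release(gid)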
Would there be a way to ask ecCodes to return the values in the closest matching numpy dtype (for now just float32 and float64), or to have this functionality inside cfgrib, i.e. query the number of bits and choose the closest dtype to request from ecCodes based on that?
Not so easily, as GRIB files can also encode values with things such as JPEG and PNG compression, and there are other, more complex packing types. In any case, I really do prefer that the user is in control of the dtype of the resulting array, so that they can control the precision and memory usage of the resulting xarray.
Thanks for your explanation! In that case, I agree that your proposed API would work best, and I can implement it.
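For reference, usage of the option could look like this (a sketch only: the values_dtype kwarg is the suggestion above, not an existing cfgrib API, and I assume the default stays float32 as it is today):

import xarray as xr

# default: float32 values, as cfgrib produces today
ds32 = xr.open_dataset("data.grib", engine="cfgrib")
# opt in to float64 to keep the full precision ecCodes decodes,
# at twice the memory for the values
ds64 = xr.open_dataset(
    "data.grib", engine="cfgrib", backend_kwargs={"values_dtype": "float64"}
)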
In my use case, evaluating different compression strategies, I do want to know what the native precision of the GRIB file is. Is there a better way than loading the data in float64 precision, casting to float32, and checking whether any precision was lost?
Well, you must have ecCodes installed on your system if you have cfgrib there, so this command-line call can help, if you're feeling strong enough to handle the gory details:
grib_dump -O gfs.t00z.pgrb2.0p25.f000
Look at Section 5 of the output and you will see something like this:
====================== SECTION_5 ( length=49, padding=0 ) ======================
1-4 section5Length = 49
5 numberOfSection = 5
6-9 numberOfValues = 1038240
10-11 dataRepresentationTemplateNumber = 3 [Grid point data - complex packing and spatial differencing (grib2/tables/2/5.0.table) ]
12-15 referenceValue = 2147.02
16-17 binaryScaleFactor = 0
18-19 decimalScaleFactor = 1
20 bitsPerValue = 9
21 typeOfOriginalFieldValues = 0 [Floating point (grib2/tables/2/5.1.table) ]
22 groupSplittingMethodUsed = 1 [General group splitting (grib2/tables/2/5.4.table) ]
23 missingValueManagementUsed = 0 [No explicit missing values included within data values (grib2/tables/2/5.5.table) ]
24-27 primaryMissingValueSubstitute = 1649987994
28-31 secondaryMissingValueSubstitute = 4294967295
32-35 numberOfGroupsOfDataValues = 32875
36 referenceForGroupWidths = 0
37 numberOfBitsUsedForTheGroupWidths = 4
38-41 referenceForGroupLengths = 1
42 lengthIncrementForTheGroupLengths = 1
43-46 trueLengthOfLastGroup = 41
47 numberOfBitsForScaledGroupLengths = 7
48 orderOfSpatialDifferencing = 2 [Second-order spatial differencing (grib2/tables/2/5.6.table) ]
49 numberOfOctetsExtraDescriptors = 2
This is a file downloaded from the location you pointed me to. You can see that it uses 9 bits per value, but on top of that it has complex packing and spatial differencing applied. You're probably sorry you asked now :)
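If you would rather check this programmatically than read grib_dump output, something along these lines with the ecCodes Python bindings should do it (a rough sketch, using the same file as above):

import eccodes

with open("gfs.t00z.pgrb2.0p25.f000", "rb") as f:
    while True:
        gid = eccodes.codes_grib_new_from_file(f)
        if gid is None:
            break
        try:
            # query the encoding metadata without decoding any values
            name = eccodes.codes_get(gid, "shortName")
            packing = eccodes.codes_get(gid, "packingType")
            bits = eccodes.codes_get(gid, "bitsPerValue")
            print(f"{name}: packingType={packing}, bitsPerValue={bits}")
        finally:
            eccodes.codes_release(gid)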
I had to move to a different fork and have reopened this PR at #407
This is a fix by @dmey to avoid losing precision when using cfgrib; David developed it while working on the field compression project, which I've taken up.