Closed srherbener closed 11 months ago
@srherbener Have you looked at the actual difference in the numbers? Having a tolerance of zero is a recipe for failing tests. The best we should be able to do is around one ulp (FLT_EPSILON) across different architectures. It would be a very simple fix PR.
Note that Intel uses 80-bit extended-precision registers for non-AVX floating-point operations, while I am fairly confident ARM uses 64-bit registers.
An estimate of the error is sqrt(# of floating-point ops) * epsilon, where epsilon is approximately 1.19e-7 for single precision.
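As a rough illustration of that estimate (a sketch; the operation count is a hypothetical value, not taken from any real converter):

```python
import math

# Rough error estimate for an accumulated single-precision result:
#   error ~ sqrt(number of floating-point ops) * machine epsilon
eps = 1.19e-7   # approximate float32 machine epsilon
n_ops = 100     # hypothetical number of floating-point operations
tol = math.sqrt(n_ops) * eps  # ~1.19e-6, a plausible comparison tolerance
```

For a computation of around a hundred operations, this lands right at the 1.e-6 tolerance suggested below.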
@srherbener In test/CMakeLists.txt, change line 453 to something like set(IODA_CONV_COMP_TOL_ZERO "1.e-6") and see if the tests pass. My own personal rant: that tolerance should never have been 0. That's insane.
I agree that a zero tolerance is not particularly useful for floating point values. There is another setting, IODA_CONV_COMP_TOL that is for comparing values with a more reasonable tolerance for single precision floating point values.
The two settings (IODA_CONV_COMP_TOL_ZERO and IODA_CONV_COMP_TOL) came from a request long ago to check, with zero tolerance, the converters that simply copied values from the input file to the output file (with no intermediate calculations). This was done to verify that the copy functions were working properly. The non-zero tolerance was added to cover converters that did any kind of intermediate calculation.
But as you mention, the expectation of merely copying data with a zero tolerance is questionable due to different hardware floating point implementations (precisions).
I think we have two options:
Do you have a preference for one of the options? I'm weakly leaning toward the second option just to keep things simple, but I'm okay either way.
Thanks!
Here is the status of this work:
PR #1392 Fixes the following tests by adjusting the floating point tolerance:
During testing, it was discovered that this test exhibited intermittent failures:
Of the remaining tests, these tests fail because of an incorrect fill value on the MetaData/dateTime variable:
And these tests fail due to a program crash (abort):
For the tests that fail due to the incorrect dateTime fill value, I have discovered that the issue is related to the automatic data type conversion in numpy. Here is an example using the test_iodaconv_prepbufr_ncep_api_adpsfc2ioda test.
The dateTime is coming from the bufr data in DHR which is a float32 value that represents an offset in hours from the DA cycle time. So the code converts the DHR data to seconds in an int64, and then adjusts those values according to the cycle time and the epoch. The data conversion goes okay since the original float values converted to seconds are well within the range of an int64 type.
However, these are numpy masked arrays which contain a specified fill_value. The float32 fill value appears to be picked up from the bufr converter code as std::numeric_limits<float>::max()
which is a value (3.4028235e+38) outside the range of an int64. The conversion of this fill_value from float32 to int64 creates an overflow, but that just silently completes and the result is platform dependent. On Orion the result is -92233... whereas on the Mac the result is +92233... which leads to the test failure.
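The root of the problem can be demonstrated directly with numpy (a sketch; since the actual result of the overflowing cast is platform dependent, only the range check is shown):

```python
import numpy as np

fill = np.finfo(np.float32).max      # 3.4028235e+38, the float32 fill value
int64_max = np.iinfo(np.int64).max   # 9223372036854775807

# The fill value cannot be represented as an int64, so casting it
# overflows; numpy (like C++) leaves the resulting value platform
# dependent, which is why Orion and the Mac M1 disagree.
assert float(fill) > int64_max
```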
Here is the adpsfc python script excerpts following the dhr (dateTime) conversion:
dhr is a float32 at this point with offset values in hours. The dhr fill_value is set to the max float32 value.
Here is the conversion from hours to seconds followed by the conversion of float32 to int64:
This is where the fault occurs.
@PatNichols, @rmclaren I'm looking for some guidance in how to address this fault. There are two issues at play:
Is it acceptable to split this into two steps:
I think step 2 is fairly involved. It probably means either selecting numeric fill values that are in range for int32, int64, float32, and float64, which would allow the automatic conversions to work, or supplying conversion routines in the bufr converter code that are "missing value aware" and giving access to these in the python interface. By "missing value aware" I mean that when a missing value is encountered in the source data, it is simply substituted with the destination data missing value (i.e., no numeric conversion is performed).
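A "missing value aware" conversion along those lines could be sketched like this (convert_missing_aware is a hypothetical helper for illustration, not existing converter code):

```python
import numpy as np

def convert_missing_aware(src, src_fill, dst_dtype, dst_fill):
    """Hypothetical helper: wherever the source missing value appears,
    substitute the destination missing value instead of numerically
    converting it (which could overflow)."""
    src = np.asarray(src)
    missing = (src == src_fill)
    out = np.empty(src.shape, dtype=dst_dtype)
    out[missing] = dst_fill
    out[~missing] = src[~missing].astype(dst_dtype)
    return out

# illustrative use: the float32 fill value maps to -1, not to garbage
converted = convert_missing_aware(
    np.array([1.5, np.finfo(np.float32).max], dtype=np.float32),
    np.finfo(np.float32).max, np.int64, -1)
# converted is [1, -1]
```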
All of the test failures involving the dateTime fill value issue, could be fixed by just reassigning the fill value after the float32 to int64 conversion. This is definitely a hack, but it should be relatively safe since the offset values in seconds should stay well within the int64 range, and we probably don't expect missing values in the dateTime variable.
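The intermediate fix described above might look like this (a sketch; the array contents and the choice of int64 fill value are illustrative only):

```python
import numpy as np

# dhr as read: float32 hours in a masked array whose fill value
# is the (out-of-int64-range) float32 max
dhr = np.ma.masked_array(
    np.array([1.5, 0.0], dtype=np.float32),
    mask=[False, True],
    fill_value=np.finfo(np.float32).max)

# convert hours to seconds, then truncate to int64 (the mask carries over)
secs = (dhr * 3600).astype(np.int64)

# the hack: reassign a fill value that is in range for int64
secs.fill_value = np.iinfo(np.int64).min
```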
What do you think? Is the assumption that missing values won't show up in DHR any good? Is reassigning the fill value an acceptable intermediate fix, or should we try to be more robust? I'm open to suggestions about how step 1 should be addressed.
Thanks!
Which branch is this on (develop?). Does this problem still exist in the feature/query_new_result branch? I think I might have addressed a similar problem (I don't remember exactly)...
The issue is on the develop branch. I think the issue still exists in the feature/query_new_result branch:
I'm not 100% sure I've traced this correctly, but when this is executed:
does it get its fill value set from this:
The line dhr = r.get('obsTimeMinusCycleTime') should be dhr = r.get('obsTimeMinusCycleTime', type='int64'). This will force the type to be int64, and the associated fill_value to be the int64 fill_value.
@srherbener The answer to your question is yes (std::numeric_limits<T>::max() gives the missing value). However, the correct way to fix this is to explicitly state (override) the type for the DHR field, since the metadata for this field is that of a floating-point value, which is why it was read out that way.
Please note that this change will mean we will need to update the testoutput file as well...
@rmclaren thank you for your advice! I totally agree with your suggestion. I'll create a PR for that.
@rmclaren one hitch is that the conversion from hours to seconds needs to be done before converting to an int64 type. Is there a way to do that in the "get" function?
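The hitch can be seen with a small example (illustrative values): asking for int64 up front truncates the fractional hours before the multiply, which loses data.

```python
import numpy as np

dhr_hours = np.float32(1.5)          # 1.5-hour offset from the cycle time

wrong = np.int64(dhr_hours) * 3600   # truncates first: 1 * 3600 = 3600
right = np.int64(dhr_hours * 3600)   # converts first: 5400 seconds
```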
@srherbener So you are saying that the hours are actually floats and we don't want to truncate the fractional part of the number...
Ok so what is being returned by the get function is a numpy masked array (https://numpy.org/doc/stable/reference/maskedarray.generic.html).
This means you can do something like this:
dhr = (r.get('obsTimeMinusCycleTime')*3600).astype(np.int64) # cycle time in seconds as int64
dhr[dhr.mask] = -1
dhr.fill_value = -1
Alternatively you could:
np.ma.set_fill_value(dhr, -1)
# then when you write the result
datetime.writeNPArray.int64(dhr.filled().flatten()) # fills masked values with the fill value
@rmclaren that's correct, the float values include fractions of hours so they need to be converted to seconds before truncating.
I think I have had a misconception about the fill value. You are saying that the masked array has a built-in mask that indicates invalid data values (and that is independent of the data type). The fill value is only used when you call the filled() function to replace the invalid values (marked by the mask) with whatever the fill value is set to. Is this correct?
I'll read up on the masked array too.
Thanks!
@srherbener Correct. If you print the array to the output you will see these parts of the data structure... Calling "filled()" is probably the more correct way to handle things...
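That behavior can be verified with a minimal example (illustrative values):

```python
import numpy as np

a = np.ma.masked_array([1, 2, 3], mask=[False, True, False])
a.fill_value = -999

# the mask alone marks the invalid entry; fill_value is only
# substituted into the data when filled() is called
b = a.filled()   # array([1, -999, 3])
```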
I agree that calling filled() is needed before writing into the output ioda file.
All four related PRs have been merged, so I will close this as completed.
Current behavior (describe the bug)
I'm seeing the following test failures on my Mac M1 when using the develop branch:
These are a mix of what look like precision/tolerance issues, fill value issues, and shared library load issues:
To Reproduce
ctest -R iodaconv
Expected behavior
All tests pass.
Additional information (optional)
This may be a combination of ioda-converters, ioda and spack-stack issues.
PRs that address this issue
This issue can be closed after the following four PRs are merged into develop: