[Bug] Test failures on Mac M1 (arm64 architecture)

srherbener commented 11 months ago

Current behavior (describe the bug)

I'm seeing the following test failures on my Mac M1 when using the develop branch:

  1 706:test_iodaconv_bufr_ncep_1bamua2ioda
  2 707:test_iodaconv_bufr_ncep_1bamua2ioda_n15
  3 708:test_iodaconv_bufr_ncep_esamua2ioda
  4 712:test_iodaconv_prepbufr_ncep_api_adpsfc2ioda
  5 713:test_iodaconv_prepbufr_ncep_api_sfcshp2ioda
  6 725:test_iodaconv_bufr_ncep_mtiasi
  7 726:test_iodaconv_bufr_ncep_atms
  8 732:test_iodaconv_bufr_ncep_prepbufr_adpupa_api
  9 753:test_iodaconv_prepbufr_conv
 10 754:test_iodaconv_mhs_conv
 11 755:test_iodaconv_amsua_conv

These appear to be a mix of precision/tolerance issues, fill value mismatches, and shared library load failures:

30143 DIFFER : VARIABLE : sensorViewAngle : POSITION : [818] : VALUES : 31.659 <> 31.659
30144 DIFFER : VARIABLE : sensorViewAngle : POSITION : [834] : VALUES : -15.003 <> -15.003
30145 DIFFER : VARIABLE : sensorViewAngle : POSITION : [835] : VALUES : -11.67 <> -11.67
30146 DIFFER : VARIABLE : sensorViewAngle : POSITION : [836] : VALUES : -8.337 <> -8.337
30147 DIFFER : VARIABLE : sensorViewAngle : POSITION : [837] : VALUES : -5.004 <> -5.004
30148 DIFFER : VARIABLE : sensorViewAngle : POSITION : [838] : VALUES : -1.671 <> -1.671
30149 DIFFER : VARIABLE : sensorViewAngle : POSITION : [839] : VALUES : 1.662 <> 1.662
30150 DIFFER : VARIABLE : sensorViewAngle : POSITION : [843] : VALUES : 14.994 <> 14.994
30151 DIFFER : VARIABLE : sensorViewAngle : POSITION : [844] : VALUES : 18.327 <> 18.327
30152 DIFFER : VARIABLE : sensorViewAngle : POSITION : [845] : VALUES : 21.66 <> 21.66
30153 DIFFER : VARIABLE : sensorViewAngle : POSITION : [846] : VALUES : 24.993 <> 24.993
30154 DIFFER : VARIABLE : sensorViewAngle : POSITION : [847] : VALUES : 28.326 <> 28.326
30155 Variable        Group     Count         Sum      AbsSum          Min        Max       Range              Mean      StdDev
30156 sensorViewAngle /MetaData   342 0.000457525 0.000510931 -9.53674e-07 3.8147e-06 4.76837e-06 1.33779e-06 1.13627e-06
30157 <end of output>
30158 Test time =   0.32 sec
30159 ----------------------------------------------------------
30160 Test Failed.
30161 "test_iodaconv_bufr_ncep_1bamua2ioda" end time: Sep 26 13:23 MDT
30162 "test_iodaconv_bufr_ncep_1bamua2ioda" time elapsed: 00:00:00
...
30708 Create Quality Marker group
30709 Write data to variables   
30710 end   
30711 DIFFER : VARIABLE : dateTime : ATTRIBUTE : _FillValue : VALUES : 9223372036854775807 <> -9223372036854775808
30712 <end of output>
30713 Test time =   0.39 sec
30714 ----------------------------------------------------------       
30715 Test Failed.   
30716 "test_iodaconv_prepbufr_ncep_api_sfcshp2ioda" end time: Sep 26 13:23 MDT
30717 "test_iodaconv_prepbufr_ncep_api_sfcshp2ioda" time elapsed: 00:00:00
...
753/939 Testing: test_iodaconv_prepbufr_conv
753/939 Test: test_iodaconv_prepbufr_conv
Command: "/opt/homebrew/bin/bash" "/Users/stephenh/projects/CONVERTERS/ioda-bundle/build/bin/iodaconv_comp.sh" "netcdf" "/Users/stephenh/projects/CONVERTERS/ioda-bundle/build/bin/bufr2nc_fortran.x
                            -i testinput -o testrun prepbufr.bufr" "sondes_obs_2020093018.nc4" "0.5e-4"
Directory: /Users/stephenh/projects/CONVERTERS/ioda-bundle/build/iodaconv/test
"test_iodaconv_prepbufr_conv" start time: Sep 26 13:24 MDT
Output:
----------------------------------------------------------
dyld[87352]: dyld cache '(null)' not loaded: syscall to map cache into shared region failed
dyld[87352]: Library not loaded: /usr/lib/libSystem.B.dylib
  Referenced from: <5521570B-21CF-378B-ADF6-DEDFE1A83DC7> /Users/stephenh/projects/CONVERTERS/ioda-bundle/build/bin/bufr2nc_fortran.x
  Reason: tried: '/Users/stephenh/spack-stack/envs/spack-stack-1.5.0/install/apple-clang/14.0.3/nco-5.0.6-5lllupx/lib/libSystem.B.dylib' (no such file), '/Users/stephenh/spack-stack/envs/spack-stack-1.5.0/install/apple-clang/14.0.3/gsl-2.7.1-vgewipw/lib/libSystem.B.dylib' (no such file), '/Users/stephenh/spack-stack/envs/spack-stack-1.5.0/install/apple-clang/14.0.3/antlr-2.7.7-ac3hdng/lib/libSystem.B.dylib' (no such file), '/Users/stephenh/spack-stack/envs/spack-stack-1.5.0/install/apple-clang/14.0.3/fms-release-jcsda-et4rhdk/lib/libSystem.B.dylib' (no such file), ...
'/Users/stephenh/spack-stack/envs/spack-stack-1.5.0/install/apple-clang/14.0.3/zstd-1.5.2-wbngzlm/lib/libSystem.B.dylib' (no such file)
/Users/stephenh/projects/CONVERTERS/ioda-bundle/build/bin/iodaconv_comp.sh: line 22: 87352 Abort trap: 6           $cmd
<end of output>
Test time =   0.06 sec
----------------------------------------------------------
Test Failed.
"test_iodaconv_prepbufr_conv" end time: Sep 26 13:24 MDT
"test_iodaconv_prepbufr_conv" time elapsed: 00:00:00
----------------------------------------------------------

To Reproduce

What computer are you running on?

Mac M1

arm64

What compilers/modules are you using?

apple-clang@14.0.3

openmpi@4.1.5

spack-stack-1.5.0

Steps to reproduce the behavior

clone ioda-bundle
build ioda-bundle
ctest -R iodaconv

Expected behavior

All tests pass.

Additional information (optional)

This may be a combination of ioda-converters, ioda and spack-stack issues.

PRs that address this issue

This issue can be closed after the following four PRs are merged into develop:

[x] #1392
[x] #1395
[x] #1397
[x] #1399

PatNichols commented 11 months ago

@srherbener Have you looked at the actual difference in the numbers? Having a tolerance of zero is a recipe for failing tests. The best we should be able to do is around an ulp (FLOAT_EPSILON) for different architectures. It would be a very simple fix PR.

PatNichols commented 11 months ago

Note that Intel uses 80-bit extended-precision registers for non-AVX floating-point operations, while I am fairly confident ARM uses 64-bit registers.

PatNichols commented 11 months ago

An estimate of the error is sqrt(# of floating ops) * epsilon where epsilon is approx 1.xxe-7
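To put rough numbers on this estimate, here is a minimal Python sketch. It assumes single-precision (float32) data; the helper names `ulp32` and `estimated_error` are hypothetical and not part of ioda-converters, and the operation count of 4 is only an illustrative guess.

```python
import math

# Machine epsilon for IEEE 754 single precision (FLT_EPSILON in C), about 1.19e-7.
FLT_EPSILON = 2.0 ** -23


def ulp32(x: float) -> float:
    """Spacing between adjacent float32 values near x (normal, nonzero x only)."""
    _, exponent = math.frexp(abs(x))   # abs(x) = m * 2**exponent with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)      # float32 carries 24 significand bits


def estimated_error(n_ops: int) -> float:
    """Relative error estimate sqrt(n_ops) * epsilon, as described above."""
    return math.sqrt(n_ops) * FLT_EPSILON


print(FLT_EPSILON)          # single-precision epsilon
print(ulp32(31.659))        # one ulp near a sensorViewAngle value from the log
print(estimated_error(4))   # estimate for a hypothetical 4 floating point operations
```

For comparison, the Max and Min columns in the sensorViewAngle summary (3.8147e-06 and -9.53674e-07) match 2**-18 and -2**-20, which is consistent with differences on the order of one float32 ulp at those magnitudes.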

PatNichols commented 11 months ago

@srherbener In test/CMakeLists.txt, change line 453 to something like: set(IODA_CONV_COMP_TOL_ZERO "1.e-6"). See if the tests pass. My own personal rant is that it should not have been 0. That's insane.

srherbener commented 11 months ago

I agree that a zero tolerance is not particularly useful for floating point values. There is another setting, IODA_CONV_COMP_TOL, which compares values using a more reasonable tolerance for single precision floating point values.

The two settings (IODA_CONV_COMP_TOL_ZERO and IODA_CONV_COMP_TOL) came from a request long ago to check using zero tolerance for converters that simply copied values from the input file to the output file (with no intermediate calculations). This was done to check that the copy functions were working properly. The non zero tolerance was added to cover converters that did any kind of intermediate calculation.

But as you mention, the expectation of merely copying data with a zero tolerance is questionable due to different hardware floating point implementations (precisions).

I think we have two options:

1. Change IODA_CONV_COMP_TOL_ZERO from 0.0 to 1.e-6 as you suggest. If we go with this, we should probably change the name to IODA_CONV_COMP_TOL_COPY or something similar just to be more mnemonic.
2. Delete IODA_CONV_COMP_TOL_ZERO and change everything to use IODA_CONV_COMP_TOL.

Do you have a preference between the two options? I'm weakly leaning toward the second option just to keep things simple, but I'm okay either way.

Thanks!

srherbener commented 11 months ago

Here is the status of this work:

PR #1392 fixes the following tests by adjusting the floating point tolerance:

1 706:test_iodaconv_bufr_ncep_1bamua2ioda
2 707:test_iodaconv_bufr_ncep_1bamua2ioda_n15
3 708:test_iodaconv_bufr_ncep_esamua2ioda
6 725:test_iodaconv_bufr_ncep_mtiasi
7 726:test_iodaconv_bufr_ncep_atms

During testing, it was discovered that this test exhibited intermittent failures:

test_iodaconv_obserror

It turns out that an uninitialized variable was being used for a unit number in a Fortran file open call. PR #1395 fixes this issue.

Of the remaining tests, these tests fail because of an incorrect fill value on the MetaData/dateTime variable:

4 712:test_iodaconv_prepbufr_ncep_api_adpsfc2ioda
5 713:test_iodaconv_prepbufr_ncep_api_sfcshp2ioda
8 732:test_iodaconv_bufr_ncep_prepbufr_adpupa_api

And these tests fail due to a program crash (abort):

9 753:test_iodaconv_prepbufr_conv
10 754:test_iodaconv_mhs_conv
11 755:test_iodaconv_amsua_conv

srherbener commented 11 months ago

For the tests that fail due to incorrect dateTime fill value, I have discovered that the issue is related to the automatic data type conversion in numpy. Here is an example using the test_iodaconv_prepbufr_ncep_api_adpsfc2ioda test.

The dateTime is coming from the bufr data in DHR which is a float32 value that represents an offset in hours from the DA cycle time. So the code converts the DHR data to seconds in an int64, and then adjusts those values according to the cycle time and the epoch. The data conversion goes okay since the original float values converted to seconds are well within the range of an int64 type.

However, these are numpy masked arrays which contain a specified fill_value. The float32 fill value appears to be picked up from the bufr converter code as std::numeric_limits<float>::max() which is a value (3.4028235e+38) outside the range of an int64. The conversion of this fill_value from float32 to int64 creates an overflow, but that just silently completes and the result is platform dependent. On Orion the result is -92233... whereas on the Mac the result is +92233... which leads to the test failure.
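The range problem can be checked without any numpy at all. The following is a small sketch using only the Python standard library; `fits_in_int64` is a hypothetical helper written for illustration, not something from the converter code.

```python
import math
import struct

# Largest finite float32, i.e. what std::numeric_limits<float>::max() returns in C++.
FLOAT32_MAX = struct.unpack("<f", struct.pack("<I", 0x7F7FFFFF))[0]
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1


def fits_in_int64(value: float) -> bool:
    """Return True if truncating a float toward zero yields a valid int64."""
    return math.isfinite(value) and INT64_MIN <= int(value) <= INT64_MAX


print(FLOAT32_MAX)                  # about 3.4028235e+38
print(fits_in_int64(1.5 * 3600))    # a 1.5 hour offset in seconds is in range
print(fits_in_int64(FLOAT32_MAX))   # the float32 fill value is not
print(INT64_MAX, INT64_MIN)         # the two fill values reported by the failing test
```

Python integers are arbitrary precision, so `int(FLOAT32_MAX)` succeeds here. A fixed-width cast has no such room, which is why the two platforms land on opposite ends of the int64 range.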

Here are the excerpts from the adpsfc Python script that lead up to the dhr (dateTime) conversion:

https://github.com/JCSDA-internal/ioda-converters/blob/b89a30007237311408846c8492873048fa4d1f21/test/testinput/prepbufr_adpsfc_api.py#L20

https://github.com/JCSDA-internal/ioda-converters/blob/b89a30007237311408846c8492873048fa4d1f21/test/testinput/prepbufr_adpsfc_api.py#L50

dhr is a float32 at this point with offset values in hours. The dhr fill_value is set to the max float32 value.

Here is the conversion from hours to seconds followed by the conversion of float32 to int64:

https://github.com/JCSDA-internal/ioda-converters/blob/b89a30007237311408846c8492873048fa4d1f21/test/testinput/prepbufr_adpsfc_api.py#L54

This is where the fault occurs.

srherbener commented 11 months ago

@PatNichols, @rmclaren I'm looking for some guidance on how to address this fault. There are two issues at play:

1. The ctests are failing because of the conversion issue described above.
2. If we ever encounter a missing value in a variable that gets converted to an int64 type like this, the converted data values as well as the converted fill_value are going to suffer the same problem. Also, there might be some cases other than this that are at risk.

Is it acceptable to split this into two steps:

1. get the tests passing on all platforms
2. fix the data conversion issue in a robust manner

I think step 2 is fairly involved. It probably means either selecting numeric fill values that are in range for int32, int64, float32 and float64, which would allow the automatic conversions to work, or supplying conversion routines in the bufr converter code that are "missing value aware" and giving access to these in the python interface. By "missing value aware" I mean that when a missing value is encountered in the source data, it is simply substituted with the destination data missing value (i.e., no numeric conversion is performed).
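As a rough illustration of the "missing value aware" idea, here is a pure-Python sketch. The function name `convert_missing_aware`, its parameters, and the choice of the int64 maximum as the destination missing value are all assumptions made for this example; none of this is existing bufr converter API.

```python
from typing import Callable, Iterable, List

FLOAT32_MISSING = 3.4028234663852886e38   # float32 max, used as the source missing value
INT64_MISSING = 2 ** 63 - 1               # assumed destination missing value (int64 max)


def convert_missing_aware(
    values: Iterable[float],
    convert: Callable[[float], int],
    src_missing: float = FLOAT32_MISSING,
    dst_missing: int = INT64_MISSING,
) -> List[int]:
    """Convert each value, substituting dst_missing wherever src_missing appears.

    The source missing value is never passed to `convert`, so no out-of-range
    numeric conversion is attempted.
    """
    return [dst_missing if v == src_missing else convert(v) for v in values]


# Hours (with fractions) to whole seconds, with one missing entry.
hours = [-0.5, FLOAT32_MISSING, 1.25]
seconds = convert_missing_aware(hours, lambda h: int(h * 3600))
print(seconds)
```

The design point is simply that the substitution happens before any arithmetic, so the result does not depend on how a given platform handles an overflowing cast.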

All of the test failures involving the dateTime fill value issue could be fixed by just reassigning the fill value after the float32 to int64 conversion. This is definitely a hack, but it should be relatively safe since the offset values in seconds should stay well within the int64 range, and we probably don't expect missing values in the dateTime variable.

What do you think? Is the assumption that missing values won't show up in DHR any good? Is reassigning the fill value an acceptable intermediate fix, or should we try to be more robust? I'm open to suggestions about how step 1 should be addressed.

Thanks!

rmclaren commented 11 months ago

Which branch is this on (develop?). Does this problem still exist in the feature/query_new_result branch? I think I might have addressed a similar problem (don't remember exactly)...

srherbener commented 11 months ago

The issue is on the develop branch. I think the issue still exists in the feature/query_new_result branch:

https://github.com/JCSDA-internal/ioda-converters/blob/6ae5b34a975a7fb3d200e7f67b00a99ad0bd1f77/src/bufr/DataObject.h#L230

I'm not 100% sure I've traced this correctly, but when this is executed:

https://github.com/JCSDA-internal/ioda-converters/blob/b89a30007237311408846c8492873048fa4d1f21/test/testinput/prepbufr_adpsfc_api.py#L50

does it get its fill value set from this:

https://github.com/JCSDA-internal/ioda-converters/blob/6ae5b34a975a7fb3d200e7f67b00a99ad0bd1f77/src/bufr/DataObject.h#L230

rmclaren commented 11 months ago

The line dhr = r.get('obsTimeMinusCycleTime') should be dhr = r.get('obsTimeMinusCycleTime', type='int64'). This will force the type to be int64, and the associated fill_value to be the int64 fill_value.

rmclaren commented 11 months ago

@srherbener The answer to your question is yes (std::numeric_limits<T>::max() gives the missing value). However, the correct way to fix this is to explicitly state (override) the type for the DHR field, because the metadata for this field describes a floating point value, which is why it was read out that way.

rmclaren commented 11 months ago

Please note that this change will mean we will need to update the testoutput file as well...

srherbener commented 11 months ago

@rmclaren thank you for your advice! I totally agree with your suggestion. I'll create a PR for that.

srherbener commented 11 months ago

@rmclaren one hitch is that the conversion from hours to seconds needs to be done before converting to an int64 type. Is there a way to do that in the "get" function?

rmclaren commented 11 months ago

@srherbener So you are saying that the hours are actually floats and we don't want to truncate the fractional part of the number...

Ok so what is being returned by the get function is a numpy masked array (https://numpy.org/doc/stable/reference/maskedarray.generic.html).

This means you can do something like this:

dhr = (r.get('obsTimeMinusCycleTime')*3600).astype(np.int64)  # cycle time in seconds as int64
dhr[dhr.mask] = -1
dhr.fill_value = -1

Alternatively you could:


np.ma.set_fill_value(dhr, -1)

# then when you write the result
datetime.writeNPArray.int64(dhr.filled().flatten())   # fills masked values with the fill value
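For anyone who wants to try this outside the converter, here is a self-contained sketch with synthetic data standing in for r.get('obsTimeMinusCycleTime'). It is a variation on the snippet above rather than the same code: it calls filled(0.0) before the cast so the out-of-range float32 placeholder is never converted to int64. The constant names and the choice of the int64 maximum as the fill value are assumptions for illustration.

```python
import numpy as np

FLOAT32_MISSING = np.finfo(np.float32).max   # placeholder stored under the mask
INT64_MISSING = np.iinfo(np.int64).max       # assumed int64 missing value

# Synthetic offsets from the cycle time, in hours, with the third entry masked.
hours = np.ma.masked_array(
    np.array([-0.5, 0.25, FLOAT32_MISSING, 1.0], dtype=np.float32),
    mask=[False, False, True, False],
    fill_value=FLOAT32_MISSING,
)

# Put a harmless 0.0 in the masked slots, scale to seconds, then cast.
seconds_data = (hours.filled(0.0) * 3600).astype(np.int64)

# Re-attach the original mask and give the result an int64 fill value.
seconds = np.ma.masked_array(
    seconds_data,
    mask=np.ma.getmaskarray(hours),
    fill_value=INT64_MISSING,
)

print(seconds)            # masked entry is shown as --
print(seconds.mask)       # the mask is carried separately from the data
print(seconds.filled())   # masked entry replaced by the fill value

# Changing the fill value later leaves the mask and the valid data untouched.
seconds.fill_value = -1
print(seconds.filled())
```

The mask marks which entries are invalid regardless of dtype, and the fill value only matters at the point where filled() is called.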

srherbener commented 11 months ago

@rmclaren that's correct, the float values include fractions of hours so they need to be converted to seconds before truncating.

I think I have had a misconception about the fill value. You are saying that the masked array has a builtin mask that indicates invalid data values (and that is independent of the data type). The fill value is only used when you call the filled() function to replace the invalid values (marked by the mask) with what the fill value is set to. Is this correct?

I'll read up on the masked array too.

Thanks!

rmclaren commented 11 months ago

@srherbener Correct. If you print the array to the output you will see these parts of the data structure... Calling "filled()" is probably the more correct way to handle things...

srherbener commented 11 months ago

I agree that calling filled() is needed before writing into the output ioda file.

srherbener commented 11 months ago

All four related PRs have been merged, so I will close this as completed.

JCSDA-internal / ioda-converters