Found more tests that pass inconsistently

AlysonStahl-NOAA commented 4 months ago

There are some tests that are occasionally failing. It's a little hard to tell what's going on, but this time its the actual output that's inconsistent, not just EOL or whitespace like in #137. Typically, if I just rerun the checks it will pass the next time.

I have experienced this issue with the following tests:

run_wgrib2_tests.sh:

echo "*** Testing calculation of wind speed, direction, and UGRD & VGRD components" ../wgrib2/wgrib2 data/gdas.t12z.pgrb2.1p00.anl.75r.grib2 -wind_dir wind.grb -wind_speed wind.grb -match "(UGRD|VGRD)" ../wgrib2/wgrib2 wind.grb > wind.txt cat wind.txt diff -w wind.txt data/ref_wind.gdas.t12z.pgrb2.1p00.anl.75r.grib2.inv ../wgrib2/wgrib2 wind.grb -wind_uv uv.grb ../wgrib2/wgrib2 uv.grb > uv.txt cat uv.txt diff -w uv.txt data/ref_uv.gdas.t12z.pgrb2.1p00.anl.75r.grib2.inv

echo "*** Testing spread output" ../wgrib2/wgrib2 data/ref_simple_packing.grib2 -v2 -spread spread.txt touch spread.txt diff -w data/ref_simple_packing.grib2.spread.txt spread.txt

run_wgrib2_rpn_tests.sh

echo "*** Calculates wind speed for records 1-25, then returns the average, min, and max values" ../wgrib2/wgrib2 data/gdas.t12z.pgrb2.1p00.anl.75r.grib2 -for 1:25 -match ":(UGRD|VGRD):" \ -if ":UGRD:" -rpn "sto_1" -fi \ -if ":VGRD:" -rpn "sto_2" -fi \ -if_reg 1:2 \ -rpn "rcl_1:sq:rcl_2:sq:+:sqrt:clr_1:clr_2" \ -set_var WIND \ -grib_out tmp_windspeed.grb

webisu commented 4 months ago

wind_uv calculates the wind u/v components from the wind speed and direction. This involves sin and cos function calls. The exact values of sin and cos of an angle is determined by the math library used by the code and is not standardized.

Calculating the wind direction also involves a call to the atan routine of the math library and the results will be compiler dependent.

So trig functions are compiler depend +,-,*,/ are reproducible of compiled in strict mode often reproducible if ordinary mode less likely to be reproducible in fast_math mode

old versions of wgrib2 were compiled in fast_math mode latest non-github version of wgrib2 was compiled in strict mode otherwise proj4 would fail its internal testing on a Mac github version seems to compile wgrib2 with default optimization.

On Tue, Jul 9, 2024 at 12:44 PM AlysonStahl-NOAA @.***> wrote:

There are some tests that are occasionally failing. It's a little hard to tell what's going on, but this time its the actual output that's inconsistent, not just EOL or whitespace like in #137 https://github.com/NOAA-EMC/wgrib2/issues/137. Typically, if I just rerun the checks it will pass the next time.

I have experienced this issue with the following tests:

run_wgrib2_tests.sh:

echo "*** Testing calculation of wind speed, direction, and UGRD & VGRD components" ../wgrib2/wgrib2 data/gdas.t12z.pgrb2.1p00.anl.75r.grib2 -wind_dir wind.grb -wind_speed wind.grb -match "(UGRD|VGRD)" ../wgrib2/wgrib2 wind.grb > wind.txt cat wind.txt diff -w wind.txt data/ref_wind.gdas.t12z.pgrb2.1p00.anl.75r.grib2.inv ../wgrib2/wgrib2 wind.grb -wind_uv uv.grb ../wgrib2/wgrib2 uv.grb > uv.txt cat uv.txt diff -w uv.txt data/ref_uv.gdas.t12z.pgrb2.1p00.anl.75r.grib2.inv

echo "*** Testing spread output" ../wgrib2/wgrib2 data/ref_simple_packing.grib2 -v2 -spread spread.txt touch spread.txt diff -w data/ref_simple_packing.grib2.spread.txt spread.txt

run_wgrib2_rpn_tests.sh

echo "*** Calculates wind speed for records 1-25, then returns the average, min, and max values" ../wgrib2/wgrib2 data/gdas.t12z.pgrb2.1p00.anl.75r.grib2 -for 1:25 -match ":(UGRD|VGRD):" -if ":UGRD:" -rpn "sto_1" -fi -if ":VGRD:" -rpn "sto_2" -fi -if_reg 1:2 -rpn "rcl_1:sq:rcl_2:sq:+:sqrt:clr_1:clr_2" -set_var WIND -grib_out tmp_windspeed.grb

— Reply to this email directly, view it on GitHub https://github.com/NOAA-EMC/wgrib2/issues/175, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIB7ZVGZ6K5L3TBLO2SW5DZLQHORAVCNFSM6AAAAABKTHUWHSVHI2DSMVQWIX3LMV43ASLTON2WKOZSGM4TQNRZHEYDAOA . You are receiving this because you are subscribed to this thread.Message ID: @.***>

AlysonStahl-NOAA commented 2 months ago

@webisu @edwardhartnett I found more instances of tests failing. These ones are from the ipolates tests, specifically those testing the conversion to different grid definitions. I've seen this occasionally pop up for different grids, not just the same one. For example here:

*** Testing conversion to NCEP grid definition 129 1:0:d=2024042612:UGRD:0.01 mb:anl: .... 14:816827:d=2024042612:VGRD:0.4 mb:anl: 49c49 < unscaled lat=89843514 to -89843514 lon=0 to 359795455:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,7) lvl2=(255,missing):0.07 mb:anl::lon=10.022727,lat=11.959112,i=876530,ix=50,iy=499,val=-1.25732:lon=20.045455,lat=80.034019,i=1462659,ix=99,iy=832,val=9.74268 --- > unscaled lat=89843514 to -89843514 lon=0 to 359795455:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,7) lvl2=(255,missing):0.07 mb:anl::lon=10.022727,lat=11.959112,i=876530,ix=50,iy=499,val=-0.930981:lon=20.045455,lat=80.034019,i=1462659,ix=99,iy=832,val=10.069

Another example for a different grid:

*** Testing conversion to NCEP grid definition 3 1:0:d=2024042612:UGRD:0.01 mb:anl: ... 14:816827:d=2024042612:VGRD:0.4 mb:anl: 66c66 < unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-7:lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9 --- > unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-8:lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9

The actual differences are in the values labeled as 'val'. Looks like this is also related to calculations involving the u/v components for wind, so I'm keeping these under the same issue.

webisu commented 2 months ago

< unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-7 :lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9

unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-8 :lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9

The values are -7 and -8 which are large differences.

How precision works with grib and wgrib. There are two common "precision models" in grib.

value = (base + I * 2*N) 10**M base is float, iIis an integer, and I >= 0, M, N are integers N and M are determined by the field type (ex RH, TMP, HGT) and perhaps by level (specific humidity) This is the NCEP style.

2 value = (base + I * 2**N) I is an integer and I >= 0 N is an integer and is determined by the grib encoding routine to preserve m binary bits of precision. This is the ECMWF style

An uncommon mode is to store the data as ieee numbers and keep the full precision.

-new_grid preserves the N and M quantities. For the file new_grid_test.grb

@.**:~/wgrib2_lib/wgrib2/build/tests$ wgrib2 new_grid_test.grb -v -packing -for 1:3 1:0:packing=Grid point data - simple packing,s val=(-86+i2^0)10^0, i=0..255 (#bits=8) 2:65339:packing=Grid point data - simple packing,s val=(-79+i2^0)10^0, i=0..255 (#bits=8) 3:130678:packing=Grid point data - simple packing,s val=(-66+i2^0)*10^0, i=0..255 (#bits=8)

N and M are zero. So data is stored as "base + integer". In the above example, the base was zero and the grid values were converted to the nearest integer by -new_grid. So a value of -7 or -8 could have occured because the real value was -7.50001 or 7.49999.

For testing, you want to convert the output to ECMWF style by

-set_bin_prec 16

wgrib2 file.grb -set_bin_prec 16 -new_grid_winds earth -new_grid A B C OUT.grb

Now the grid point values will be written out with 16 binary bits of precision and you can quantify the error. BTW if you want more precision than 16, you do

-set_grib_max_bits N -set_bin_prec N N <= 25 but should use N <= 24 for compatibility with other grib packages. Note: the grib standard does support double precision but I am not aware of any package that supports double precision. However there are some fields that should be stored in double precision such as lat/lon and perhaps anything that involves time. There is an upgrade path for dp grid point values that will allow incremental upgrades.

On Fri, Aug 23, 2024 at 1:44 PM AlysonStahl-NOAA @.***> wrote:

@webisu https://github.com/webisu @edwardhartnett https://github.com/edwardhartnett I found more instances of tests failing. These ones are from the ipolates tests, specifically those testing the conversion to different grid definitions. I've seen this occasionally pop up for different grids, not just the same one. For example here:

*** Testing conversion to NCEP grid definition 129 1:0:d=2024042612:UGRD:0.01 mb:anl: .... 14:816827:d=2024042612:VGRD:0.4 mb:anl: 49c49 < unscaled lat=89843514 to -89843514 lon=0 to 359795455:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,7) lvl2=(255,missing):0.07 mb:anl::lon=10.022727,lat=11.959112,i=876530,ix=50,iy=499,val=-1.25732:lon=20.045455,lat=80.034019,i=1462659,ix=99,iy=832,val=9.74268

unscaled lat=89843514 to -89843514 lon=0 to 359795455:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,7) lvl2=(255,missing):0.07 mb:anl::lon=10.022727,lat=11.959112,i=876530,ix=50,iy=499,val=-0.930981:lon=20.045455,lat=80.034019,i=1462659,ix=99,iy=832,val=10.069

Another example for a different grid:

*** Testing conversion to NCEP grid definition 3 1:0:d=2024042612:UGRD:0.01 mb:anl: ... 14:816827:d=2024042612:VGRD:0.4 mb:anl: 66c66 < unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-7:lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9

unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-8:lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9

The actual differences are in the values labeled as 'val'. Looks like this is also related to calculations involving the u/v components for wind, so I'm keeping these under the same issue.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-EMC/wgrib2/issues/175#issuecomment-2307537294, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIB7ZUV3M2UHRYHJUYXJQTZS5YJHAVCNFSM6AAAAABKTHUWHSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMBXGUZTOMRZGQ . You are receiving this because you were mentioned.Message ID: @.***>

edwardhartnett commented 2 months ago

OK, so these are large differences in the results? Too large to be acceptable?

Are they happening in wgrib2 or ip? @AlexanderRichert-NOAA you may wish to add or check your testing for this.

AlysonStahl-NOAA commented 2 months ago

The other failing tests do not use ip, but they do involve the wind calculations. I'm thinking the error is in there.

webisu commented 2 months ago

Wind_speed uses a sqrt to calculate the wind speed. Sqrt is a machine instruction so should be independent of the library assuming a x86 machine. WInd_dir uses a call to atan(..) which depends on a compiler library. The wind_dir will calculate the wind angle to the nearest degree. So if the angle ix XX.5 degrees could cause problems depending if it rounds up or down.

In the past when I have compared the results of the GNU and Intel compilers, the interpolation was the 1st problem with my automated testing. I never looked deeper.

On Mon, Aug 26, 2024 at 3:29 PM AlysonStahl-NOAA @.***> wrote:

The other failing tests do not use ip, but they do involve the wind calculations. I'm thinking the error is in there.

— Reply to this email directly, view it on GitHub https://github.com/NOAA-EMC/wgrib2/issues/175#issuecomment-2310487783, or unsubscribe https://github.com/notifications/unsubscribe-auth/AEIB7ZUYBMO657JAWWADLTDZTNCTVAVCNFSM6AAAAABKTHUWHSVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDGMJQGQ4DONZYGM . You are receiving this because you were mentioned.Message ID: @.***>

edwardhartnett commented 2 months ago

If this is correct behavior, increase the epsilon.

NOAA-EMC / wgrib2

Found more tests that pass inconsistently #175

< unscaled lat=90000000 to -90000000 lon=0 to 359000000:12Z26apr2024:UGRD U-Component of Wind [m/s]:lvl1=(100,20) lvl2=(255,missing):0.2 mb:anl::lon=10.000000,lat=12.000000,i=36731,ix=11,iy=103,val=-7 :lon=20.000000,lat=80.000000,i=61221,ix=21,iy=171,val=9