Unidata / netcdf-c

Official GitHub repository for netCDF-C libraries and utilities.
BSD 3-Clause "New" or "Revised" License

ncdump_tst_netcdf4_4 fails on x86 #2616

Open hjaekel opened 1 year ago

hjaekel commented 1 year ago

I'm trying to package netcdf-c 4.9.1 on Alpine Linux Edge. The tests pass on all platforms except one test: ncdump_tst_netcdf4_4.

43/249 Test  #51: ncdump_tst_ncgen4 .....................   Passed   15.28 sec
        Start  52: ncdump_tst_netcdf4_4
 44/249 Test  #52: ncdump_tst_netcdf4_4 ..................***Failed    0.06 sec
*** Running extra netcdf-4 tests.
*** running tst_string_data to create test files...
*** Testing strings.
*** creating strings test file tst_string_data.nc...ok.
*** Tests successful!
*** dumping tst_string_data.nc to tst_string_data.cdl...
*** comparing tst_string_data.cdl with ref_tst_string_data.cdl...
*** testing reference file ref_tst_compounds2.nc...
*** testing reference file ref_tst_compounds3.nc...
*** testing reference file ref_tst_compounds4.nc...
--- tst_ncf213.tmp
+++ ref_tst_ncf213.tmp
@@ -34,7 +34,7 @@
    obs_t var5(dim1) ;
        var5:_Storage = "chunked" ;
        var5:_ChunkSizes = 6 ;
-       var5:_Filter = "3|2,36|1,2" ;
+       var5:_Filter = "3|2,40|1,2" ;
        var5:_NoFill = "true" ;

 // global attributes:
        Start  53: ncdump_tst_nccopy4
 45/249 Test  #53: ncdump_tst_nccopy4 ....................   Passed    1.31 sec

I use the following statements to compile and execute the tests:

local _enable_cdf5=ON
case "$CARCH" in
    x86|armhf|armv7) _enable_cdf5=OFF ;;
esac
cmake -B build -G Ninja \
    -DCMAKE_INSTALL_PREFIX=/usr \
    -DCMAKE_INSTALL_LIBDIR=lib \
    -DCMAKE_BUILD_TYPE=None \
    -DENABLE_CDF5=$_enable_cdf5 \
    -DENABLE_DAP_LONG_TESTS=ON \
    -DENABLE_EXAMPLE_TESTS=ON \
    -DENABLE_EXTRA_TESTS=ON \
    -DENABLE_FAILING_TESTS=ON \
    -DENABLE_FILTER_TESTING=ON \
    -DENABLE_LARGE_FILE_TESTS=ON
cmake --build build
cd build
CTEST_OUTPUT_ON_FAILURE=1 ctest -E "nc_test4_tst_large2"

DennisHeimbigner commented 1 year ago

This is a known problem. It has to do with running tests in parallel during make check. There is a race condition that we have not yet found. If you re-run make check, the odds are good that it will work.

edwardhartnett commented 1 year ago

Well this can be fixed by adding the right line to Makefile.am. See https://stackoverflow.com/questions/17172310/make-disable-parallel-building-in-subdirectory-for-single-target-only.

hjaekel commented 1 year ago

We use Ninja, so I guess the change in the Makefile will have no effect. I tried ctest -j 1 and got the same test failure as before. Finally I switched to

CTEST_OUTPUT_ON_FAILURE=1 ctest -R "ncdump_tst_netcdf4_4"
CTEST_OUTPUT_ON_FAILURE=1 ctest -E "ncdump_tst_netcdf4_4|nc_test4_tst_large2"

This should prevent race conditions from occurring. However, the test still fails on x86. This is reproducible and only happens on x86. On all other platforms (aarch64, armhf, armv7, ppc64le and x86_64) the test runs successfully. You can see the ci pipelines here: https://gitlab.alpinelinux.org/hjaekel/aports/-/pipelines/153046

mikpos-84 commented 1 month ago

I have the same FAIL in check (46 PASS and 1 FAIL). I tried to run the script netcdf-c/ncdump/tst_netcdf4_4.sh independently, and I think the problem is related to ncgen. The filter applied to variable 5 changes: var5:_Filter = "3|2,40|1,2" --> var5:_Filter = "3|2,36|1,2". Could you help me? Thanks.

DennisHeimbigner commented 1 month ago

After a quick look, I think this may be a compound type packing problem. Specifically, the middle filter 2 refers to the shuffle filter. It technically has no argument, but apparently the size of the compound type is being included as an argument for the filter. So in this case, the baseline file assumes that the compound type size is 40, but on the platform/compiler you are using, it has a size of 36. I will investigate, if I can, whether my speculation is correct. Do you know what compiler and compiler version you are using?

mikpos-84 commented 1 month ago

Thanks a lot for the support. The compiler version is GCC 4.4.7 20120313 (Red Hat 4.4.7-18).

DennisHeimbigner commented 1 month ago

That is a pretty old version of gcc, I think. I am not sure we can fix the problem if it is a struct type packing issue. Any chance you could test against a much more recent version of gcc? Perhaps you have a similar platform with a newer version of gcc?

mikpos-84 commented 1 month ago

I know that the GCC version is very old, but I can't update it at the moment; it's a constraint. From what I understand, since the failure also occurs when running only the test netcdf-c/ncdump/tst_netcdf4_4.sh, I can rule out a race condition caused by parallel test execution and attribute the failure to the compiler. Is this correct? This would mean that the library compiled in this way is to be considered "corrupted" and could cause problems in use. Thanks.

DennisHeimbigner commented 1 month ago

I have not investigated thoroughly, but yes, in my opinion, the failure is due to a change in the gcc compiler. Presumably as more people use that compiler version, we will start to see reports of similar failures.