Spurious failures with test.67

hmaarrfk commented 2 years ago

To report a non-security related issue, please provide:

the version of the software with which you are encountering an issue: 4.8.1
environmental information (i.e. Operating System, compiler info, java version, python version, etc.): Recent conda-forge environment
a description of the issue with the steps needed to reproduce it: Build and run tests.

For some reasons, it seems that test.67 is failing due to rounding errors.

Is there a way to relax the tolerance, or to skip the test in CIs while the fix is implemented?

Thank you very much for your help

2022-01-18T22:49:55.2133446Z 153: Parallel Performance Test for NASA
2022-01-18T22:49:55.3589607Z 153: *** Testing parallel IO for NASA...
2022-01-18T22:49:55.3590559Z 153: num_proc  MPI mode    access      cache (MB)  grid size   chunks  avg. write time(s)  avg. write bandwidth(MB/s)  num_tries
2022-01-18T22:49:56.5404167Z 153: 4     MPI-IO      independent 1       40x61       contiguous  0.140169        0.664239            2
2022-01-18T22:49:56.6507536Z 187: *** Testing: test.21 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.21
2022-01-18T22:49:57.7011207Z 187: *** Testing: test.50 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.50
2022-01-18T22:49:58.8266840Z 187: *** Testing: test.53 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.53
2022-01-18T22:49:59.5764629Z 153: 4     MPI-IO      independent 1       40x61       61x40       0.135389        0.689269            2
2022-01-18T22:50:00.0023322Z 187: *** Testing: test.55 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.55
2022-01-18T22:50:01.4301522Z 187: *** Testing: test.56 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.56
2022-01-18T22:50:01.9445081Z 153: 4     MPI-IO      independent 1       40x61       31x20       0.096213        1.110586            2
2022-01-18T22:50:02.6089203Z 187: *** Testing: test.57 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.57
2022-01-18T22:50:03.8811705Z 187: *** Testing: test.66 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.66
2022-01-18T22:50:04.8067202Z 187: *** Testing: test.67 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.67
2022-01-18T22:50:05.7020792Z 187: 41c41
2022-01-18T22:50:05.7021613Z 187: <     0.589788025031098, 0.581683089463883, 0.573519986072457, 
2022-01-18T22:50:05.7022702Z 187: ---
2022-01-18T22:50:05.7023192Z 187: >     0.589788025031098, 0.581683089463884, 0.573519986072457, 
2022-01-18T22:50:05.7092990Z 187: *** FAIL:  test.67
2022-01-18T22:50:05.7361715Z 153: 4     MPI-IO      collective  1       40x61       contiguous  1.390005        0.066968            2
2022-01-18T22:50:05.8327858Z 187: *** Testing: test.68 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.68
2022-01-18T22:50:06.8210803Z 187: *** Testing: test.69 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.69
2022-01-18T22:50:08.6198212Z 187: *** Testing: test.01.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.01?f64
2022-01-18T22:50:09.1583404Z 187: *** Testing: test.02.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.02?b[1:2:10]
2022-01-18T22:50:10.2771125Z 187: *** Testing: test.03.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.03?i32[0:1][1:2][0:2]
2022-01-18T22:50:11.4927304Z 153: 4     MPI-IO      collective  1       40x61       61x40       1.583846        0.059221            2
2022-01-18T22:50:11.5621661Z 187: *** Testing: test.04.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.04?types.i32
2022-01-18T22:50:12.6384408Z 187: *** Testing: test.05.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.05?types.floats.f32
2022-01-18T22:50:13.6222597Z 187: *** Testing: test.07.1 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.07?person.age
2022-01-18T22:50:14.7774280Z 187: *** Testing: test.07.3 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.07?person
2022-01-18T22:50:16.0414786Z 187: *** Testing: test.07.4 ; url=[cache][prefetch]https://remotetest.unidata.ucar.edu/dts/test.07?types.f32

See full logs: https://github.com/conda-forge/libnetcdf-feedstock/pull/135 test.67_failure_example.txt

hmaarrfk commented 2 years ago

Tests are being run with:

ctest -VV --output-on-failure -j${CPU_COUNT} ${SKIP}

WardF commented 2 years ago

Tagging @DennisHeimbigner since this is DAP related and there might be something related to that in my blindspot.

DennisHeimbigner commented 2 years ago

This kind of problem occurs in a number of tests that involve float or double comparisons. There is no general fix, although sometimes the ncdump -p flag can be used. As a rule, we have just made the test conditional so that it is not run when this failure is known to occur. Is this being run under windows or mingw?

hmaarrfk commented 2 years ago

No currently being in linux.

Can you please provide me the syntax to skip the tests with ctest I'm not too familiar with your build/test system.

Thank you!

WardF commented 2 years ago

I would need to look up the cmake syntax; currently, test.67 appears to be part of the broader test suite, running as part of test number 187. The numbering of individual tests will vary depending on the underlying configuration options specified; can you tell me what the name of test 187 in this configuration is? There is a cmake/ctest flag we should be able to specify to exclude that test by name.

WardF commented 2 years ago

You can turn off remote DAP testing all together by invoking cmake with the -DENABLE_DAP_REMOTE_TESTS=OFF command-line argument.

hmaarrfk commented 2 years ago

Thanks!

verduijn commented 2 years ago

Just wanted to add another data point. I've seen the same issue while building from source using Docker. As a base amazonlinux:2 is used with gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-13) and after building netcdf-c v4.6.1 the make check fails on the same error. Previously this image build fine. My "fix" was to add --disable-dap-remote-tests to the ./configure step.

Dave-Allured commented 2 years ago

I have also seen the same test.67 failure in several recent builds, all on Mac OS 10.15. These are the only versions that I tested, so other versions are probably also affected in the same way.

4.3.3.1
4.4.1.1
4.6.3
4.8.1 release
recent development snapshots

These failures occurred consistently between 2021 December 23 and 2022 March 6. I have some old testing leftovers and other evidence indicating test.67 was working fine for many years, from inception through 2020 February 26. I have no evidence for the gap from 2020 March through 2021 November. In other words, I can't pin down closely when this started.

From this, I think it is most likely that there was a recent change in behavior in handling test.67 on Unidata's test server, and that there is no particular problem within the local test code in netcdf-c.

ArchangeGabriel commented 2 years ago

I can fill the gap a lot then: on Arch, it worked up to the 2021 October 1st builds (including builds in 2021 July for instance). After that I don’t have builds before this February, so you’ve got a closer point, but with the two of us we can already narrow it down a lot.

Dave-Allured commented 2 years ago

@ArchangeGabriel, thank you, this is helpful. So it seems the relevant change in behavior was between 2021 October 1 and 2021 December 23.

I would like to look into this more closely. Can someone at Unidata please show me how to get a direct copy of the original netCDF data file for test.67, NOT through OpenDAP? Also can you please show me how to access the server-side OpenDAP support code, and its recent change history on remotetest.unidata.ucar.edu? Are you using Unidata/tds, or something else?

DennisHeimbigner commented 2 years ago

The test.67 is produced by the WAR file constructed by the code in this github repository: https://github.com/Unidata/Tds. Specifically, the path is tds/opendap/dtswar. This code is quite old.

The process for accessing e.g. test.67 is a bit complicated. The .dds file is ./src/main/webapp/WEB-INF/resources/testdatasets/dds/test.67. However, the server sends out a synthetic .dods file and specifically the data for 64-bit floats in that .dds are synthesized by generating a sequence of doubles in the file tds/opendap/dtswar/src/main/java/opendap/dts/testEngine.java. and specifically in the function nextFloat64(), which looks like this:

public double nextFloat64() {
    double b = (double) 1000 * Math.cos(tFloat64);

    tFloat64 += 0.01;
    return (b);
}

where the starting value for tFloat64 is 0.0. So, the sequence of values is

1000*cos(0.0)
1000cos(1000cos(# 1)) ... n. 1000cos(1000cos(#n-1))

So perhaps the issue is the cosine function? (seems unlikely)

Dave-Allured commented 2 years ago

Synthesized! I had not really thought about that. Well, I see at least two opportunities for roundoff deviation to creep in. Yes, cosine is one of them. tFloat64 += 0.01 is another because 0.01 is not exactly representable in base 2. Possible causes of deviations are change in hosting CPU type, compiler or runtime math optimizations, and runtime code fixes and improvements. On reflection I think cosine is the more likely culprit. All those Taylor series with their vanishing fractions...

This still does not rule out the possibility of some deviation in the way OpenDAP is communicating the synthesized data to the client.

Rather than diving deeper, I suggest a tentative easy solution that may be good for the long run. Generate a one-time actual netCDF file that will output the expected test.67 pattern through OpenDAP. Install this netCDF file in place of synthesis on remotetest.unidata. Then the future is insulated from possible new math deviations within synthesis. Also, if crafted properly, then all of the old and current distributed netCDF code release versions will start working properly again. As a further benefit, this would exercise more of the active OpenDAP code on the server side.

Attached is one such crafted netCDF file that should regenerate the expected test.67 pattern, for your consideration. Also included is the full-precision CDL file that generated the netCDF file using ncgen. test.67.regen4.zip

DennisHeimbigner commented 2 years ago

Changing the dts war to serve up a fixed file for a single test is, I suspect, a rather large task. It would be easier to do it for all test files, but that too is still a large task. It is worth remembering that the reason for remotetest is to check out the actual operation of the OPeNDAP protocol with a real server. The actual data is not particularly important. So we could modify test.67 to generate all integers instead of double values.

Unidata / netcdf-c

Spurious failures with test.67 #2188