Open hmaarrfk opened 2 years ago
Tests are being run with:
ctest -VV --output-on-failure -j${CPU_COUNT} ${SKIP}
Tagging @DennisHeimbigner since this is DAP related and there might be something related to that in my blindspot.
This kind of problem occurs in a number of tests that involve float or double comparisons. There is no general fix, although sometimes the ncdump -p flag can be used. As a rule, we have just made such a test conditional so that it is not run when this failure is known to occur. Is this being run under Windows or MinGW?
No, currently this is running on Linux.
Could you please provide the syntax to skip the tests with ctest? I'm not too familiar with your build/test system.
Thank you!
I would need to look up the cmake syntax; currently, test.67 appears to be part of the broader test suite, running as part of test number 187. The numbering of individual tests will vary depending on the underlying configuration options specified; can you tell me what the name of test 187 is in this configuration? There is a cmake/ctest flag we should be able to specify to exclude that test by name.
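For reference, ctest can list tests with their numbers and exclude tests by name. The pattern below is a placeholder, since the actual name of test 187 in this configuration is not yet known:

```shell
# List all tests with their numbers, to find what test 187 is called
# in this particular configuration:
ctest -N

# Exclude tests whose names match a regex (placeholder pattern;
# substitute the actual name of test 187 once known):
ctest -VV --output-on-failure -E 'name_of_test_187'
```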
You can turn off remote DAP testing altogether by invoking cmake with the -DENABLE_DAP_REMOTE_TESTS=OFF command-line argument.
Thanks!
Just wanted to add another data point. I've seen the same issue while building from source using Docker. The base image is amazonlinux:2 with gcc (GCC) 7.3.1 20180712 (Red Hat 7.3.1-13); after building netcdf-c v4.6.1, make check fails with the same error. Previously this image built fine. My "fix" was to add --disable-dap-remote-tests to the ./configure step.
I have also seen the same test.67 failure in several recent builds, all on Mac OS 10.15. That is the only version I tested, so other versions may well be affected in the same way.
These failures occurred consistently between 2021 December 23 and 2022 March 6. I have some old testing leftovers and other evidence indicating test.67 was working fine for many years, from inception through 2020 February 26. I have no evidence for the gap from 2020 March through 2021 November. In other words, I can't pin down closely when this started.
From this, I think it is most likely that there was a recent change in behavior in handling test.67 on Unidata's test server, and that there is no particular problem within the local test code in netcdf-c.
I can narrow the gap considerably, then: on Arch, it worked up to the 2021 October 1 builds (including builds from 2021 July, for instance). After that I don't have builds until this February, so you have the closer data point, but between the two of us we can already narrow it down a lot.
@ArchangeGabriel, thank you, this is helpful. So it seems the relevant change in behavior was between 2021 October 1 and 2021 December 23.
I would like to look into this more closely. Can someone at Unidata please show me how to get a direct copy of the original netCDF data file for test.67, NOT through OpenDAP? Also can you please show me how to access the server-side OpenDAP support code, and its recent change history on remotetest.unidata.ucar.edu? Are you using Unidata/tds, or something else?
test.67 is produced by the WAR file constructed by the code in this GitHub repository: https://github.com/Unidata/Tds, specifically under the path tds/opendap/dtswar. This code is quite old.
The process for accessing e.g. test.67 is a bit complicated. The .dds file is ./src/main/webapp/WEB-INF/resources/testdatasets/dds/test.67. However, the server sends out a synthetic .dods file, and specifically the data for 64-bit floats in that .dds are synthesized by generating a sequence of doubles in the file tds/opendap/dtswar/src/main/java/opendap/dts/testEngine.java, specifically in the function nextFloat64(), which looks like this:
public double nextFloat64() {
    double b = (double) 1000 * Math.cos(tFloat64);
    tFloat64 += 0.01;
    return (b);
}
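As a standalone sketch (this class is mine, not part of the Tds repository), the generator above can be reproduced client-side to compare against what the test expects:

```java
// Standalone reproduction of the nextFloat64() generator shown above.
// This class is not part of the Tds repository; it only mirrors the
// arithmetic so the sequence can be inspected locally.
public class Test67Sequence {
    public static void main(String[] args) {
        double tFloat64 = 0.0; // starting value, per testEngine.java
        for (int i = 0; i < 5; i++) {
            double b = (double) 1000 * Math.cos(tFloat64);
            tFloat64 += 0.01;
            System.out.println(b); // first value printed is 1000.0
        }
    }
}
```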
where the starting value for tFloat64 is 0.0. So the sequence of values is 1000*cos(0.0), 1000*cos(0.01), 1000*cos(0.02), and so on.
So perhaps the issue is the cosine function? (seems unlikely)
Synthesized! I had not really thought about that. Well, I see at least two opportunities for roundoff deviation to creep in. Yes, cosine is one of them. tFloat64 += 0.01 is another, because 0.01 is not exactly representable in base 2. Possible causes of deviation include a change in the hosting CPU type, compiler or runtime math optimizations, and runtime code fixes and improvements. On reflection, I think cosine is the more likely culprit: all those Taylor series with their vanishing fractions...
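A quick self-contained demonstration of that second point (this class is not from either repository):

```java
// Demonstration that repeated addition of 0.01 drifts, because 0.01
// has no exact binary representation. Not from either repository.
public class DriftDemo {
    public static void main(String[] args) {
        double t = 0.0;
        for (int i = 0; i < 100; i++) {
            t += 0.01; // each step adds a tiny rounding error
        }
        // 100 * 0.01 should be 1.0, but the accumulated sum is not
        // exactly 1.0 in IEEE 754 double arithmetic.
        System.out.println(t == 1.0); // prints false
        System.out.println(t);
    }
}
```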
This still does not rule out the possibility of some deviation in the way OpenDAP is communicating the synthesized data to the client.
Rather than diving deeper, I suggest a tentative easy solution that may be good for the long run. Generate a one-time actual netCDF file that will output the expected test.67 pattern through OpenDAP. Install this netCDF file in place of synthesis on remotetest.unidata. Then the future is insulated from possible new math deviations within synthesis. Also, if crafted properly, then all of the old and current distributed netCDF code release versions will start working properly again. As a further benefit, this would exercise more of the active OpenDAP code on the server side.
Attached is one such crafted netCDF file that should regenerate the expected test.67 pattern, for your consideration. Also included is the full-precision CDL file that generated the netCDF file using ncgen. test.67.regen4.zip
Changing the dts WAR to serve up a fixed file for a single test is, I suspect, a rather large task. It would be easier to do it for all test files, but that too is still a large task. It is worth remembering that the purpose of remotetest is to check the actual operation of the OPeNDAP protocol against a real server; the actual data is not particularly important. So we could modify test.67 to generate all integers instead of double values.
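If that route were taken, one hypothetical shape for an integer-valued generator (this is my sketch, not code from the Tds repository) might be:

```java
// Hypothetical integer-valued replacement for the generator (my sketch,
// not Tds code): integers up to 2^53 are exactly representable as
// IEEE 754 doubles, so every platform computes bit-identical values.
public class IntegerSequence {
    private long t = 0;

    public double nextFloat64() {
        return (double) t++;
    }

    public static void main(String[] args) {
        IntegerSequence g = new IntegerSequence();
        for (int i = 0; i < 3; i++) {
            System.out.println(g.nextFloat64()); // prints 0.0, 1.0, 2.0
        }
    }
}
```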
For some reason, it seems that test.67 is failing due to rounding errors. Is there a way to relax the tolerance, or to skip the test in CI while the fix is implemented?
Thank you very much for your help
See full logs: https://github.com/conda-forge/libnetcdf-feedstock/pull/135 test.67_failure_example.txt