SCOREC / core

parallel finite element unstructured meshes
Other
179 stars 63 forks source link

field_io test broken with gcc in release mode #360

Closed jacobmerson closed 2 years ago

cwsmith commented 2 years ago

The field_io test reads an (1) input mesh, (2) adds a field, (3) writes the mesh with the attached field, (4) reads the mesh, and (5) verifies that the field matches.

Building with -DCMAKE_BUILD_TYPE=Release results in failed (5) field verification. Building with -DCMAKE_BUILD_TYPE= results in successful (5) field verification.

After step (3) there are two meshes; one created by the release build (releaseBuild), releaseMesh, and another from the non-release build (build), mesh. Modifying the test to skip steps (1-3) and passing both meshes to both builds produces the following field verification results:

build/mesh releaseMesh mesh
releasebuild Fails Passes
build Fails Passes
jacobmerson commented 2 years ago

This is almost certainly some undefined behavior coming out to bite us. We should probably run with UBSAN and ASAN to see if we catch anything. Otherwise this might be a challenge to hunt down.

cwsmith commented 2 years ago

The write with the release build appears to be broken and valgrind doesn't find any errors. I'm checking the input to pcu_write_doubles now to see if it is getting the expected values. If it does, then running the sanitizers will be my next step.

cwsmith commented 2 years ago

It looks like the result of pcu_swap_doubles is different between an release (left) and non-release (right) build. The print statement where there is a diff is here: https://github.com/SCOREC/core/blob/9a419202490f452270b9f970ad61f714279f04b5/pcu/pcu_io.c#L343

diff

The following tests succeed in both builds:

    double d = 42; const double d_orig = 42;
    pcu_swap_64((uint32_t*)&d);
    pcu_swap_64((uint32_t*)&d);
    PCU_ALWAYS_ASSERT(d==d_orig);

but the following fails in only the release build:

      const double d_orig[2] = {42,43};
      double d[2]  = {42,43};
      pcu_swap_doubles(d, 2);
      pcu_swap_doubles(d, 2);
      PCU_ALWAYS_ASSERT(d[0]==d_orig[0]);
      PCU_ALWAYS_ASSERT(d[1]==d_orig[1]);
cwsmith commented 2 years ago

The swapDoubles test added with commit https://github.com/SCOREC/core/commit/e5498e04f9f1f75c880cdc72470c4adc672fc30e fails with a release build and passes with a non-release build.

Release builds with ubsan (-g -fsanitize=undefined) and, separately, asan (-g -fsanitize=address) do not produce any additional information (and still fails the test).

edit/update: The initial test had 32 doubles in the array (n=32). The test fails when n>=2.

cwsmith commented 2 years ago

This should be fixed as of 830be9b.

Update: All the github action tests passed. https://github.com/SCOREC/core/actions/runs/1931426541

jacobmerson commented 2 years ago

@cwsmith does this mean that anyone who saved a mesh with release build has potentially corrupted data? We may want to put a big note or warning somewhere.

cwsmith commented 2 years ago

Yeah, if older/other compilers optimized in the same way or, someone was using latest/newish GCC as done here, then there certainly could be broken smb meshes in the wild.

Any thoughts on where to put the notice of the bug besides expecting the affected users to find it in this issue?

I'm thinking an SMB_VERSION increase to may make sense to aid future debugging of the problem.

https://github.com/SCOREC/core/blob/4dc04098b5ef18cd8a239e2a914c434e9bd6430e/mds/mds_smb.c#L25

jacobmerson commented 2 years ago

Not sure but I definitely think incrementing the smb version is a good idea. In the release notes we can put a comment about why the smb version is incremented.