LLNL / UnifyFS

UnifyFS: A file system for burst buffers

PnetCDF mcoll_perf detects incorrect data #757

Open adammoody opened 1 year ago

adammoody commented 1 year ago

The test/nonblocking/mcoll_perf.c test reports incorrect data when it compares two files that are written in two different ways but should have identical content.

cd test/nonblocking
srun -n2 ./mcoll_perf /unifyfs/testfile.nc
<snip>
P0: diff at line 282 variable[2] var1_2: NC_INT buf1 != buf2 at position 32762

Tracing the pwrite and pread calls under a debugger shows that both ranks write to the same byte offsets without any synchronization in between. In this case, rank 1 writes a fill value and rank 0 later writes actual data. Which value ends up in the file depends on the outcome of that race.

The fill call is here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L521

When filling variable 2, rank 1 writes to (offset=648, length=8) and (offset=680, length=8).

And the write call is here:

https://github.com/Parallel-NetCDF/PnetCDF/blob/bb59553ca3542bc09ead12c6ce4e65b913ef51fa/test/nonblocking/mcoll_perf.c#L526

In that write, rank 0 writes to (offset=640, length=16) and (offset=672, length=16), which overlaps with the region that rank 1 wrote to during the fill operation.

The test case can be fixed by adding a call to ncmpi_sync(ncid) between the fill loop and the put loop:

            for (i=2; i<nvars; i++){
                /* fill record variables to silence valgrind complaining about uninitialised bytes */
                for (j=0; j<array_of_gsizes[0]; j++) {
                    err = ncmpi_fill_var_rec(ncid, varid[i], j);
                    CHECK_ERR
                }
            }
            ncmpi_sync(ncid); // <--- add sync here to fix the test case
            for (i=0; i<nvars; i++){
                err = ncmpi_put_vara_all(ncid, varid[i], starts[i], counts[i], buf[i], bufcounts[i], MPI_INT);
                CHECK_ERR
            }

For reference, here is the sequence of (offset, length) values for writes from different ranks when k==0. There are multiple overlapping writes, one of which is shown below:

offset, length values for writes
--------  -------
rank 0    rank 1
--------  -------
  0, 336
512, 32   544, 32
576, 32   608, 32
640, 8    648, 8  <--- this "fill" by rank 1
  4, 4
672, 8    680, 8
  4, 4
704, 8    712, 8
  4, 4
736, 8    744, 8
  4, 4
656, 8    664, 8
688, 8    696, 8
720, 8    728, 8
752, 8    760, 8
512, 32   544, 32
576, 32   608, 32
640, 16   704, 16  <-- overlaps with this "put" by rank 0
672, 16   736, 16
656, 16   720, 16
688, 16   752, 16