Warwick-Plasma / epoch

Particle-in-cell code for plasma physics simulations
https://epochpic.github.io
GNU General Public License v3.0

Archer2 bug: Loading binary files #536

Open - Status-Mirror opened this issue 1 year ago

Status-Mirror commented 1 year ago

When attempting to set number densities or temperatures using binary files on a very large grid ((47495 x 6447) cells on (256 x 32) processors), EPOCH2D fails to correctly read the files on ranks with high $y$. This error only affects large grids - so far the bug has only been seen in Archer2 simulations, and other computers have failed to initialise the simulation at all due to memory constraints. At this stage, it is not clear whether this is an EPOCH bug or an Archer2 bug.
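
For scale: the global grid holds $47495 \times 6447 \approx 3.06\times10^{8}$ cells, so the corresponding double-precision file is $47495 \times 6447 \times 8 \approx 2.45\times10^{9}$ bytes, while each of the $256 \times 32 = 8192$ ranks only owns about $186 \times 202 \approx 3.8\times10^{4}$ cells locally.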

Binary files for testing have been created using MATLAB, and they are roughly 2.4 GB in size. The MATLAB script which generated this .bin file is provided below:

nx = 47495;
ny = 6447;
dx = 9.307e-9;
dy = 9.307e-9;

x_edges = linspace(0, nx*dx, nx+1);
y_edges = linspace(0, ny*dy, ny+1);

% Generate sinusoidal density
x_centres = 0.5*(x_edges(2:end) + x_edges(1:end-1));
abs_sin_1d = abs(sin(x_centres / (10*dx)));
density = zeros(nx, ny);
for iy = 1:ny 
    density(:,iy) = abs_sin_1d;
end

% Convert 2D density to a binary file
% FORTRAN scans through data in order (1,1), (2,1), ... (nx,1), (1,2), ...
% Hence, write our density array in this order
fileID = fopen('density.bin','w');
for iy = 1:ny
    for ix = 1:nx
        % Write array element in double-precision binary format
        fwrite(fileID, density(ix,iy), 'double');
    end
end
fclose(fileID);
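
As a sanity check on the generated file, a small standalone Fortran program (a sketch only, not EPOCH code, and assuming the file is a raw stream of native little-endian 8-byte reals, which is what MATLAB's fwrite with 'double' produces on x86) can confirm the total size and the first few values:

    ! Standalone sketch, not EPOCH code: verify density.bin on disk.
    ! The expected size is 47495*6447*8 = 2449602120 bytes, so the size
    ! variable must be a 64-bit integer.
    PROGRAM check_density_file
      USE, INTRINSIC :: iso_fortran_env, ONLY: int64, real64
      IMPLICIT NONE
      REAL(real64) :: first_vals(5)
      INTEGER(int64) :: file_bytes
      INTEGER :: iu

      OPEN(NEWUNIT=iu, FILE='density.bin', ACCESS='STREAM', &
          FORM='UNFORMATTED', STATUS='OLD', ACTION='READ')
      INQUIRE(UNIT=iu, SIZE=file_bytes)
      READ(iu) first_vals
      CLOSE(iu)

      PRINT*, 'File size in bytes:', file_bytes
      ! Should match abs(sin(x_centres(1:5) / (10*dx))) from the MATLAB script
      PRINT*, 'First five values :', first_vals
    END PROGRAM check_density_file

If the reported size is not exactly 2,449,602,120 bytes, the file itself is the problem rather than the reader.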

The input deck producing the error requires an absolute path to the binary file - this needs to be changed to match the local working environment. My input deck reads:

begin:control
    nx = 47495
    ny = 6447
    npart = 4*nx*ny
    t_end = 1e-12
    x_max = 863.7128226488906e-6
    x_min = 421.7422341610174e-6
    y_max = 30.0e-6
    y_min = -30.0e-6
    nprocx = 256
    nprocy = 32
end:control

begin:boundaries
    bc_x_min = thermal
    bc_x_max = simple_laser
    bc_y_min = periodic
    bc_y_max = periodic
end:boundaries

begin:species
    name = electrons
    charge = -1
    number_density = '/work/e689/e689/stu/debug/epoch/epoch2d/test/density1.bin'
    frac = 0.5
    mass = 1.0
    temp = 0.0
    bc_x_max = thermal
end:species

begin:output
    dt_snapshot = t_end
    number_density = always
    dump_last = F
end:output

The resulting number density plot is shown below. The Archer2 job ran on 64 nodes with 128 processors per node; a 15-minute runtime is sufficient to write the 0000.sdf dump. For clarity, only values up to $x\approx 430\mu m$ are displayed.

[Figure: number density plot from the 0000.sdf dump]

In this figure, it can be seen that the high-$y$ processors do not load the binary file. However, particles have been loaded into cells which roughly correspond to the processor boundaries. This is shown more clearly by plotting a lineout at constant $x$:

[Figure: lineout of number density at constant $x$]

The high-$y$ spikes correspond to $y$ indices 5236, 5439, 5641, 5843, 6044 and 6247, and the last $y$ index to contain a non-zero density is 5035. Each $y$ processor holds roughly 201.5 cells ($6447/32 \approx 201.47$), so it appears the final 7 processors in $y$ are failing to load particles everywhere except on their boundaries.
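
As a cross-check, dividing the spike indices by the nominal 201.47 cells per rank (the exact EPOCH decomposition may differ by a cell or so) gives $5236/201.47 \approx 26.0$, $5439/201.47 \approx 27.0$, $5641/201.47 \approx 28.0$, $5843/201.47 \approx 29.0$, $6044/201.47 \approx 30.0$ and $6247/201.47 \approx 31.0$, while $5035/201.47 \approx 25.0$ - consistent with the last 7 processor rows in $y$ (indices 25 to 31) loading nothing except their boundary cells.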

Maybe try reading through the binary file loaders and looking for any size limits?
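
On the size-limit point, one observation (a hypothesis only, not a diagnosis): the per-rank element counts passed to MPI_FILE_READ_ALL are tiny, but the global file spans more than $2^{31}-1$ bytes, and any global byte count or offset held in a default (32-bit) Fortran integer would overflow at that size - a known trouble spot for some MPI-IO paths. A standalone sketch (not EPOCH code) that tabulates the relevant numbers:

    ! Standalone sketch, not EPOCH code: check which of the sizes involved
    ! in reading density.bin still fit in a default (32-bit) integer.
    PROGRAM check_size_limits
      USE, INTRINSIC :: iso_fortran_env, ONLY: int32, int64
      IMPLICIT NONE
      INTEGER(int64), PARAMETER :: nx = 47495, ny = 6447, word = 8
      INTEGER(int64), PARAMETER :: nprocx = 256, nprocy = 32
      INTEGER(int64), PARAMETER :: int32_max = INT(HUGE(1_int32), int64)
      INTEGER(int64) :: total_cells, total_bytes, local_cells

      total_cells = nx * ny                               ! 306200265
      total_bytes = total_cells * word                    ! 2449602120
      local_cells = (nx / nprocx + 1) * (ny / nprocy + 1) ! upper bound per rank

      PRINT*, 'cells per rank ', local_cells, ' fits in int32? ', &
          local_cells <= int32_max
      PRINT*, 'global cells   ', total_cells, ' fits in int32? ', &
          total_cells <= int32_max
      PRINT*, 'global bytes   ', total_bytes, ' fits in int32? ', &
          total_bytes <= int32_max
    END PROGRAM check_size_limits

Only the global byte count exceeds the 32-bit range here, so if anything in the read path (the file view, a derived-datatype extent, or an offset calculation) is held in a default integer, it would wrap for this file but not for smaller ones.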

Good luck, Stuart

Status-Mirror commented 1 year ago

I've done some digging, and I have more clues.

EPOCH reads a global binary file and writes the values to a local array using the subroutine load_single_array_from_file in io/simple_io.F90. To inspect what the code was doing during this error, I inserted lines into the source code of this subroutine, just after the call to MPI_FILE_READ_ALL. These lines were:

    DO i = 0,nproc-1
      CALL MPI_BARRIER(comm, errcode)
      IF (RANK == i) PRINT*, "RANK, ARRAY(10,5:9)", RANK, ARRAY(10,5:9)
    END DO
    ! Stop the run once every rank has printed its slice
    CALL MPI_BARRIER(comm, errcode)
    STOP

which has the effect of cycling through the ranks one at a time and printing a slice of ARRAY on each rank once the file has been read. This output included the lines:

 RANK, ARRAY(10,5:9) 0,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 1,  0.,  4*0.10027526670709096
 RANK, ARRAY(10,5:9) 2,  0.,  4*0.24654331551411771
 RANK, ARRAY(10,5:9) 256,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 257,  0.,  4*0.10027526670709096
 RANK, ARRAY(10,5:9) 512,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 768,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 1024,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 6398,  0.,  4*0.35322830472504285
 RANK, ARRAY(10,5:9) 6399,  0.,  4*0.11123303007320126
 RANK, ARRAY(10,5:9) 6400,  0.,  0.43496553411123029,  0.10027526670709096,  0.24654331551411771,  0.56354243133256987
 RANK, ARRAY(10,5:9) 6401,  0.,  0.10027526670709096,  0.24654331551411771,  0.56354243133256987,  0.81238097032662238
 RANK, ARRAY(10,5:9) 8191,  0.,  0.11123303007320126,  0.43496553411123029,  0.19910453764919822,  4.90921027173426838E-2

There are a few points I would like to make here:

Why would the MPI read stop working correctly at high rank numbers? It might be a good idea to replace the binary file with a list of ascending numbers, so that I know exactly which file elements each rank is reading - a sketch of one way to generate such a file is given below.
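
For reference, a sketch of one way to generate such a file (in Fortran rather than MATLAB, writing the same raw stream of native 8-byte reals): element $k$ of the file, counted from zero in the order the reader scans it, simply holds the value $k$, so every rank can report exactly which file elements it received. The filename ascending.bin is just a placeholder.

    ! Sketch: write nx*ny doubles whose value equals their 0-based position
    ! in the file, as a raw stream matching the MATLAB-generated layout.
    PROGRAM write_ascending
      USE, INTRINSIC :: iso_fortran_env, ONLY: int64, real64
      IMPLICIT NONE
      INTEGER(int64), PARAMETER :: nx = 47495, ny = 6447
      REAL(real64) :: row(nx)
      INTEGER(int64) :: ix, iy
      INTEGER :: iu

      OPEN(NEWUNIT=iu, FILE='ascending.bin', ACCESS='STREAM', &
          FORM='UNFORMATTED', STATUS='REPLACE', ACTION='WRITE')
      DO iy = 0, ny - 1
        DO ix = 1, nx
          ! Values up to nx*ny-1 ~ 3.1e8 are represented exactly in real64
          row(ix) = REAL(iy * nx + ix - 1, real64)
        END DO
        WRITE(iu) row    ! one x-row at a time, i.e. x-fastest (Fortran) order
      END DO
      CLOSE(iu)
    END PROGRAM write_ascending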