Warwick-Plasma / epoch

Particle-in-cell code for plasma physics simulations
https://epochpic.github.io
GNU General Public License v3.0

Archer2 bug: Loading binary files #536

Open - Status-Mirror opened this issue 1 year ago

Status-Mirror commented 1 year ago

When attempting to set number densities or temperatures using binary files on a very large grid ((47495 x 6447) cells on (256 x 32) processors), EPOCH2D fails to correctly read the files on ranks with high $y$. This error only affects large grids - so far the bug has only been seen in Archer2 simulations, and other computers have failed to initialise the simulation at all due to memory constraints. At this stage, it is not clear whether this is an EPOCH bug or an Archer2 bug.
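
For scale: the global grid holds $47495 \times 6447 \approx 3.06\times10^{8}$ cells, so the corresponding double-precision file is $47495 \times 6447 \times 8 \approx 2.45\times10^{9}$ bytes, while each of the $256 \times 32 = 8192$ ranks only owns about $186 \times 202 \approx 3.8\times10^{4}$ cells locally.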

Binary files for testing have been created using MATLAB, and they are roughly 2.4 GB in size. The MATLAB script which generated this .bin file is provided below:

nx = 47495;
ny = 6447;
dx = 9.307e-9;
dy = 9.307e-9;

x_edges = linspace(0, nx*dx, nx+1);
y_edges = linspace(0, ny*dy, ny+1);

% Generate sinusoidal density
x_centres = 0.5*(x_edges(2:end) + x_edges(1:end-1));
abs_sin_1d = abs(sin(x_centres / (10*dx)));
density = zeros(nx, ny);
for iy = 1:ny 
    density(:,iy) = abs_sin_1d;
end

% Convert 2D density to a binary file
% FORTRAN scans through data in order (1,1), (2,1), ... (nx,1), (1,2), ...
% Hence, write our density array in this order
fileID = fopen('density.bin','w');
for iy = 1:ny
    for ix = 1:nx
        % Write array element in double-precision binary format
        fwrite(fileID, density(ix,iy), 'double');
    end
end
fclose(fileID);
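
As a sanity check on the generated file, a small standalone Fortran program (a sketch only, not EPOCH code, and assuming the file is a raw stream of native little-endian 8-byte reals, which is what MATLAB's fwrite with 'double' produces on x86) can confirm the total size and the first few values:

    ! Standalone sketch, not EPOCH code: verify density.bin on disk.
    ! The expected size is 47495*6447*8 = 2449602120 bytes, so the size
    ! variable must be a 64-bit integer.
    PROGRAM check_density_file
      USE, INTRINSIC :: iso_fortran_env, ONLY: int64, real64
      IMPLICIT NONE
      REAL(real64) :: first_vals(5)
      INTEGER(int64) :: file_bytes
      INTEGER :: iu

      OPEN(NEWUNIT=iu, FILE='density.bin', ACCESS='STREAM', &
          FORM='UNFORMATTED', STATUS='OLD', ACTION='READ')
      INQUIRE(UNIT=iu, SIZE=file_bytes)
      READ(iu) first_vals
      CLOSE(iu)

      PRINT*, 'File size in bytes:', file_bytes
      ! Should match abs(sin(x_centres(1:5) / (10*dx))) from the MATLAB script
      PRINT*, 'First five values :', first_vals
    END PROGRAM check_density_file

If the reported size is not exactly 2,449,602,120 bytes, the file itself is the problem rather than the reader.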

The input deck producing the error requires an absolute path to the binary file - this needs to be changed to match the local working environment. My input deck reads:

begin:control
    nx = 47495
    ny = 6447
    npart = 4*nx*ny
    t_end = 1e-12
    x_max = 863.7128226488906e-6
    x_min = 421.7422341610174e-6
    y_max = 30.0e-6
    y_min = -30.0e-6
    nprocx = 256
    nprocy = 32
end:control

begin:boundaries
    bc_x_min = thermal
    bc_x_max = simple_laser
    bc_y_min = periodic
    bc_y_max = periodic
end:boundaries

begin:species
    name = electrons
    charge = -1
    number_density = '/work/e689/e689/stu/debug/epoch/epoch2d/test/density1.bin'
    frac = 0.5
    mass = 1.0
    temp = 0.0
    bc_x_max = thermal
end:species

begin:output
    dt_snapshot = t_end
    number_density = always
    dump_last = F
end:output

The resulting number density plot is shown below. The Archer2 job ran on 64 nodes with 128 processors per node; a 15-minute runtime is sufficient to write the 0000.sdf dump. For clarity, only values up to $x\approx 430\mu m$ are displayed.

[Figure: number density plot from the 0000.sdf dump]

In this figure, it can be seen that the high-$y$ processors do not load the binary file. However, particles have been loaded into cells which roughly correspond to the processor boundaries. This is shown more clearly by plotting a lineout at constant $x$:

[Figure: lineout of number density at constant $x$]

The high-$y$ spikes correspond to $y$ indices 5236, 5439, 5641, 5843, 6044 and 6247, and the last $y$ index to contain a non-zero density is 5035. Each $y$ processor holds roughly 201.5 cells ($6447/32 \approx 201.47$), so it appears the final 7 processors in $y$ are failing to load particles everywhere except on their boundaries.
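
As a cross-check, dividing the spike indices by the nominal 201.47 cells per rank (the exact EPOCH decomposition may differ by a cell or so) gives $5236/201.47 \approx 26.0$, $5439/201.47 \approx 27.0$, $5641/201.47 \approx 28.0$, $5843/201.47 \approx 29.0$, $6044/201.47 \approx 30.0$ and $6247/201.47 \approx 31.0$, while $5035/201.47 \approx 25.0$ - consistent with the last 7 processor rows in $y$ (indices 25 to 31) loading nothing except their boundary cells.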

Maybe try reading through the binary file loaders and looking for any size limits?
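
On the size-limit point, one observation (a hypothesis only, not a diagnosis): the per-rank element counts passed to MPI_FILE_READ_ALL are tiny, but the global file spans more than $2^{31}-1$ bytes, and any global byte count or offset held in a default (32-bit) Fortran integer would overflow at that size - a known trouble spot for some MPI-IO paths. A standalone sketch (not EPOCH code) that tabulates the relevant numbers:

    ! Standalone sketch, not EPOCH code: check which of the sizes involved
    ! in reading density.bin still fit in a default (32-bit) integer.
    PROGRAM check_size_limits
      USE, INTRINSIC :: iso_fortran_env, ONLY: int32, int64
      IMPLICIT NONE
      INTEGER(int64), PARAMETER :: nx = 47495, ny = 6447, word = 8
      INTEGER(int64), PARAMETER :: nprocx = 256, nprocy = 32
      INTEGER(int64), PARAMETER :: int32_max = INT(HUGE(1_int32), int64)
      INTEGER(int64) :: total_cells, total_bytes, local_cells

      total_cells = nx * ny                               ! 306200265
      total_bytes = total_cells * word                    ! 2449602120
      local_cells = (nx / nprocx + 1) * (ny / nprocy + 1) ! upper bound per rank

      PRINT*, 'cells per rank ', local_cells, ' fits in int32? ', &
          local_cells <= int32_max
      PRINT*, 'global cells   ', total_cells, ' fits in int32? ', &
          total_cells <= int32_max
      PRINT*, 'global bytes   ', total_bytes, ' fits in int32? ', &
          total_bytes <= int32_max
    END PROGRAM check_size_limits

Only the global byte count exceeds the 32-bit range here, so if anything in the read path (the file view, a derived-datatype extent, or an offset calculation) is held in a default integer, it would wrap for this file but not for smaller ones.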

Good luck, Stuart

Status-Mirror commented 1 year ago

I've done some digging, and I have more clues.

EPOCH reads a global binary file and writes the values to a local array using the subroutine load_single_array_from_file in io/simple_io.F90. To inspect what the code was doing during this error, I inserted lines into the source code of this subroutine, just after the call to MPI_FILE_READ_ALL. These lines were:

    DO i = 0,nproc-1
      CALL MPI_BARRIER(comm, errcode)
      IF (RANK == i) PRINT*, "RANK, ARRAY(10,5:9)", RANK, ARRAY(10,5:9)
    END DO
    ! Stop the run once every rank has printed its slice
    CALL MPI_BARRIER(comm, errcode)
    STOP

which has the effect of cycling through the ranks one at a time and printing a slice of ARRAY on each rank once the file has been read. This output included the lines:

 RANK, ARRAY(10,5:9) 0,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 1,  0.,  4*0.10027526670709096
 RANK, ARRAY(10,5:9) 2,  0.,  4*0.24654331551411771
 RANK, ARRAY(10,5:9) 256,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 257,  0.,  4*0.10027526670709096
 RANK, ARRAY(10,5:9) 512,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 768,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 1024,  0.,  4*0.43496553411123029
 RANK, ARRAY(10,5:9) 6398,  0.,  4*0.35322830472504285
 RANK, ARRAY(10,5:9) 6399,  0.,  4*0.11123303007320126
 RANK, ARRAY(10,5:9) 6400,  0.,  0.43496553411123029,  0.10027526670709096,  0.24654331551411771,  0.56354243133256987
 RANK, ARRAY(10,5:9) 6401,  0.,  0.10027526670709096,  0.24654331551411771,  0.56354243133256987,  0.81238097032662238
 RANK, ARRAY(10,5:9) 8191,  0.,  0.11123303007320126,  0.43496553411123029,  0.19910453764919822,  4.90921027173426838E-2

There are a few points I would like to make here:

Why would the MPI read stop working correctly at high rank numbers? It might be a good idea to replace the binary file with a list of ascending numbers, so that I know exactly which file elements each rank is reading - a sketch of one way to generate such a file is given below.
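
For reference, a sketch of one way to generate such a file (in Fortran rather than MATLAB, writing the same raw stream of native 8-byte reals): element $k$ of the file, counted from zero in the order the reader scans it, simply holds the value $k$, so every rank can report exactly which file elements it received. The filename ascending.bin is just a placeholder.

    ! Sketch: write nx*ny doubles whose value equals their 0-based position
    ! in the file, as a raw stream matching the MATLAB-generated layout.
    PROGRAM write_ascending
      USE, INTRINSIC :: iso_fortran_env, ONLY: int64, real64
      IMPLICIT NONE
      INTEGER(int64), PARAMETER :: nx = 47495, ny = 6447
      REAL(real64) :: row(nx)
      INTEGER(int64) :: ix, iy
      INTEGER :: iu

      OPEN(NEWUNIT=iu, FILE='ascending.bin', ACCESS='STREAM', &
          FORM='UNFORMATTED', STATUS='REPLACE', ACTION='WRITE')
      DO iy = 0, ny - 1
        DO ix = 1, nx
          ! Values up to nx*ny-1 ~ 3.1e8 are represented exactly in real64
          row(ix) = REAL(iy * nx + ix - 1, real64)
        END DO
        WRITE(iu) row    ! one x-row at a time, i.e. x-fastest (Fortran) order
      END DO
      CLOSE(iu)
    END PROGRAM write_ascending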