Status-Mirror opened this issue 1 year ago
I've done some digging, and I have more clues.
EPOCH reads a global binary file and writes the values to a local array using the subroutine `load_single_array_from_file` in `io/simple_io.F90`. To inspect what the code was doing during this error, I inserted lines into the source code of this subroutine, just after the call to `MPI_FILE_READ_ALL`. These lines were:
```fortran
DO i = 0,nproc-1
  CALL MPI_BARRIER(comm, errcode)
  IF (RANK == i) PRINT*, "RANK, ARRAY(10,5:9)", RANK, ARRAY(10,5:9)
  IF (RANK == nproc-1) STOP
END DO
```
which had the effect of cycling through the ranks, one at a time, and outputting the values in `ARRAY` after the file was read. This output included the lines:
```
RANK, ARRAY(10,5:9) 0, 0., 4*0.43496553411123029
RANK, ARRAY(10,5:9) 1, 0., 4*0.10027526670709096
RANK, ARRAY(10,5:9) 2, 0., 4*0.24654331551411771
RANK, ARRAY(10,5:9) 256, 0., 4*0.43496553411123029
RANK, ARRAY(10,5:9) 257, 0., 4*0.10027526670709096
RANK, ARRAY(10,5:9) 512, 0., 4*0.43496553411123029
RANK, ARRAY(10,5:9) 768, 0., 4*0.43496553411123029
RANK, ARRAY(10,5:9) 1024, 0., 4*0.43496553411123029
RANK, ARRAY(10,5:9) 6398, 0., 4*0.35322830472504285
RANK, ARRAY(10,5:9) 6399, 0., 4*0.11123303007320126
RANK, ARRAY(10,5:9) 6400, 0., 0.43496553411123029, 0.10027526670709096, 0.24654331551411771, 0.56354243133256987
RANK, ARRAY(10,5:9) 6401, 0., 0.10027526670709096, 0.24654331551411771, 0.56354243133256987, 0.81238097032662238
RANK, ARRAY(10,5:9) 8191, 0., 0.11123303007320126, 0.43496553411123029, 0.19910453764919822, 4.90921027173426838E-2
```
There are a few points I would like to make here:

- `ARRAY(10,5) = 0` on all ranks. Does this routine not write to `ARRAY(10,1:5)` due to ghost cells?
- `ARRAY(10,6:9)` are constant on each rank, as expected.
- Why would the MPI routines stop working for high rank numbers? It might be a good idea to replace the binary file with a list of ascending numbers, so that I know exactly which numbers each rank is reading (a sketch of such a generator follows this list).
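On that last point, here is a minimal sketch of such a generator, assuming the loader expects a raw stream of 8-byte reals in Fortran (column-major) order with no record markers. The program name, the filename `ascending.bin`, and that layout assumption are mine for illustration, not taken from the original MATLAB script:

```fortran
! Hypothetical test-file generator (standing in for the original MATLAB
! script): writes nx*ny ascending 8-byte reals as a raw byte stream in
! Fortran (column-major) order, one constant-y strip at a time.
PROGRAM write_ascending_bin
  IMPLICIT NONE
  INTEGER, PARAMETER :: num = KIND(1.d0)
  INTEGER, PARAMETER :: nx = 47495, ny = 6447
  REAL(num) :: strip(nx)
  INTEGER :: ix, iy, fu

  ! ACCESS='stream' writes raw bytes with no Fortran record markers,
  ! matching a file produced by MATLAB's fwrite (assumption).
  OPEN(NEWUNIT=fu, FILE='ascending.bin', ACCESS='stream', &
      FORM='unformatted', STATUS='replace')

  DO iy = 1, ny
    DO ix = 1, nx
      ! Each value is the element's 1-based position in the file, so a
      ! printout of ARRAY shows exactly which file elements a rank read.
      strip(ix) = REAL(ix, num) + REAL(iy - 1, num) * REAL(nx, num)
    END DO
    WRITE(fu) strip
  END DO

  CLOSE(fu)
END PROGRAM write_ascending_bin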
When attempting to set number densities or temperatures using binary files on very large grids ((47495 x 6447) cells on (256 x 32) processors), EPOCH2D fails to correctly read the files on ranks with high $y$. This error only affects large grids - so far, this bug has only been found in Archer2 simulations. Other computers have failed to initialise the simulation due to memory constraints. At this stage, it is not clear whether this is an EPOCH bug or an Archer2 bug.
Binary files for testing have been created using MATLAB, and they are ~2GB in size. The MATLAB script which generated this .bin file is provided below:
The input deck producing the error requires an absolute path to the binary file - this path needs to be changed to match the current working environment. My input deck reads:
The resulting number density plot has been provided below. The Archer2 job ran using 64 nodes, with 128 processors per node. A 15-minute runtime is sufficient to write the 0000.sdf dump. For clarity, only values up to $x \approx 430\,\mu m$ are displayed.
In this figure, it can be seen that the high-$y$ processors do not load the binary file. However, we can see that particles have been loaded into cells which roughly correspond to the processor boundaries. This is displayed more clearly by plotting a lineout at constant $x$:
The high-$y$ spikes correspond to $y$ indices 5236, 5439, 5641, 5843, 6044, 6247, and the last $y$ index to contain a non-zero density is 5035. Each $y$ processor spans ~201.4 cells, so we can say that the final 7 processors in $y$ are failing to load the particles, except on their boundaries.
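One concrete size limit that may be worth ruling out here (this is my assumption, not a confirmed diagnosis): the file is larger than a 32-bit signed integer can count in bytes,

$$
47495 \times 6447 \times 8~\text{bytes} = 2{,}449{,}602{,}120~\text{bytes} > 2^{31}~\text{bytes} = 2{,}147{,}483{,}648~\text{bytes},
$$

so any byte offset or displacement computed in a default Fortran INTEGER, rather than an INTEGER(KIND=MPI_OFFSET_KIND), would overflow partway up the grid, i.e. only on high-$y$ ranks.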
Maybe try reading through the binary file loaders, and look for any size limits?
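If it helps with that, below is a minimal, self-contained sketch of the kind of collective read that `load_single_array_from_file` appears to perform: each rank describes its block of the global grid with a subarray datatype and then calls `MPI_FILE_READ_ALL`. The subroutine name `read_block`, its argument list, and the assumption of a raw file of 8-byte reals in column-major order are mine for illustration; this is not EPOCH's actual routine.

```fortran
! Sketch only: collective read of one rank's (nx_local x ny_local) block
! of a global (nx_global x ny_global) array of 8-byte reals from a raw
! binary file. Names and interface are hypothetical, not EPOCH's.
SUBROUTINE read_block(filename, array, nx_global, ny_global, &
    nx_local, ny_local, x_start, y_start, comm)

  USE mpi
  IMPLICIT NONE

  INTEGER, PARAMETER :: num = KIND(1.d0)
  CHARACTER(LEN=*), INTENT(IN) :: filename
  INTEGER, INTENT(IN) :: nx_global, ny_global, nx_local, ny_local
  INTEGER, INTENT(IN) :: x_start, y_start   ! 0-based offsets of this block
  INTEGER, INTENT(IN) :: comm
  REAL(num), INTENT(OUT) :: array(nx_local, ny_local)

  INTEGER :: fh, subtype, errcode
  INTEGER, DIMENSION(2) :: sizes, subsizes, starts
  ! The displacement passed to MPI_FILE_SET_VIEW must be 64-bit
  ! (MPI_OFFSET_KIND); a byte offset computed in a default 32-bit
  ! INTEGER would overflow once the file passes 2^31 bytes.
  INTEGER(KIND=MPI_OFFSET_KIND) :: disp

  sizes    = (/ nx_global, ny_global /)
  subsizes = (/ nx_local,  ny_local  /)
  starts   = (/ x_start,   y_start   /)

  ! Datatype describing where this rank's block sits in the global array
  CALL MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, &
      MPI_ORDER_FORTRAN, MPI_DOUBLE_PRECISION, subtype, errcode)
  CALL MPI_TYPE_COMMIT(subtype, errcode)

  disp = 0_MPI_OFFSET_KIND

  CALL MPI_FILE_OPEN(comm, TRIM(filename), MPI_MODE_RDONLY, &
      MPI_INFO_NULL, fh, errcode)
  CALL MPI_FILE_SET_VIEW(fh, disp, MPI_DOUBLE_PRECISION, subtype, &
      'native', MPI_INFO_NULL, errcode)
  CALL MPI_FILE_READ_ALL(fh, array, nx_local * ny_local, &
      MPI_DOUBLE_PRECISION, MPI_STATUS_IGNORE, errcode)
  CALL MPI_FILE_CLOSE(fh, errcode)

  CALL MPI_TYPE_FREE(subtype, errcode)

END SUBROUTINE read_block
```

Written this way, the only 64-bit quantity is the displacement handed to `MPI_FILE_SET_VIEW`. If the real loader ever forms a byte offset in a default INTEGER before it reaches MPI, that would be one way a ~2 GB file could fail only on high ranks - but that is a guess to check against the source.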
Good luck, Stuart