tl;dr: the cause of this issue is a mismatch between the logic that distributes particles to worker ranks during HDF5 reading and the amount of memory reserved on those ranks to receive the particle data. The workaround is to avoid using VR_MPI_REDUCE=OFF.
Before Particle data is read, VR figures out the number of particles that need to be read, how they will be distributed across ranks, and allocates the memory on each rank to hold them. This calculation yields different results when using VR_MPI_REDUCE=OFF. With the example data quoted in the issue description I get (running with 2 ranks):
VR_MPI_REDUCE=ON
[0000] [ 0.243] [ info] main.cxx:187 There are 1661132 particles in total that require 253.469 [MiB]
[0000] [ 0.244] [ info] main.cxx:189 There are 830548 baryon particles in total that require 126.732 [MiB]
[0000] [ 0.244] [ info] mpihdfio.cxx:52 Loading HDF header info in header group: Header
[0000] [ 0.246] [ info] mpiroutines.cxx:239 Z-curve Mesh MPI decomposition:
[0000] [ 0.246] [ info] mpiroutines.cxx:240 Mesh has resolution of 8 per spatial dim
[0000] [ 0.246] [ info] mpiroutines.cxx:241 with each mesh spanning (0.781, 0.781, 0.781)
[0000] [ 0.246] [ info] mpiroutines.cxx:242 MPI tasks :
[0000] [ 0.246] [ info] mpiroutines.cxx:244 Task 0 has 0.500 of the volume
[0000] [ 0.246] [ info] mpiroutines.cxx:244 Task 1 has 0.500 of the volume
[0000] [ 0.884] [ warn] mpiroutines.cxx:366 Suggested number of particles per mpi processes is roughly > 1e7
[0000] [ 0.884] [ warn] mpiroutines.cxx:367 Number of MPI tasks greater than this suggested number
[0000] [ 0.884] [ warn] mpiroutines.cxx:368 May result in poor performance
[0000] [ 0.884] [ info] mpiroutines.cxx:296 MPI imbalance of 0.734
[0000] [ 0.884] [ info] mpiroutines.cxx:298 Imbalance too large, adjusting MPI domains ...
[0000] [ 0.884] [ info] mpiroutines.cxx:331 Now have MPI imbalance of 0.008
[0000] [ 0.884] [ info] mpiroutines.cxx:332 MPI tasks:
[0000] [ 0.884] [ info] mpiroutines.cxx:334 Task 0 has 0.518 of the volume
[0000] [ 0.884] [ info] mpiroutines.cxx:334 Task 1 has 0.482 of the volume
[0000] [ 0.885] [ info] mpihdfio.cxx:52 Loading HDF header info in header group: Header
[0000] [ 0.886] [ info] mpiroutines.cxx:239 Z-curve Mesh MPI decomposition:
[0000] [ 0.886] [ info] mpiroutines.cxx:240 Mesh has resolution of 8 per spatial dim
[0000] [ 0.886] [ info] mpiroutines.cxx:241 with each mesh spanning (0.781, 0.781, 0.781)
[0000] [ 0.886] [ info] mpiroutines.cxx:242 MPI tasks :
[0000] [ 0.886] [ info] mpiroutines.cxx:244 Task 0 has 0.500 of the volume
[0000] [ 0.886] [ info] mpiroutines.cxx:244 Task 1 has 0.500 of the volume
[0000] [ 1.509] [ info] main.cxx:226 Have allocated enough memory for 578257 particles requiring 88.235 [MiB]
[0000] [ 1.509] [ info] main.cxx:229 Have allocated enough memory for 0 baryons particles requiring 0 [B]
[0001] [ 1.507] [ info] main.cxx:226 Have allocated enough memory for 1248987 particles requiring 190.580 [MiB]
[0001] [ 1.507] [ info] main.cxx:229 Have allocated enough memory for 0 baryons particles requiring 0 [B]
[0001] [ 1.507] [ info] main.cxx:233 Will also require additional memory for FOF algorithms and substructure search. Largest mem needed for preliminary FOF search. Rough estimate is 34.651 [MiB]
[0000] [ 1.509] [ info] main.cxx:233 Will also require additional memory for FOF algorithms and substructure search. Largest mem needed for preliminary FOF search. Rough estimate is 16.043 [MiB]
VR_MPI_REDUCE=OFF
[0000] [ 0.247] [ info] main.cxx:187 There are 1661132 particles in total that require 253.469 [MiB]
[0000] [ 0.247] [ info] main.cxx:189 There are 830548 baryon particles in total that require 126.732 [MiB]
[0000] [ 0.247] [ info] mpihdfio.cxx:52 Loading HDF header info in header group: Header
[0000] [ 0.249] [ info] mpiroutines.cxx:239 Z-curve Mesh MPI decomposition:
[0000] [ 0.249] [ info] mpiroutines.cxx:240 Mesh has resolution of 8 per spatial dim
[0000] [ 0.249] [ info] mpiroutines.cxx:241 with each mesh spanning (0.781, 0.781, 0.781)
[0000] [ 0.249] [ info] mpiroutines.cxx:242 MPI tasks :
[0000] [ 0.249] [ info] mpiroutines.cxx:244 Task 0 has 0.500 of the volume
[0000] [ 0.249] [ info] mpiroutines.cxx:244 Task 1 has 0.500 of the volume
[0000] [ 0.250] [ info] main.cxx:226 Have allocated enough memory for 830566 particles requiring 126.734 [MiB]
[0000] [ 0.250] [ info] main.cxx:229 Have allocated enough memory for 415274 baryons particles requiring 63.366 [MiB]
[0000] [ 0.250] [ info] main.cxx:233 Will also require additional memory for FOF algorithms and substructure search. Largest mem needed for preliminary FOF search. Rough estimate is 25.347 [MiB]
[0001] [ 0.247] [ info] main.cxx:226 Have allocated enough memory for 830566 particles requiring 126.734 [MiB]
[0001] [ 0.248] [ info] main.cxx:229 Have allocated enough memory for 415274 baryons particles requiring 63.366 [MiB]
[0001] [ 0.248] [ info] main.cxx:233 Will also require additional memory for FOF algorithms and substructure search. Largest mem needed for preliminary FOF search. Rough estimate is 25.347 [MiB]
The main difference here is in how the MPI domain is divided, with a different number of particles associated to each rank (578257/1248987 when ON, 830566/830566 when OFF). There is also a difference in the baryon counts: 0 baryon particles are reported with VR_MPI_REDUCE=ON versus 415274 baryon particles on both ranks with VR_MPI_REDUCE=OFF. I haven't looked into that yet (it might just be a reporting error).
Later on, the HDF5 reading code seems to read particle data in ~50 MB chunks (312500 Particle objects), which are then sent to the worker rank that will process them. However, the number of chunks that are read and sent to the worker ranks doesn't seem to take the MPI decomposition information above into account, which leads to the error with VR_MPI_REDUCE=OFF (log statements added locally):
[0000] [ 3.568] [debug] mpiroutines.cxx:1355 Sending data for 312500 particles to rank 1
[0001] [ 3.566] [debug] mpiroutines.cxx:2112 Receiving 312500 particles from rank 0
[0000] [ 5.521] [debug] mpiroutines.cxx:1355 Sending data for 312500 particles to rank 1
[0001] [ 5.519] [debug] mpiroutines.cxx:2112 Receiving 312500 particles from rank 0
[0000] [ 5.939] [debug] mpiroutines.cxx:1355 Sending data for 312500 particles to rank 1
[0001] [ 6.195] [debug] mpiroutines.cxx:2112 Receiving 312500 particles from rank 0
[bolano:52761] Read -1, expected 50000000, errno = 14
[bolano:52761] *** Process received signal ***
[bolano:52761] Signal: Segmentation fault (11)
This doesn't happen with VR_MPI_REDUCE=ON:
[0000] [ 1.817] [debug] hdfio.cxx:1126 Opening Dataset PartType5/SubgridMasses
[0000] [ 4.924] [debug] mpiroutines.cxx:1355 Sending data for 312500 particles to rank 1
[0001] [ 4.922] [debug] mpiroutines.cxx:2112 Receiving 312500 particles from rank 0
[0000] [ 7.434] [debug] mpiroutines.cxx:1355 Sending data for 312500 particles to rank 1
[0001] [ 7.432] [debug] mpiroutines.cxx:2112 Receiving 312500 particles from rank 0
[0000] [ 7.853] [debug] mpiroutines.cxx:1355 Sending data for 312500 particles to rank 1
[0001] [ 8.122] [debug] mpiroutines.cxx:2112 Receiving 312500 particles from rank 0
[0000] [ 8.328] [debug] hdfio.cxx:3085 Sending data for 197943 particles to rank 1
[0001] [ 8.326] [debug] mpiroutines.cxx:2112 Receiving 197943 particles from rank 0
[0000] [ 8.511] [ info] io.cxx:127 Done loading input data
I will investigate this further and post any updates here.
After further reading, I learned that the HDF5 reading functions do inspect the MPI decomposition information (the global mpi_domain array) to determine which ranks should receive which Particles, which seems to be the correct way of doing things. With VR_MPI_REDUCE=OFF, however, there is a disconnect between this MPI decomposition information and the particle/baryon counts that each rank thinks it will work with (the, again global, Nlocal, Nmemlocal, et al.). These particle counts are used to allocate the vectors that will contain the Particles, but they are also used in many other places, so getting this information right should be the solution to this problem.
With VR_MPI_REDUCE=ON both the MPI domain decomposition and the particle counts are calculated internally through this call in main.cxx, apparently in a consistent manner:

```c++
MPINumInDomain(opt);
```
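For comparison, here is a purely hypothetical outline (not VR's actual implementation; cell_of and mpi_domain only stand in for the Z-curve mesh and the global domain array) of what a decomposition-consistent count looks like: the same cell-to-rank mapping used to route particles also sizes the local buffers.

```c++
#include <array>
#include <functional>
#include <vector>

// Illustrative only: these are not VR's interfaces.
long count_local_particles(const std::vector<std::array<double, 3>> &positions,
                           const std::vector<int> &mpi_domain,  // cell index -> owning rank
                           const std::function<int(const std::array<double, 3> &)> &cell_of,
                           int this_rank)
{
    long n = 0;
    for (const auto &pos : positions)
        if (mpi_domain[cell_of(pos)] == this_rank) ++n;  // same mapping used for routing
    return n;  // an ON-style Nlocal, by construction consistent with the decomposition
}
```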
On the other hand, with VR_MPI_REDUCE=OFF they are disconnected:

```c++
MPIDomainExtent(opt);
MPIDomainDecomposition(opt);
Nlocal=nbodies/NProcs*MPIProcFac;
Nmemlocal=Nlocal;
Nlocalbaryon[0]=nbaryons/NProcs*MPIProcFac;
Nmemlocalbaryon=Nlocalbaryon[0];
NExport=NImport=Nlocal*MPIExportFac;
```
In particular, Nlocal and the other counts are not calculated based on mpi_domain anymore, but solely on the number of ranks. Nmemlocal is what is used to allocate the vectors of Particles, hence the invalid writes afterwards.
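As a sanity check on the numbers (a small sketch, assuming MPIProcFac is effectively 1 in this run, which is what the logged allocations suggest):

```c++
#include <cstdio>

int main() {
    const long nbodies  = 1661132;   // totals reported at the top of the log
    const long nbaryons = 830548;
    const int  NProcs   = 2;
    const double MPIProcFac = 1.0;   // assumed value; the logged counts match it

    // OFF-style counts: identical on every rank, independent of mpi_domain
    const long Nlocal       = static_cast<long>(nbodies  / NProcs * MPIProcFac);  // 830566
    const long Nlocalbaryon = static_cast<long>(nbaryons / NProcs * MPIProcFac);  // 415274
    std::printf("OFF: %ld particles and %ld baryons allocated per rank\n",
                Nlocal, Nlocalbaryon);

    // The ON-style counts instead come out of MPINumInDomain() and the adjusted
    // decomposition: 578257 / 1248987 in this run, i.e. rank 1 needs ~1.5x the
    // OFF-style allocation, which is where the overflow comes from.
    return 0;
}
```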
I tried a quick fix, basically resizing the original vector of Particles when more particles than expected are received via MPI (see the sketch after the crash log below); this, however, only moved the problem further down the line, as expected, given that the particle counts are used in other contexts:
[0001] [ 26.538] [ info] search.cxx:316 MPI search will require extra memory of 0 [B]
[0001] [ 26.538] [ info] mpiroutines.cxx:3322 Now building exported particle list for FOF search
double free or corruption (out)
[bolano:30959] *** Process received signal ***
[bolano:30959] Signal: Aborted (6)
[bolano:30959] Signal code: (-6)
[bolano:30959] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14bb0)[0x7f577935ebb0]
[bolano:30959] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7f5778e128cb]
[bolano:30959] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x116)[0x7f5778df7864]
[bolano:30959] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89af6)[0x7f5778e5aaf6]
[bolano:30959] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9246c)[0x7f5778e6346c]
[bolano:30959] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x94328)[0x7f5778e65328]
[bolano:30959] [ 6] builds/56/stf(_ZN9__gnu_cxx13new_allocatorIiE10deallocateEPim+0x24)[0x55fec20982d6]
[bolano:30959] [ 7] builds/56/stf(_ZNSt16allocator_traitsISaIiEE10deallocateERS0_Pim+0x2f)[0x55fec2097870]
[bolano:30959] [ 8] builds/56/stf(_ZNSt12_Vector_baseIiSaIiEE13_M_deallocateEPim+0x36)[0x55fec209654c]
[bolano:30959] [ 9] builds/56/stf(_ZNSt12_Vector_baseIiSaIiEED2Ev+0x42)[0x55fec2094c00]
[bolano:30959] [10] builds/56/stf(_ZNSt6vectorIiSaIiEED2Ev+0x45)[0x55fec2094c55]
[bolano:30959] [11] builds/56/stf(_Z35MPIBuildParticleExportListUsingMeshR7OptionsxPN5NBody8ParticleERPxRPid+0x769)[0x55fec21ad549]
[bolano:30959] [12] builds/56/stf(_Z13SearchFullSetR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERx+0x1b50)[0x55fec21d9640]
[bolano:30959] [13] builds/56/stf(main+0xf5a)[0x55fec20876c4]
[bolano:30959] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf2)[0x7f5778df9cb2]
[bolano:30959] [15] builds/56/stf(_start+0x2e)[0x55fec208646e]
[bolano:30959] *** End of error message ***
free(): invalid size
[bolano:30958] *** Process received signal ***
[bolano:30958] Signal: Aborted (6)
[bolano:30958] Signal code: (-6)
[bolano:30958] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x14bb0)[0x7ff34812fbb0]
[bolano:30958] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7ff347be38cb]
[bolano:30958] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x116)[0x7ff347bc8864]
[bolano:30958] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89af6)[0x7ff347c2baf6]
[bolano:30958] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9246c)[0x7ff347c3446c]
[bolano:30958] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x93e94)[0x7ff347c35e94]
[bolano:30958] [ 6] builds/56/stf(_ZN9__gnu_cxx13new_allocatorIiE10deallocateEPim+0x24)[0x55a281ecf2d6]
[bolano:30958] [ 7] builds/56/stf(_ZNSt16allocator_traitsISaIiEE10deallocateERS0_Pim+0x2f)[0x55a281ece870]
[bolano:30958] [ 8] builds/56/stf(_ZNSt12_Vector_baseIiSaIiEE13_M_deallocateEPim+0x36)[0x55a281ecd54c]
[bolano:30958] [ 9] builds/56/stf(_ZNSt12_Vector_baseIiSaIiEED2Ev+0x42)[0x55a281ecbc00]
[bolano:30958] [10] builds/56/stf(_ZNSt6vectorIiSaIiEED2Ev+0x45)[0x55a281ecbc55]
[bolano:30958] [11] builds/56/stf(_Z35MPIBuildParticleExportListUsingMeshR7OptionsxPN5NBody8ParticleERPxRPid+0x769)[0x55a281fe4549]
[bolano:30958] [12] builds/56/stf(_Z13SearchFullSetR7OptionsxRSt6vectorIN5NBody8ParticleESaIS3_EERx+0x1b50)[0x55a282010640]
[bolano:30958] [13] builds/56/stf(main+0xf5a)[0x55a281ebe6c4]
[bolano:30958] [14] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf2)[0x7ff347bcacb2]
[bolano:30958] [15] builds/56/stf(_start+0x2e)[0x55a281ebd46e]
[bolano:30958] *** End of error message ***
--------------------------------------------------------------------------
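For reference, the quick fix I tried was along these lines (a sketch, not the actual patch; the names are illustrative rather than VR's real ones):

```c++
#include <vector>

// Grow the destination vector when an incoming chunk would not fit, instead of
// trusting the Nmemlocal-based allocation.
template <typename ParticleT>
void append_chunk(std::vector<ParticleT> &parts, long &nlocal, long nrecv)
{
    if (nlocal + nrecv > static_cast<long>(parts.size()))
        parts.resize(nlocal + nrecv);   // avoids the invalid write while reading ...
    // ... receive nrecv particles into &parts[nlocal] here (e.g. via MPI_Recv) ...
    nlocal += nrecv;
    // ... but everything else that was sized from Nlocal/Nmemlocal (e.g. the FOF
    // export lists in the backtrace above) still uses the wrong counts, hence the
    // later "double free" / "invalid size" aborts.
}
```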
On a separate point, I still don't have a clear idea of what the effect of VR_MPI_REDUCE is meant to be. The documentation says that it should "Reduce impact of MPI memory overhead at the cost of extra cpu cycles", but this is a very generic statement. The code suggests that, when OFF, the MPI domain ends up broken into equally-sized sections, but I'm probably not understanding the intent correctly.
As described in https://github.com/ICRAR/VELOCIraptor-STF/issues/54#issuecomment-745611275 by @MatthieuSchaller:
This issue is to keep track of the last sentence. Indeed, when running with -DVR_MPI_REDUCE=OFF, the following crash happens: