Open MatthewChristoff opened 3 years ago
UPDATE: The issue was found. Because CSR uses MTVs to store its particle data and continually makes and destroys them, these `get` calls were leaving a few smart pointers to the original set of data. Thus, when rebuilding, CSR was using 3x the memory of `ptcl_data` instead of just 2x. Currently, this has been fixed by enclosing these `get` calls in a for loop, thus causing these smart pointers to go out of scope before the call to `migrate`/`rebuild`.
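To illustrate the lifetime problem described above, here is a minimal sketch that uses `std::shared_ptr` as a stand-in for the reference-counted handles involved. `ParticleStore`, `Buffer`, and both helper functions are hypothetical names invented for this example; they are not the actual CSR or MTV code, and the real `get` and `rebuild` signatures may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for one reference-counted particle buffer.
using Buffer = std::shared_ptr<std::vector<double>>;

// Hypothetical stand-in for a particle structure that owns its data.
struct ParticleStore {
  Buffer data;

  explicit ParticleStore(std::size_t n)
      : data(std::make_shared<std::vector<double>>(n, 0.0)) {}

  // Returns a handle that shares ownership of the current buffer.
  Buffer get() const { return data; }

  // Allocates a new buffer, copies the old contents into it, and drops the
  // store's own reference to the old buffer. Returns a weak reference to the
  // old buffer so callers can check whether it was actually freed.
  std::weak_ptr<std::vector<double>> rebuild(std::size_t new_size) {
    assert(data != nullptr);
    std::weak_ptr<std::vector<double>> old = data;
    auto fresh = std::make_shared<std::vector<double>>(new_size, 0.0);
    std::copy_n(data->begin(), std::min(data->size(), new_size), fresh->begin());
    data = std::move(fresh);  // the old buffer is freed only if no handle remains
    return old;
  }
};

// Problem pattern: a handle from get() is still alive during rebuild,
// so the old buffer stays allocated alongside the new one.
bool old_buffer_survives_with_live_handle() {
  ParticleStore store(1000);
  Buffer handle = store.get();
  (*handle)[0] = 1.0;
  auto old = store.rebuild(2000);
  return !old.expired();
}

// Fixed pattern: the handle lives in an inner scope that ends before rebuild,
// so the old buffer is released as soon as the store lets go of it.
bool old_buffer_survives_with_scoped_handle() {
  ParticleStore store(1000);
  {
    Buffer handle = store.get();
    (*handle)[0] = 1.0;
  }
  auto old = store.rebuild(2000);
  return !old.expired();
}
```

In this sketch the first helper returns `true` (the stale handle keeps the old buffer allocated) and the second returns `false`. Any enclosing scope, such as the body of a for loop, has the same effect on handle lifetime.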
A general fix has been proposed and is currently underway whereby a second copy of `ptcl_data` would be stored at all times for swapping purposes (like SCS) for both CSR and CabanaM.
Once CSR has its swapping implementation done, we could probably close this issue, although the issue is still technically there for cases in which CSR increases in size so that it triggers a full rebuild.
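As a rough sketch of the swapping idea, assuming it works like a simple double buffer: the structure keeps a spare buffer, writes the rebuilt layout into it, and then swaps it with the active one. `SwappingStore` and its members are hypothetical names for illustration only and do not reflect the actual SCS, CSR, or CabanaM implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical double-buffered particle store.
class SwappingStore {
 public:
  explicit SwappingStore(std::vector<double> values)
      : active_(std::move(values)), spare_(active_.size(), 0.0) {}

  // Rebuilds into the spare buffer and swaps it with the active one.
  // new_index[i] is the position in the current active buffer of the
  // particle that should end up in slot i.
  // Returns true when the spare buffer was reused without growing,
  // false when the new size exceeded its capacity (a full rebuild).
  bool rebuild(const std::vector<std::size_t>& new_index) {
    const bool reused = new_index.size() <= spare_.capacity();
    spare_.resize(new_index.size());  // reallocates only when growing past capacity
    for (std::size_t i = 0; i < new_index.size(); ++i) {
      assert(new_index[i] < active_.size());
      spare_[i] = active_[new_index[i]];
    }
    active_.swap(spare_);  // constant-time exchange, no element copies
    return reused;
  }

  const std::vector<double>& active() const { return active_; }

 private:
  std::vector<double> active_;
  std::vector<double> spare_;
};
```

With this layout a rebuild that fits in the spare buffer needs no new allocation, so peak usage stays at two copies. The `false` return corresponds to the growth case mentioned above, where a larger buffer still has to be allocated.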
I've been working on the `cabmBuild` branch and noticed that we have some unexpected behavior while testing `CSR`. A new version of the testing code, `ps_combo.cpp`, was made to test larger amounts of data per particle, `ps_combo32.cpp` (which uses a size 32 array of doubles for each particle instead of the original size 3 array). This is linked here.

During comparative testing for `CabM` on AiMOS, it was found that `CSR` fails with an `out of memory` error at 50,000 elements and 50,000,000 particles. The error message is included below:

However, both the `SCS` and `CabM` particle structures do not fail until our next set of tests at 75,000 elements and 75,000,000 particles. We investigated and attempted to run `ps_combo32` again with the number of iterations at line 89 (originally 100) reduced to 1.

In this case, all three particle structures failed due to an `out of memory` error at 75,000 elements and 75,000,000 particles (see the EDIT below). This leads me to suspect that there is some sort of large-scale memory error in `CSR` or possibly the testing code.

For reference, the tests we were running are in the file `test_largeE_largeP.sh`, located here (using the second commented-out call to `ps_combo` for use on AiMOS).

EDIT: Upon further inspection, this does not seem to be a memory leak. However, it is the case that `CSR` is using much more memory than expected. I've checked, and it seems that `particles_on_process` is being calculated correctly, here. I ran some performance tests on `CSR` using the Kokkos memory-usage tools, here, with the test `mpirun -np 1 ./ps_combo160 1000 1000000 1 -n 1` on a 6-GPU node on AiMOS. I found that, at their maximums, `CabM` uses 331.2 MB and `CSR` uses 470.8 MB. This is unexpected behavior because `CabM` should be allocating more memory through the use of padding. I think I've tracked it down to the `particle_info` temporary `MTVs` in `CSR::rebuild`, here, but I'm not sure how it could be allocating this much extra space.