SCOREC / pumi-pic

support libraries for unstructured mesh particle in cell simulations on GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

CSR Having Unexpectedly Large Memory Usage #58

Open MatthewChristoff opened 3 years ago

MatthewChristoff commented 3 years ago

I've been working on the cabmBuild branch and noticed some unexpected behavior while testing CSR. To test larger amounts of data per particle, a new version of the testing code ps_combo.cpp was made: ps_combo32.cpp, which uses a size-32 array of doubles for each particle instead of the original size-3 array. It is linked here.

During comparative testing for CabM on AiMOS, CSR aborted with an out-of-memory error at 50,000 elements and 50,000,000 particles. The test command and error message are included below:

Test Command:
 ./ps_combo32 50000 50000000 1 -p 50 -n 1
Generating particle distribution with strategy: Uniform
Building CSR
Performing 100 iterations of rebuild on each structure
Beginning push on structure CSR
Beginning rebuild on structure CSR
terminate called after throwing an instance of 'std::runtime_error'
  what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /gpfs/u/barn/MPFS/MPFSmttw/pumipic_CabM/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
Traceback functionality not available

[dcs044:159743] *** Process received signal ***
[dcs044:159743] Signal: Aborted (6)
[dcs044:159743] Signal code:  (-6)
[dcs044:159743] [ 0] [0x7fff8ad704d8]
[dcs044:159743] [ 1] /usr/lib64/libc.so.6(abort+0x2b4)[0x7fff89412094]
[dcs044:159743] [ 2] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x7fff897a0644]
[dcs044:159743] [ 3] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(+0xab364)[0x7fff8979b364]
[dcs044:159743] [ 4] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(_ZSt9terminatev+0x20)[0x7fff8979b420]
[dcs044:159743] [ 5] /gpfs/u/software/ppc64le-rhel7/gcc/7.4.0/1/lib64/libstdc++.so.6(__cxa_throw+0x80)[0x7fff8979b8e0]
[dcs044:159743] [ 6] ./ps_combo32(_ZN6Kokkos4Impl23throw_runtime_exceptionERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE+0xc4)[0x101aedc0]
[dcs044:159743] [ 7] ./ps_combo32(_ZN6Kokkos4Impl25cuda_internal_error_throwE9cudaErrorPKcS3_i+0x170)[0x101b0f40]
[dcs044:159743] [ 8] ./ps_combo32(_ZN6Kokkos4Impl23cuda_internal_safe_callE9cudaErrorPKcS3_i+0x60)[0x101b4128]
[dcs044:159743] [ 9] ./ps_combo32(_ZNK6Kokkos9CudaSpace8allocateEm+0x60)[0x101b6478]
[dcs044:159743] [10] ./ps_combo32(_ZN6Kokkos4Impl22SharedAllocationRecordINS_9CudaSpaceEvEC2ERKS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEmPFvPNS1_IvvEEE+0x4c)[0x101b78a8]
[dcs044:159743] [11] ./ps_combo32(_ZN6Kokkos4ViewIPA32_dJNS_10LayoutLeftENS_6DeviceINS_4CudaENS_9CudaSpaceEEEEEC2IJNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEEEEERKNS_4Impl12ViewCtorPropIJDpT_EEERKNSt9enable_ifIXntsrSK_11has_pointerES3_E4typeE+0x10c)[0x1016632c]
[dcs044:159743] [12] ./ps_combo32(_ZN7pumipic3CSRINS_11MemberTypesIJiA32_ddEEEN6Kokkos9CudaSpaceEE7rebuildENS4_4ViewIPiJNS4_6DeviceINS4_4CudaES5_EEEEESC_PPv+0x308)[0x1017f628]
[dcs044:159743] [13] ./ps_combo32(main+0x1800)[0x100a8e60]
[dcs044:159743] [14] /usr/lib64/libc.so.6(+0x25200)[0x7fff893f5200]
[dcs044:159743] [15] /usr/lib64/libc.so.6(__libc_start_main+0xc4)[0x7fff893f53f4]
[dcs044:159743] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node dcs044 exited on signal 6 (Aborted).
--------------------------------------------------------------------------

However, neither the SCS nor the CabM particle structure fails until our next set of tests at 75,000 elements and 75,000,000 particles. We investigated by running ps_combo32 again with the number of iterations at line 89 (originally 100) reduced to 1. In this case, all three particle structures failed with an out-of-memory error at 75,000 elements and 75,000,000 particles. This leads me to suspect that there is some sort of large-scale memory error in CSR or possibly in the testing code. (See the edit below.)
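As a rough sanity check on these thresholds, the sketch below estimates the raw particle-data footprint. It assumes the per-particle payload is the `int`, `double[32]`, `double` triple that appears in the `MemberTypes` signature in the backtrace, and that each member is stored in its own array with no padding. The helper name `footprint_gb` is made up for illustration, and the estimate ignores element offsets, temporaries, and allocator overhead.

```cpp
#include <cstddef>

// Per-particle payload, read off the MemberTypes<int, double[32], double>
// signature in the backtrace. Assumes one array per member, so no struct padding.
constexpr std::size_t bytes_per_particle =
    sizeof(int) + sizeof(double[32]) + sizeof(double);

// Hypothetical helper: gigabytes (1e9 bytes) needed when `copies` full
// copies of the particle data are alive at the same time.
constexpr double footprint_gb(std::size_t particles, int copies) {
  return static_cast<double>(particles) * bytes_per_particle * copies / 1e9;
}
```

On a platform with 4-byte `int` and 8-byte `double`, this gives 268 bytes per particle, so `footprint_gb(50000000, 1)` is about 13.4 GB and `footprint_gb(75000000, 1)` is about 20.1 GB. Two live copies therefore need roughly 26.8 GB at 50,000,000 particles and 40.2 GB at 75,000,000, while three live copies already need about 40.2 GB at 50,000,000. The thread does not state the GPU memory capacity, so whether a given multiple fits is left to the reader.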

For reference, the set of tests we were running is in the file test_largeE_largeP.sh, located here (using the second commented-out call to ps_combo for use on AiMOS).

EDIT: Upon further inspection, this does not seem to be a memory leak. However, CSR is using much more memory than expected. I've checked, and particles_on_process seems to be calculated correctly, here. I ran some performance tests on CSR using the Kokkos memory-usage tools, here, with the test mpirun -np 1 ./ps_combo160 1000 1000000 1 -n 1 on a 6-GPU node on AiMOS. At their maximums, CabM uses 331.2 MB and CSR uses 470.8 MB. This is unexpected because CabM should be allocating more memory due to its padding. I think I've tracked it down to the particle_info temporary MTVs in CSR::rebuild, here, but I'm not sure how it could be allocating this much extra space.

MatthewChristoff commented 3 years ago

UPDATE: The issue was found. Because CSR uses MTVs to store its particle data and continually creates and destroys them, these get calls were leaving a few smart pointers to the original set of data. Thus, when rebuilding, CSR was using 3x the memory of ptcl_data instead of just 2x. For now, this has been fixed by enclosing these get calls in a for loop, which causes the smart pointers to go out of scope before the call to migrate/rebuild.
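The lifetime problem can be illustrated with a small host-only sketch. This is not the pumi-pic or Kokkos code: `Buffer`, `ToyStructure`, and `old_buffer_survives_rebuild` are hypothetical names, and `std::shared_ptr` stands in for the reference-counted handles returned by the get calls. It only shows that a handle held across a rebuild keeps the old allocation alive, while scoping the handle lets it be freed.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a reference-counted particle data buffer.
using Buffer = std::shared_ptr<std::vector<double>>;

// Hypothetical stand-in for a particle structure that replaces its buffer on rebuild.
struct ToyStructure {
  Buffer data;
  explicit ToyStructure(std::size_t n)
      : data(std::make_shared<std::vector<double>>(n)) {}

  // Returns a handle that shares ownership of the current buffer.
  Buffer get() const { return data; }

  // Copies into a new buffer and drops the structure's own reference to the old one.
  void rebuild() {
    Buffer fresh = std::make_shared<std::vector<double>>(*data);
    data = fresh;
  }
};

// Returns true if the pre-rebuild buffer is still allocated after rebuild().
bool old_buffer_survives_rebuild(bool scope_the_handle) {
  ToyStructure ps(1000);
  std::weak_ptr<std::vector<double>> old_data = ps.data;  // observes, does not own
  if (scope_the_handle) {
    for (int i = 0; i < 1; ++i) {  // the loop body only provides a scope
      Buffer h = ps.get();
      (*h)[0] = 1.0;
    }  // h is released here, before the rebuild
    ps.rebuild();
    return !old_data.expired();
  }
  Buffer h = ps.get();  // handle outlives the rebuild
  (*h)[0] = 1.0;
  ps.rebuild();
  return !old_data.expired();
}
```

With the handle held across the rebuild, `old_buffer_survives_rebuild(false)` returns `true`: the old data cannot be released even though the structure no longer refers to it. With the handle scoped, `old_buffer_survives_rebuild(true)` returns `false`. A plain brace-delimited block would provide the same scope as the for loop.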

A general fix has been proposed and is in progress: a second copy of ptcl_data would be stored at all times for swapping purposes (like SCS) for both CSR and CabanaM.
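A minimal sketch of the swapping idea, under the assumption that the second copy is a pre-allocated buffer of the same capacity that the rebuild writes into before the two are exchanged. `SwapStore` and its members are hypothetical names, and host `std::vector`s stand in for device storage; the actual SCS, CSR, and CabanaM implementations may differ.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical double-buffered particle store.
struct SwapStore {
  std::vector<double> active;  // data the simulation currently reads
  std::vector<double> spare;   // second copy, kept allocated at all times

  explicit SwapStore(std::size_t capacity)
      : active(capacity, 0.0), spare(capacity, 0.0) {}

  // Moves particle i to slot new_index[i] in the spare buffer, then swaps.
  // Assumes new_index.size() <= active.size() and every index < spare.size().
  // No allocation happens here, so peak memory stays at two copies.
  void rebuild(const std::vector<std::size_t>& new_index) {
    for (std::size_t i = 0; i < new_index.size(); ++i)
      spare[new_index[i]] = active[i];
    std::swap(active, spare);  // exchanges the underlying buffers, no copy
  }
};
```

For example, with `active` holding `{10, 20, 30}` and `new_index` equal to `{2, 0, 1}`, `active` holds `{20, 30, 10}` after `rebuild`. The sketch omits the case where the particle count outgrows the capacity and both buffers must be reallocated.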

MatthewChristoff commented 3 years ago

Once CSR has its swapping implementation done, we could probably close this issue, although the issue is still technically there for cases in which CSR increases in size so that it triggers a full rebuild.