SCOREC / pumi-pic

support libraries for unstructured mesh particle in cell simulations on GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

CSR and CabM failing migrate tests #64

Closed MatthewChristoff closed 3 years ago

MatthewChristoff commented 3 years ago

The following two tests fail both with Cabana enabled and with Cabana disabled:

The following tests FAILED:
         20 - test_structures_small_4 (Failed)
         21 - test_structures_4 (Failed)

These failures are happening during a "new" migration test. I noticed these earlier but hadn't gotten around to looking into them. Technically, they were old tests that never had their function call uncommented. I renamed the test to migrateSendRight, and it is now located here (I also moved another commented-out migration test, migrateToEmptyAndRefill, here, along with a test we made more recently, migrateSendToOne).

Specifically, CSR hits an out-of-memory error in the second part of this test, where a Distributor is used to reverse one migrate operation with another migrate operation. This section is here.

Error on test_structures_small_4:

20: testMigration csr, rank 2
20: terminate called after throwing an instance of 'std::runtime_error'
20:   what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /users/chrism5/pumipic_Cabm/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
20: Traceback functionality not available
20: 
20: terminate called after throwing an instance of 'std::runtime_error'
20:   what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /users/chrism5/pumipic_Cabm/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
20: Traceback functionality not available
20: 
20: terminate called after throwing an instance of 'std::runtime_error'
20:   what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /users/chrism5/pumipic_Cabm/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
20: Traceback functionality not available
20: 
20: terminate called after throwing an instance of 'std::runtime_error'
20:   what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /users/chrism5/pumipic_Cabm/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
20: Traceback functionality not available

Error on test_structures_4:

21: testMigration csr, rank 0
21: terminate called after throwing an instance of 'std::runtime_error'
21:   what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /users/chrism5/pumipic_Cabm/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
21: Traceback functionality not available
21: 
21: terminate called after throwing an instance of 'std::runtime_error'
21:   what():  cudaMalloc( &ptr, arg_alloc_size ) error( cudaErrorMemoryAllocation): out of memory /users/chrism5/pumipic_Cabm/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:175
21: Traceback functionality not available

Secondly, I also just noticed that if this section is commented out to bypass that error, there seems to be yet another issue: CabM does not have the expected number of particles on each rank in the tests SendToOne and migrateToEmptyAndRefill, here and here. This issue, at least with SendToOne, didn't occur in the original test when it was in its own separate file, so it likely has something to do with the structure being passed around in these tests.

Error on test_structures_small_4:

20: testMigration cabm, rank 2
20: cabm Rank 1 has incorrect number of particles (24 != 23)
20: testSegmentComp cabm, rank 1
20: migrateToEmptyAndRefill cabm, rank 1
20: cabm Rank 3 has incorrect number of particles (26 != 25)
20: cabm Rank 0 has incorrect number of particles (26 != 26)
20: cabm Rank 2 has incorrect number of particles (24 != 23)

Error on test_structures_4:

21: testMigration cabm, rank 0
21: cabm Rank 3 has incorrect number of particles (9900 != 9899)
...
21: migrateToEmptyAndRefill cabm, rank 3
21: migrateToEmptyAndRefill cabm, rank 2
21: migrateToEmptyAndRefill cabm, rank 1
21: cabm Rank 0 has incorrect number of particles (10300 != 10301)

Both of these issues only occur during true migration, by which I mean runs with multiple processes (otherwise no migration happens and migrate just calls rebuild). I haven't found the source of these errors/bugs yet. They could be in the tests themselves.

Originally posted by @MatthewChristoff in https://github.com/SCOREC/pumi-pic/pull/63#issuecomment-813624294

MatthewChristoff commented 3 years ago

I've tracked down the CSR error to this CreateViews call for received particles, here. Not sure why we're running out of memory, although it might have to do with the fact that CSR swap isn't implemented on this branch. It could also be an excessive use of smart pointers that aren't properly deallocated. Otherwise, we might have a memory leak somewhere.

It's strange that this is only happening for CSR.

https://github.com/SCOREC/pumi-pic/blob/926188ae560b8554877a4287e7c1b5455aad8365/particle_structs/src/csr/CSR_migrate.hpp#L144

MatthewChristoff commented 3 years ago

Oh, it looks like SCS is also failing the sendToOne test... something is almost definitely wrong with that one.

20: migrateSendToOne scs_C32_SMAX_V1024, rank 3
20: migrateSendToOne scs_C32_SMAX_V1024, rank 0
20: migrateSendToOne scs_C32_SMAX_V1024, rank 2
20: migrateSendToOne scs_C32_SMAX_V1024, rank 1
20: scs_C32_SMAX_V1024 Rank 3 has incorrect number of particles (25 != 24)
20: scs_C32_SMAX_V1024 Rank 0 has incorrect number of particles (25 != 25)
20: scs_C32_SMAX_V1024 Rank 2 has incorrect number of particles (25 != 24)
20: scs_C32_SMAX_V1024 Rank 1 has incorrect number of particles (25 != 24)

MatthewChristoff commented 3 years ago

It looks like the migrateSendToOne failure occurs because not all elements have the same number of particles and the test isn't particularly large, which is not what the test expects to be passed in. I probably have to change the test slightly to accommodate my changes.

MatthewChristoff commented 3 years ago

I fixed the issue with migrateSendToOne by adding a ceil around the calculation for the expected number of particles. When the number of particles wasn't a multiple of 100, that calculation gave the wrong result.
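
For readers following along, here is a minimal sketch of the kind of rounding mismatch a ceil can fix. The actual formula in the test is not shown in this thread, so the functions `expected_truncated` and `expected_ceil` and the idea of taking a percentage of the total are assumptions made purely for illustration.

```cpp
#include <cmath>

// Hypothetical: expected particle count as `percent` percent of `total`,
// computed with integer division, which drops any fractional part.
int expected_truncated(int total, int percent) {
  return total * percent / 100;
}

// Same hypothetical quantity, but rounded up with std::ceil so that a
// fractional particle counts as a whole one.
int expected_ceil(int total, int percent) {
  return static_cast<int>(std::ceil(total * percent / 100.0));
}
```

With `total = 100` and `percent = 25`, both versions return 25. With `total = 95`, the truncated version returns 23 while the ceil version returns 24, so the two only disagree when the product is not a multiple of 100.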

MatthewChristoff commented 3 years ago

Alright, I've found the issue. I didn't realize that the View num_recv_particles needed to be initialized, so I'd added some bad Kokkos::ViewAllocateWithoutInitializing calls. I fixed it in all 3 migrate functions. Weird that it didn't show up until now.

Here's my commit for the fix: https://github.com/SCOREC/pumi-pic/commit/52c28ba3353af8f2d499a19e31e163a24aed3cac
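
As background on why the missing initialization matters: in Kokkos, constructing a View from just a label and an extent zero-initializes its contents, while passing `Kokkos::ViewAllocateWithoutInitializing("label")` skips that step. The sketch below mimics the effect in plain C++ so it can run without a Kokkos build. It is not the pumi-pic code; `allocate_counts`, `total_received`, and the sentinel value are hypothetical, and the link to the out-of-memory error (a garbage total being used to size a receive buffer) is an assumption rather than something confirmed in this thread.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Stand-in for allocating a counter buffer. With initialize == false the
// buffer is filled with a sentinel that plays the role of stale memory
// (reading truly uninitialized memory would be undefined behavior in C++).
std::vector<long long> allocate_counts(std::size_t n, bool initialize) {
  const long long stale = 1000000000LL;  // hypothetical leftover value
  return std::vector<long long>(n, initialize ? 0LL : stale);
}

// Add one to the counter of each source rank listed in `source_ranks`, then
// return the sum of all counters, which a caller might use to size a buffer.
long long total_received(std::vector<long long> counts,
                         const std::vector<int>& source_ranks) {
  for (int r : source_ranks) counts[static_cast<std::size_t>(r)] += 1;
  return std::accumulate(counts.begin(), counts.end(), 0LL);
}
```

Calling `total_received(allocate_counts(4, true), {0, 1, 1, 3})` gives 4, the number of particles actually received. With `allocate_counts(4, false)` the same call gives 4000000004, because the increments land on top of the stale values. Whether such a bug shows up depends on what happens to be in memory, which may be why it can go unnoticed for a while.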