Closed by MatthewChristoff 3 years ago
I've tracked down the CSR error to this CreateViews call for received particles, here. Not sure why we're running out of memory, although it might have to do with the fact that CSR swap isn't implemented on this branch. It could also be an excessive use of smart pointers that aren't properly deallocated. Otherwise, we might have a memory leak somewhere.
It's strange that this is only happening for CSR.
Oh, it looks like SCS is also failing the sendToOne test... something is almost definitely wrong with that one.
20: migrateSendToOne scs_C32_SMAX_V1024, rank 3
20: migrateSendToOne scs_C32_SMAX_V1024, rank 0
20: migrateSendToOne scs_C32_SMAX_V1024, rank 2
20: migrateSendToOne scs_C32_SMAX_V1024, rank 1
20: scs_C32_SMAX_V1024 Rank 3 has incorrect number of particles (25 != 24)
20: scs_C32_SMAX_V1024 Rank 0 has incorrect number of particles (25 != 25)
20: scs_C32_SMAX_V1024 Rank 2 has incorrect number of particles (25 != 24)
20: scs_C32_SMAX_V1024 Rank 1 has incorrect number of particles (25 != 24)
It looks like the migrateSendToOne test failure happens because not all elements have the same number of particles and the test isn't particularly large, which doesn't match what the test expects to be passed in. I'll probably have to change the test slightly to accommodate my changes.
I fixed the issue with migrateSendToOne by adding a ceil around the calculation for the expected number of particles. The issue was that when the number of particles wasn't a multiple of 100, that calculation wasn't being done correctly.
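To illustrate the kind of rounding problem a ceil fixes, here is a minimal sketch. It assumes the expected count is a fixed percentage of the total particle count, which is a guess on my part; the names expectedTruncated, expectedCeil, num_ptcls and percent are hypothetical and not taken from the test code.

```cpp
#include <cmath>

// Hypothetical helpers; the actual test may compute its expected count differently.

// Integer division truncates toward zero, so any remainder is silently dropped.
int expectedTruncated(int num_ptcls, int percent) {
  return num_ptcls * percent / 100;
}

// Dividing in floating point and rounding up keeps the partial particle.
int expectedCeil(int num_ptcls, int percent) {
  return static_cast<int>(std::ceil(num_ptcls * percent / 100.0));
}
```

With 99 particles and 25 percent, the truncating version gives 24 while the ceil version gives 25. When the product is a multiple of 100 the two agree, which would explain why a mismatch like this only shows up for some particle counts.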
Alright, I've found the issue. I didn't realize that the View num_recv_particles needed to be initialized, so I'd added some bad Kokkos::ViewAllocateWithoutInitializing calls. I fixed it in all 3 migrate functions. Weird that it didn't show up until now.
Here's my commit for the fix: https://github.com/SCOREC/pumi-pic/commit/52c28ba3353af8f2d499a19e31e163a24aed3cac
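For anyone hitting something similar: a Kokkos View constructed with just a label and extents is initialized by default, while Kokkos::ViewAllocateWithoutInitializing skips that step and leaves the contents indeterminate. The sketch below is a plain C++ analogue of the pitfall, not the pumi-pic code. It assumes the per-rank receive counts are accumulated with +=, and the names countReceived, dest_rank and comm_size are hypothetical.

```cpp
#include <algorithm>
#include <memory>
#include <vector>

// Hypothetical helper: counts how many particles each rank will receive.
// dest_rank[i] is the rank that particle i is sent to.
std::vector<int> countReceived(const std::vector<int>& dest_rank, int comm_size) {
  // new int[n] leaves the contents indeterminate, much like an allocation
  // made without initialization.
  std::unique_ptr<int[]> counts(new int[comm_size]);

  // The counts are accumulated with +=, so they must start at zero.
  // Removing this fill would make the reads below undefined behavior.
  std::fill(counts.get(), counts.get() + comm_size, 0);

  for (int r : dest_rank) {
    if (r >= 0 && r < comm_size) {
      counts[r] += 1;
    }
  }
  return std::vector<int>(counts.get(), counts.get() + comm_size);
}
```

Skipping initialization is only safe when every entry is overwritten before it is read. A buffer that is accumulated into does not meet that condition, and since indeterminate memory often happens to contain zeros, a bug like this can stay hidden for a long time.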
These failures are happening during a "new" migration test. I noticed these earlier but hadn't gotten around to looking into them. Technically, they were old tests that never had their function call uncommented. I renamed the test migrateSendRight, and it is now located here (I moved another commented-out migration test here, migrateToEmptyAndRefill, along with another test we made more recently, migrateSendToOne).
Specifically, CSR causes an out-of-memory error which happens during the second part of this test, where a Distributor is used in the process of reversing a migrate operation with another migrate operation. This section is here.
Error on test_structures_small_4:
Error on test_structures_4:
Secondly, I also just noticed that if this section is commented out, ignoring that error, we seemingly have yet another issue where CabM does not have the expected number of particles in each rank in the tests SendToOne and migrateToEmptyAndRefill, here and here. This issue, at least with SendToOne, didn't occur in the original test when it was in its own separate file, so it likely has something to do with the structure being passed around in these tests.
Error on test_structures_small_4:
Error on test_structures_4:
Both of these issues only occur during true migration, by which I mean with multiple processes (otherwise migration doesn't happen and migrate just calls rebuild). I haven't found the source of these errors/bugs yet. They could be in the tests themselves.
Originally posted by @MatthewChristoff in https://github.com/SCOREC/pumi-pic/pull/63#issuecomment-813624294