zhangchonglin closed this issue 1 year ago
While building with gcc 7.3.1 and CUDA 11.7, with Cabana enabled, I saw the following 5 tests fail (one more than in the original issue):
The following tests FAILED:
17 - test_structures_small (Failed)
18 - test_structures_medium (Failed)
19 - test_structures_large (Failed)
20 - test_structures_small_4 (Failed)
21 - test_structures_4 (Failed)
From the log file of 17 - test_structures_small, this seems to be due to the following two lines:
[ERROR] Memory usage changed during structure dps| Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
17/52 Testing: test_structures_small
17/52 Test: test_structures_small
Command: "/hdds1/mpich/mpich-3.3.2-install/bin/mpirun" "-np" "1" "./test_structure" "small_ptcls_e5_p25_r0"
Directory: /hdds1/RPI/pumi-pic/install_cuda11.7/pumi-pic/build/particle_structs/test
"test_structures_small" start time: Jun 20 10:04 EDT
Output:
----------------------------------------------------------
CHECK: 1
Building SCS with C: 5 sigma: 5 V: 1024
testCounts scs_C32_SMAX_V1024
testParticleExistence scs_C32_SMAX_V1024, rank 0
setValues scs_C32_SMAX_V1024, rank 0
Time to set values scs_C32_SMAX_V1024 : 0.000084
pseudoPush scs_C32_SMAX_V1024, rank 0
elements : 5
parent elm data size : 5
Time for math Ops on scs_C32_SMAX_V1024 : 0.000129
testMetrics scs_C32_SMAX_V1024, rank 0
Metrics 0, C 5, V 1024, sigma 5
Nelems 5, Nchunks 1, Nslices 1, Nptcls 25, Capacity 25, Allocation 52
Padded Cells <Tot %> 0 0.000
Padded Slices <Tot %> 0 0.000
Empty Rows <Tot %> 0 0.000
testRebuild scs_C32_SMAX_V1024, rank 0
rebuildNoChanges scs_C32_SMAX_V1024, rank 0
rebuildNewElems scs_C32_SMAX_V1024, rank 0
rebuildNewPtcls scs_C32_SMAX_V1024, rank 0
rebuildPtclsDestroyed scs_C32_SMAX_V1024, rank 0
rebuildNewAndDestroyed scs_C32_SMAX_V1024, rank 0
testMigration scs_C32_SMAX_V1024, rank 0
migrateSendRight scs_C32_SMAX_V1024, rank 0
migrateSendRight (Reverse) scs_C32_SMAX_V1024, rank 0
migrateSendToOne scs_C32_SMAX_V1024, rank 0
testCopy scs_C32_SMAX_V1024
testSegmentComp scs_C32_SMAX_V1024, rank 0
migrateToEmptyAndRefill scs_C32_SMAX_V1024, rank 0
Building SCS with C: 5 sigma: 1 V: 10
testCounts scs_C32_S1_V10
testParticleExistence scs_C32_S1_V10, rank 0
setValues scs_C32_S1_V10, rank 0
Time to set values scs_C32_S1_V10 : 0.000035
pseudoPush scs_C32_S1_V10, rank 0
elements : 5
parent elm data size : 5
Time for math Ops on scs_C32_S1_V10 : 0.000071
testMetrics scs_C32_S1_V10, rank 0
Metrics 0, C 5, V 10, sigma 1
Nelems 5, Nchunks 1, Nslices 1, Nptcls 25, Capacity 25, Allocation 52
Padded Cells <Tot %> 0 0.000
Padded Slices <Tot %> 0 0.000
Empty Rows <Tot %> 0 0.000
testRebuild scs_C32_S1_V10, rank 0
rebuildNoChanges scs_C32_S1_V10, rank 0
rebuildNewElems scs_C32_S1_V10, rank 0
rebuildNewPtcls scs_C32_S1_V10, rank 0
rebuildPtclsDestroyed scs_C32_S1_V10, rank 0
rebuildNewAndDestroyed scs_C32_S1_V10, rank 0
testMigration scs_C32_S1_V10, rank 0
migrateSendRight scs_C32_S1_V10, rank 0
migrateSendRight (Reverse) scs_C32_S1_V10, rank 0
migrateSendToOne scs_C32_S1_V10, rank 0
testCopy scs_C32_S1_V10
testSegmentComp scs_C32_S1_V10, rank 0
migrateToEmptyAndRefill scs_C32_S1_V10, rank 0
Building CSR
initializing CSR data
testCounts csr
testParticleExistence csr, rank 0
setValues csr, rank 0
Time to set values csr : 0.000057
pseudoPush csr, rank 0
elements : 5
parent elm data size : 5
Time for math Ops on csr : 0.000074
testMetrics csr, rank 0
Metrics (Rank 0)
Number of Elements 5, Number of Particles 25, Capacity 26
testRebuild csr, rank 0
rebuildNoChanges csr, rank 0
rebuildNewElems csr, rank 0
rebuildNewPtcls csr, rank 0
rebuildPtclsDestroyed csr, rank 0
rebuildNewAndDestroyed csr, rank 0
testMigration csr, rank 0
migrateSendRight csr, rank 0
migrateSendRight (Reverse) csr, rank 0
migrateSendToOne csr, rank 0
testCopy csr
testSegmentComp csr, rank 0
migrateToEmptyAndRefill csr, rank 0
building CabM
initializing CabM data
testCounts cabm
testParticleExistence cabm, rank 0
setValues cabm, rank 0
Time to set values cabm : 0.000067
pseudoPush cabm, rank 0
elements : 5
parent elm data size : 5
Time for math Ops on cabm : 0.000070
testMetrics cabm, rank 0
Metrics (Rank 0)
Number of Elements 5, Number of SoA 6, Number of Particles 25, Capacity 192
Padded Cells <Tot %> 167 86.979%
Empty Elements <Tot %> 0 0.000%
testRebuild cabm, rank 0
rebuildNoChanges cabm, rank 0
rebuildNewElems cabm, rank 0
rebuildNewPtcls cabm, rank 0
rebuildPtclsDestroyed cabm, rank 0
rebuildNewAndDestroyed cabm, rank 0
testMigration cabm, rank 0
migrateSendRight cabm, rank 0
migrateSendRight (Reverse) cabm, rank 0
migrateSendToOne cabm, rank 0
testCopy cabm
testSegmentComp cabm, rank 0
migrateToEmptyAndRefill cabm, rank 0
building DPS
initializing DPS data
testCounts dps
testParticleExistence dps, rank 0
setValues dps, rank 0
Time to set values dps : 0.000052
pseudoPush dps, rank 0
elements : 5
parent elm data size : 5
Time for math Ops on dps : 0.000065
testMetrics dps, rank 0
Metrics (Rank 0)
Number of Elements 5, Number of SoA 2, Number of Particles 25, Capacity 64
Padded Cells <Tot %> 39 60.938%
testRebuild dps, rank 0
rebuildNoChanges dps, rank 0
rebuildNewElems dps, rank 0
rebuildNewPtcls dps, rank 0
rebuildPtclsDestroyed dps, rank 0
rebuildNewAndDestroyed dps, rank 0
testMigration dps, rank 0
migrateSendRight dps, rank 0
migrateSendRight (Reverse) dps, rank 0
migrateSendToOne dps, rank 0
testCopy dps
testSegmentComp dps, rank 0
migrateToEmptyAndRefill dps, rank 0
[ERROR] Memory usage changed during structure dps| Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
2 tests failed
<end of output>
Test time = 0.18 sec
----------------------------------------------------------
Test Failed.
"test_structures_small" end time: Jun 20 10:04 EDT
"test_structures_small" time elapsed: 00:00:00
Similarly, all other failed tests have the same two lines. For example, test_structures_large:
[ERROR] Memory usage changed during structure dps| Initial: 0.409729 GB | Final: 0.411682 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
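To compare these diffs across tests, it can help to pull the numbers out of the log and express them in MiB. The following is a minimal sketch, not part of pumi-pic: the function name parse_memory_errors and the regular expression are hypothetical, written against the two line shapes quoted above, and it assumes the reported GB values are bytes divided by 1024^3. Under that assumption, the 0.001953 GB diff seen in every failing test works out to about 2 MiB.

```python
import re

# Assumption: the test prints sizes in GB as bytes / 1024**3, so 1 GB = 1024 MiB.
MIB_PER_GB = 1024

# Hypothetical pattern matching the two "[ERROR] Memory usage changed" line
# shapes quoted in this issue (with and without "during structure <name>").
LINE_RE = re.compile(
    r"\[ERROR\] Memory usage changed(?: during structure (?P<structure>\w+))?\s*\|"
    r"\s*Initial:\s*(?P<initial>-?[\d.]+) GB\s*\|"
    r"\s*Final:\s*(?P<final>-?[\d.]+) GB\s*\|"
    r"\s*Diff:\s*(?P<diff>-?[\d.]+) GB"
)


def parse_memory_errors(log_text):
    """Return one dict per '[ERROR] Memory usage changed' line in log_text."""
    records = []
    for match in LINE_RE.finditer(log_text):
        records.append({
            # None for the whole-test summary line, which names no structure.
            "structure": match.group("structure"),
            "initial_gb": float(match.group("initial")),
            "final_gb": float(match.group("final")),
            "diff_mib": float(match.group("diff")) * MIB_PER_GB,
        })
    return records


log = """\
[ERROR] Memory usage changed during structure dps| Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
"""

for record in parse_memory_errors(log):
    print(record["structure"], f"{record['diff_mib']:.2f} MiB")
```

Running the same function over the full ctest log of each failing test would show whether the diff is the same for every structure and every test.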
Using the SCS particle structure, the following tests were failing:
The following tests FAILED:
16 - test_structures_large (Failed)
17 - test_structures_small_4 (Failed)
18 - test_structures_4 (Failed)
19 - test_structures_empty (Failed)
20 - test_structures_noptcls (Failed)
The log file is similar:
[ERROR] Memory usage changed during structure csr| Initial: 0.490845 GB | Final: 0.474121 GB | Diff: -0.016724 GB
[ERROR] Memory usage changed | Initial: 0.432007 GB | Final: 0.437012 GB | Diff: 0.005005 GB
pumi-pic now does its memory testing with valgrind, so this will no longer be an issue.
On a RHEL system with gcc 7.3.1 and CUDA 10.2.89, building pumi-pic without the Cabana particle structure, all 50 unit tests passed. Building pumi-pic with the Cabana particle structure enabled using -DENABLE_CABANA=ON, 4 of the 50 tests failed. This may suggest a bug in the code related to the Cabana particle structure.
From the stack trace of the unit test test_structures_small, the crash location points to the following line: https://github.com/SCOREC/pumi-pic/blob/4392873e87af1fd846fb643821054db904643d7b/particle_structs/test/test_rebuild.cpp#L49
It's likely the issue is in the following kernel before the above line: https://github.com/SCOREC/pumi-pic/blob/4392873e87af1fd846fb643821054db904643d7b/particle_structs/test/test_rebuild.cpp#L36-L48