SCOREC / pumi-pic

support libraries for unstructured mesh particle in cell simulations on GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Tests failed enabling Cabana particle structure #84

Closed zhangchonglin closed 1 year ago

zhangchonglin commented 2 years ago

On a RHEL system with gcc 7.3.1 and CUDA 10.2.89, all 50 unit tests pass when pumi-pic is built without the Cabana particle structure. When it is built with the Cabana particle structure enabled (-DENABLE_CABANA=ON), 4 of the 50 tests fail:

The following tests FAILED:
     17 - test_structures_small (Failed)
     19 - test_structures_large (Failed)
     20 - test_structures_small_4 (Failed)
     21 - test_structures_4 (Failed)

This suggests a bug in the code related to the Cabana particle structure. The stack trace of the unit test test_structures_small is shown below:

#0  0x00007f07d2e96387 in raise () from /usr/lib64/libc.so.6
#1  0x00007f07d2e97a78 in abort () from /usr/lib64/libc.so.6
#2  0x00007f07d37a6a95 in __gnu_cxx::__verbose_terminate_handler() () from /usr/lib64/libstdc++.so.6
#3  0x00007f07d37a4a06 in ?? () from /usr/lib64/libstdc++.so.6
#4  0x00007f07d37a4a33 in std::terminate() () from /usr/lib64/libstdc++.so.6
#5  0x00007f07d37a4c53 in __cxa_throw () from /usr/lib64/libstdc++.so.6
#6  0x00007f07d5f4ed5d in Kokkos::Impl::throw_runtime_exception (
    msg="cudaMemcpy(dst, src, n, cudaMemcpyDefault) error( cudaErrorIllegalAddress): an illegal memory access was encountered /hdds1/RPI/pumi-pic/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp:100")
    at /hdds1/RPI/pumi-pic/kokkos/core/src/impl/Kokkos_Error.cpp:72
#7  0x00007f07d5f5d569 in Kokkos::Impl::cuda_internal_error_throw (e=cudaErrorIllegalAddress, name=<optimized out>, file=0x7f07d5f68a90 "/hdds1/RPI/pumi-pic/kokkos/core/src/Cuda/Kokkos_CudaSpace.cpp", line=100)
    at /hdds1/RPI/pumi-pic/kokkos/core/src/Cuda/Kokkos_Cuda_Instance.cpp:154
#8  0x00000000013b2066 in Kokkos::Impl::DeepCopy<Kokkos::HostSpace, Kokkos::CudaSpace, Kokkos::Serial>::DeepCopy(void*, void const*, unsigned long) ()
#9  0x0000000001352beb in void Kokkos::deep_copy<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >(Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >::non_const_value_type&, Kokkos::View<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> > const&, std::enable_if<std::is_same<Kokkos::ViewTraits<int, Kokkos::LayoutLeft, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace>, Kokkos::MemoryTraits<0u> >::specialize, void>::value, void>::type*) ()
#10 0x00000000012c2f71 in int pumipic::getLastValue<int, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >(Kokkos::View<int*, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >) ()
#11 0x000000000127fd10 in rebuildNoChanges(char const*, pumipic::ParticleStructure<pumipic::MemberTypes<int, double [3], short, int>, Kokkos::CudaSpace>*) ()
#12 0x000000000127dcbd in testRebuild(char const*, pumipic::ParticleStructure<pumipic::MemberTypes<int, double [3], short, int>, Kokkos::CudaSpace>*) ()
#13 0x000000000127cf25 in main ()

From the stack trace, the crash surfaces at the following line: https://github.com/SCOREC/pumi-pic/blob/4392873e87af1fd846fb643821054db904643d7b/particle_structs/test/test_rebuild.cpp#L49

Because CUDA reports kernel faults asynchronously, the cudaMemcpy in the trace is probably only where an earlier illegal access was detected. The issue is therefore likely in the kernel just before that line: https://github.com/SCOREC/pumi-pic/blob/4392873e87af1fd846fb643821054db904643d7b/particle_structs/test/test_rebuild.cpp#L36-L48

zhangchonglin commented 1 year ago

Building with gcc 7.3.1 and CUDA 11.7 with Cabana enabled, I saw the following 5 tests fail (one more than in the original report):

The following tests FAILED:
     17 - test_structures_small (Failed)
     18 - test_structures_medium (Failed)
     19 - test_structures_large (Failed)
     20 - test_structures_small_4 (Failed)
     21 - test_structures_4 (Failed)

From the log file of 17 - test_structures_small, the failure seems to be due to the following two lines, shown here first and again at the end of the full log that follows:

[ERROR] Memory usage changed during structure dps| Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
17/52 Testing: test_structures_small                                            
17/52 Test: test_structures_small                                               
Command: "/hdds1/mpich/mpich-3.3.2-install/bin/mpirun" "-np" "1" "./test_structure" "small_ptcls_e5_p25_r0"
Directory: /hdds1/RPI/pumi-pic/install_cuda11.7/pumi-pic/build/particle_structs/test
"test_structures_small" start time: Jun 20 10:04 EDT                            
Output:                                                                         
----------------------------------------------------------                      
CHECK: 1                                                                        
Building SCS with C: 5 sigma: 5 V: 1024                                         
testCounts scs_C32_SMAX_V1024                                                   
testParticleExistence scs_C32_SMAX_V1024, rank 0                                
setValues scs_C32_SMAX_V1024, rank 0                                            
Time to set values scs_C32_SMAX_V1024 : 0.000084                                
pseudoPush scs_C32_SMAX_V1024, rank 0                                           
elements : 5                                                                    
parent elm data size : 5                                                        
Time for math Ops on scs_C32_SMAX_V1024 : 0.000129                              
testMetrics scs_C32_SMAX_V1024, rank 0                                          
Metrics 0, C 5, V 1024, sigma 5                                                 
Nelems 5, Nchunks 1, Nslices 1, Nptcls 25, Capacity 25, Allocation 52           
Padded Cells <Tot %> 0 0.000                                                    
Padded Slices <Tot %> 0 0.000                                                   
Empty Rows <Tot %> 0 0.000                                                      

testRebuild scs_C32_SMAX_V1024, rank 0                                          
rebuildNoChanges scs_C32_SMAX_V1024, rank 0                                     
rebuildNewElems scs_C32_SMAX_V1024, rank 0                                      
rebuildNewPtcls scs_C32_SMAX_V1024, rank 0                                      
rebuildPtclsDestroyed scs_C32_SMAX_V1024, rank 0                                
rebuildNewAndDestroyed scs_C32_SMAX_V1024, rank 0                               
testMigration scs_C32_SMAX_V1024, rank 0                                        
migrateSendRight scs_C32_SMAX_V1024, rank 0                                     
migrateSendRight (Reverse) scs_C32_SMAX_V1024, rank 0                           
migrateSendToOne scs_C32_SMAX_V1024, rank 0                                     
testCopy scs_C32_SMAX_V1024                                                     
testSegmentComp scs_C32_SMAX_V1024, rank 0                                      
migrateToEmptyAndRefill scs_C32_SMAX_V1024, rank 0                              
Building SCS with C: 5 sigma: 1 V: 10                                           
testCounts scs_C32_S1_V10                                                       
testParticleExistence scs_C32_S1_V10, rank 0                                    
setValues scs_C32_S1_V10, rank 0                                                
Time to set values scs_C32_S1_V10 : 0.000035                                    
pseudoPush scs_C32_S1_V10, rank 0                                               
elements : 5                                                                    
parent elm data size : 5                                                        
Time for math Ops on scs_C32_S1_V10 : 0.000071                                  
testMetrics scs_C32_S1_V10, rank 0                                              
Metrics 0, C 5, V 10, sigma 1                                                   
Nelems 5, Nchunks 1, Nslices 1, Nptcls 25, Capacity 25, Allocation 52           
Padded Cells <Tot %> 0 0.000                                                    
Padded Slices <Tot %> 0 0.000                                                   
Empty Rows <Tot %> 0 0.000                                        

testRebuild scs_C32_S1_V10, rank 0                                              
rebuildNoChanges scs_C32_S1_V10, rank 0                                         
rebuildNewElems scs_C32_S1_V10, rank 0                                          
rebuildNewPtcls scs_C32_S1_V10, rank 0                                          
rebuildPtclsDestroyed scs_C32_S1_V10, rank 0                                    
rebuildNewAndDestroyed scs_C32_S1_V10, rank 0                                   
testMigration scs_C32_S1_V10, rank 0                                            
migrateSendRight scs_C32_S1_V10, rank 0                                         
migrateSendRight (Reverse) scs_C32_S1_V10, rank 0                               
migrateSendToOne scs_C32_S1_V10, rank 0                                         
testCopy scs_C32_S1_V10                                                         
testSegmentComp scs_C32_S1_V10, rank 0                                          
migrateToEmptyAndRefill scs_C32_S1_V10, rank 0                                  
Building CSR                                                                    
initializing CSR data                                                           
testCounts csr                                                                  
testParticleExistence csr, rank 0                                               
setValues csr, rank 0                                                           
Time to set values csr : 0.000057                                               
pseudoPush csr, rank 0                                                          
elements : 5                                                                    
parent elm data size : 5                                                        
Time for math Ops on csr : 0.000074                                             
testMetrics csr, rank 0                                                         
Metrics (Rank 0)                                                                
Number of Elements 5, Number of Particles 25, Capacity 26                       

testRebuild csr, rank 0                                                         
rebuildNoChanges csr, rank 0                                                    
rebuildNewElems csr, rank 0                                                     
rebuildNewPtcls csr, rank 0                                                     
rebuildPtclsDestroyed csr, rank 0                                               
rebuildNewAndDestroyed csr, rank 0                                              
testMigration csr, rank 0                                                       
migrateSendRight csr, rank 0                                                    
migrateSendRight (Reverse) csr, rank 0                                          
migrateSendToOne csr, rank 0                                                    
testCopy csr                                                                    
testSegmentComp csr, rank 0                                                     
migrateToEmptyAndRefill csr, rank 0                                             
building CabM                                                                   
initializing CabM data                                                          
testCounts cabm                                                                 
testParticleExistence cabm, rank 0                                              
setValues cabm, rank 0                                                          
Time to set values cabm : 0.000067                                              
pseudoPush cabm, rank 0                                                         
elements : 5                                                                    
parent elm data size : 5                                                        
Time for math Ops on cabm : 0.000070                                            
testMetrics cabm, rank 0                                                        
Metrics (Rank 0)                                                                
Number of Elements 5, Number of SoA 6, Number of Particles 25, Capacity 192     
Padded Cells <Tot %> 167 86.979%                                                
Empty Elements <Tot %> 0 0.000%

testRebuild cabm, rank 0                                                        
rebuildNoChanges cabm, rank 0                                                   
rebuildNewElems cabm, rank 0                                                    
rebuildNewPtcls cabm, rank 0                                                    
rebuildPtclsDestroyed cabm, rank 0                                              
rebuildNewAndDestroyed cabm, rank 0                                             
testMigration cabm, rank 0                                                      
migrateSendRight cabm, rank 0                                                   
migrateSendRight (Reverse) cabm, rank 0                                         
migrateSendToOne cabm, rank 0                                                   
testCopy cabm                                                                   
testSegmentComp cabm, rank 0                                                    
migrateToEmptyAndRefill cabm, rank 0                                            
building DPS                                                                    
initializing DPS data                                                           
testCounts dps                                                                  
testParticleExistence dps, rank 0                                               
setValues dps, rank 0                                                           
Time to set values dps : 0.000052                                               
pseudoPush dps, rank 0                                                          
elements : 5                                                                    
parent elm data size : 5                                                        
Time for math Ops on dps : 0.000065                                             
testMetrics dps, rank 0                                                         
Metrics (Rank 0)                                                                
Number of Elements 5, Number of SoA 2, Number of Particles 25, Capacity 64      
Padded Cells <Tot %> 39 60.938%                                                 

testRebuild dps, rank 0                                                         
rebuildNoChanges dps, rank 0                                                    
rebuildNewElems dps, rank 0                                                     
rebuildNewPtcls dps, rank 0                                                     
rebuildPtclsDestroyed dps, rank 0                                               
rebuildNewAndDestroyed dps, rank 0                                              
testMigration dps, rank 0                                                       
migrateSendRight dps, rank 0                                                    
migrateSendRight (Reverse) dps, rank 0                                          
migrateSendToOne dps, rank 0                                                    
testCopy dps                                                                    
testSegmentComp dps, rank 0                                                     
migrateToEmptyAndRefill dps, rank 0                                             
[ERROR] Memory usage changed during structure dps| Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB
2 tests failed                                                                  
<end of output>                                                                 
Test time =   0.18 sec                                                          
----------------------------------------------------------                      
Test Failed.                                                                    
"test_structures_small" end time: Jun 20 10:04 EDT                              
"test_structures_small" time elapsed: 00:00:00
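
When triaging several of these logs, it can help to pull the numbers out of the "[ERROR] Memory usage changed" lines and confirm that the reported Diff matches Final minus Initial. The following is a minimal sketch using only the C++ standard library. The names MemReport, parseMemReport and isConsistent are hypothetical helpers written for this illustration and are not part of pumi-pic; the line format is assumed to be exactly the one shown in the log above.

```cpp
#include <cmath>
#include <cstdio>
#include <string>

// Values reported by one "[ERROR] Memory usage changed" line, in GB.
// (Hypothetical helper type, not part of pumi-pic.)
struct MemReport {
  double initial_gb;
  double final_gb;
  double diff_gb;
};

// Parses the "Initial: ... GB | Final: ... GB | Diff: ... GB" tail of a line.
// Returns false if the line does not contain that pattern.
bool parseMemReport(const std::string& line, MemReport& out) {
  const std::size_t pos = line.find("Initial:");
  if (pos == std::string::npos) return false;
  const int matched = std::sscanf(
      line.c_str() + pos,
      "Initial: %lf GB | Final: %lf GB | Diff: %lf GB",
      &out.initial_gb, &out.final_gb, &out.diff_gb);
  return matched == 3;
}

// True if Final - Initial agrees with Diff to within the six printed decimals.
bool isConsistent(const MemReport& r) {
  return std::fabs((r.final_gb - r.initial_gb) - r.diff_gb) < 1e-6;
}
```

Feeding each line of a CTest log to parseMemReport and skipping the lines for which it returns false leaves only the memory reports, which can then be compared across tests.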

zhangchonglin commented 1 year ago

All of the other failed tests report the same two errors.

For example, test_structures_large:

[ERROR] Memory usage changed during structure dps| Initial: 0.409729 GB | Final: 0.411682 GB | Diff: 0.001953 GB
[ERROR] Memory usage changed | Initial: 0.372620 GB | Final: 0.374573 GB | Diff: 0.001953 GB

zhangchonglin commented 1 year ago

Using the SCS particle structure, the following tests failed:

The following tests FAILED:
     16 - test_structures_large (Failed)
     17 - test_structures_small_4 (Failed)
     18 - test_structures_4 (Failed)
     19 - test_structures_empty (Failed)
     20 - test_structures_noptcls (Failed)

The log file is similar:

[ERROR] Memory usage changed during structure csr| Initial: 0.490845 GB | Final: 0.474121 GB | Diff: -0.016724 GB
[ERROR] Memory usage changed | Initial: 0.432007 GB | Final: 0.437012 GB | Diff: 0.005005 GB

Angelyr commented 1 year ago

pumi-pic now does its memory testing with valgrind, so this will no longer be an issue.