SCOREC / pumi-pic

support libraries for unstructured mesh particle in cell simulations on GPUs and CPUs
BSD 3-Clause "New" or "Revised" License

Significant slowdown in PUMIPic tests using kokkos4.2.00, cuda11.7 and latest PUMIPic and omega_h commits #117

Closed zhangchonglin closed 7 months ago

zhangchonglin commented 8 months ago

Configure and build script of PUMIPic:

cuda=/usr/local/cuda-11.7                                                       
export PATH=$cuda/bin:$PATH                                                     
export LD_LIBRARY_PATH=$cuda/lib64:$LD_LIBRARY_PATH                             
export installroot=$PWD                                                         
export srcroot=$installroot/../                                                     

# kokkos                                                                        
export kk=$installroot/kokkos/install                                           
export kksrc=$srcroot/kokkos                                                    

# omega_h                                                                       
export oh=$installroot/omega_h/install                                          

#EnGPar                                                                         
export EnGPar=$installroot/EnGPar/install                                       

# Cabana                                                                        
export cabana=$installroot/cabana/install                                       

# pumi-pic                                                                      
export pumipicsrc=$srcroot/pumi-pic                                             
export testdir=$pumipicsrc/pumipic-data                                         
export pumipic=$installroot/pumi-pic/install                                    

export CMAKE_PREFIX_PATH=$kk:$oh:$EnGPar:$cabana:$CMAKE_PREFIX_PATH             

cd $installroot                                                                 
mkdir -p pumi-pic/build                                                         
cd pumi-pic/build                                                               

cmake $pumipicsrc -DCMAKE_BUILD_TYPE=Release \                                  
                  -DCMAKE_CXX_COMPILER=mpicxx \                                 
                  -DIS_TESTING=ON \                                             
                  -DENABLE_CABANA=OFF \                                         
                  -DBUILD_SHARED_LIBS=OFF \                                     
                  -DPS_IS_TESTING=ON \                                          
                  -DCMAKE_INSTALL_PREFIX=$pumipic \                             
                  -DCMAKE_CXX_FLAGS="-fPIC" \                                   
                  -DTEST_DATA_DIR=$testdir                                      

make -j4 install                                                                
ctest   

Below is the complete log from the tests:

100% tests passed, 0 tests failed out of 52

Total Test time (real) = 30.23 sec


Test time using kokkos 4.2.00:

      Start  1: viewComm_1
 1/57 Test  #1: viewComm_1 .......................   Passed    0.16 sec
      Start  2: viewComm_2
 2/57 Test  #2: viewComm_2 .......................   Passed    0.18 sec
      Start  3: viewComm_4
 3/57 Test  #3: viewComm_4 .......................   Passed    0.26 sec
      Start  4: type_test
 4/57 Test  #4: type_test ........................   Passed    0.15 sec
      Start  5: sort_test
 5/57 Test  #5: sort_test ........................   Passed    0.13 sec
      Start  6: scanTest
 6/57 Test  #6: scanTest .........................   Passed    0.13 sec
      Start  7: view_test
 7/57 Test  #7: view_test ........................   Passed    0.09 sec
      Start  8: initParticles
 8/57 Test  #8: initParticles ....................   Passed    0.15 sec
      Start  9: buildSCS
 9/57 Test  #9: buildSCS .........................   Passed    0.17 sec
      Start 10: scs_padding
10/57 Test #10: scs_padding ......................   Passed    0.25 sec
      Start 11: lambdaTest
11/57 Test #11: lambdaTest .......................   Passed    0.15 sec
      Start 12: write_ptcl_small
12/57 Test #12: write_ptcl_small .................   Passed    0.16 sec
      Start 13: write_ptcl_small_4
13/57 Test #13: write_ptcl_small_4 ...............   Passed    0.20 sec
      Start 14: write_ptcl_4
14/57 Test #14: write_ptcl_4 .....................   Passed    0.22 sec
      Start 15: write_ptcl_empty
15/57 Test #15: write_ptcl_empty .................   Passed    0.18 sec
      Start 16: write_ptcl_noptcls
16/57 Test #16: write_ptcl_noptcls ...............   Passed    0.18 sec
      Start 17: write_ptcl_medium
17/57 Test #17: write_ptcl_medium ................   Passed    0.14 sec
      Start 18: write_ptcl_large
18/57 Test #18: write_ptcl_large .................   Passed    0.58 sec
      Start 19: test_structures_small
19/57 Test #19: test_structures_small ............   Passed    0.32 sec
      Start 20: test_structures_medium
20/57 Test #20: test_structures_medium ...........   Passed    0.32 sec
      Start 21: test_structures_large
21/57 Test #21: test_structures_large ............   Passed    1.44 sec
      Start 22: test_structures_small_4
22/57 Test #22: test_structures_small_4 ..........   Passed    1.02 sec
      Start 23: test_structures_4
23/57 Test #23: test_structures_4 ................   Passed    1.20 sec
      Start 24: test_structures_empty
24/57 Test #24: test_structures_empty ............   Passed    0.58 sec
      Start 25: test_structures_noptcls
25/57 Test #25: test_structures_noptcls ..........   Passed    0.61 sec
      Start 26: destroy_test
26/57 Test #26: destroy_test .....................   Passed    1.52 sec
      Start 27: barycentric_3
27/57 Test #27: barycentric_3 ....................   Passed    0.14 sec
      Start 28: test_adj_2d
28/57 Test #28: test_adj_2d ......................   Passed    0.66 sec
      Start 29: test_adj_3d
29/57 Test #29: test_adj_3d ......................   Passed    2.01 sec
      Start 30: search2d
30/57 Test #30: search2d .........................   Passed    0.45 sec
      Start 31: print_partition_cube_2
31/57 Test #31: print_partition_cube_2 ...........   Passed    0.62 sec
      Start 32: ptn_loading_cube
32/57 Test #32: ptn_loading_cube .................   Passed    0.28 sec
      Start 33: print_partition_cube_4
33/57 Test #33: print_partition_cube_4 ...........   Passed    0.46 sec
      Start 34: ptn_loading_cube_4
34/57 Test #34: ptn_loading_cube_4 ...............   Passed    0.48 sec
      Start 35: print_partition_pisces_4
35/57 Test #35: print_partition_pisces_4 .........   Passed    0.51 sec
      Start 36: ptn_loading_pisces
36/57 Test #36: ptn_loading_pisces ...............   Passed    0.49 sec
      Start 37: print_partition_2d_box_4
37/57 Test #37: print_partition_2d_box_4 .........   Passed    0.43 sec
      Start 38: ptn_loading_2d_box_4
38/57 Test #38: ptn_loading_2d_box_4 .............   Passed    0.41 sec
      Start 39: full_mesh_pisces
39/57 Test #39: full_mesh_pisces .................   Passed    0.81 sec
      Start 40: input_construct_cube
40/57 Test #40: input_construct_cube .............   Passed    0.56 sec
      Start 41: comm_array_pisces
41/57 Test #41: comm_array_pisces ................   Passed    0.90 sec
      Start 42: comm_array_2d_box
42/57 Test #42: comm_array_2d_box ................   Passed    0.68 sec
      Start 43: file_rw_cube_4
43/57 Test #43: file_rw_cube_4 ...................   Passed    0.61 sec
      Start 44: file_rw_xgc_24k_1
44/57 Test #44: file_rw_xgc_24k_1 ................   Passed    0.15 sec
      Start 45: file_rw_xgc_24k_4
45/57 Test #45: file_rw_xgc_24k_4 ................   Passed    0.44 sec
      Start 46: file_rw_xgc_120k_1
46/57 Test #46: file_rw_xgc_120k_1 ...............   Passed    0.28 sec
      Start 47: file_rw_xgc_120k_4
47/57 Test #47: file_rw_xgc_120k_4 ...............   Passed    0.60 sec
      Start 48: lb_r1
48/57 Test #48: lb_r1 ............................   Passed    5.18 sec
      Start 49: lb_r4
49/57 Test #49: lb_r4 ............................   Passed    0.68 sec
      Start 50: pseudoPushAndSearch_t1
50/57 Test #50: pseudoPushAndSearch_t1 ...........   Passed    0.78 sec
      Start 51: pseudoPushAndSearch_t2_r2
51/57 Test #51: pseudoPushAndSearch_t2_r2 ........   Passed    2.05 sec
      Start 52: pseudoPushAndSearch_cube_t1
52/57 Test #52: pseudoPushAndSearch_cube_t1 ......   Passed    0.90 sec
      Start 53: pseudoXGCm_scatter
53/57 Test #53: pseudoXGCm_scatter ...............   Passed    0.14 sec
      Start 54: pseudoXGCm_24kElms
54/57 Test #54: pseudoXGCm_24kElms ...............   Passed   15.10 sec
      Start 55: pseudoXGCm_24kElms_4
55/57 Test #55: pseudoXGCm_24kElms_4 .............   Passed    6.49 sec
      Start 56: pseudoXGCm_120kElms
56/57 Test #56: pseudoXGCm_120kElms ..............   Passed    5.36 sec
      Start 57: pseudoXGCm_120kElms_4
57/57 Test #57: pseudoXGCm_120kElms_4 ............   Passed   16.31 sec

100% tests passed, 0 tests failed out of 57

Total Test time (real) = 75.46 sec

cwsmith commented 8 months ago

@zhangchonglin Thanks for the detailed report; especially running the older versions. A few comments:

zhangchonglin commented 8 months ago

@cwsmith: thanks for these comments. I agree that pseudoXGCm_120kElms and pseudoPushAndSearch_t1 could be the focus, since these two tests exercise major parts of PUMIPic and use a single GPU. Cabana should not matter here since it is essentially not used in the above tests.

cwsmith commented 8 months ago

Running git bisect pointed at this commit for the performance drop in pseudoXGCm_120kElms and pseudoPushAndSearch_t1:

c17b75a9a5fde6ba815bfe68b9fac2adc64054d5 is the first bad commit
commit c17b75a9a5fde6ba815bfe68b9fac2adc64054d5
Author: Angelyr <scardking@gmail.com>
Date:   Mon Nov 20 18:34:48 2023 -0500

    fixed sigma = INT_MAX

 particle_structs/src/scs/SCS_sort.h      | 1 +
 particle_structs/test/test_structure.cpp | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

zhangchonglin commented 8 months ago

This is interesting. There is effectively only a one line change in this commit:

    sigma = std::min(sigma, std::max(num_elems, 1));

cwsmith commented 8 months ago

Yeah. I've reverted the commit locally and retested on Perlmutter, and the results look good. (At first all my runs were 'slow' because the node I was testing on seemed to be acting strangely; switching nodes resolved that.)

@zhangchonglin Would you mind trying on your machine? This branch https://github.com/SCOREC/pumi-pic/tree/cws/perfTests has the commit reverted.

cwsmith commented 8 months ago

Here are the current test results on an NVIDIA 3060 using https://github.com/SCOREC/pumi-pic/tree/cws/perfTests:

$ cat Testing/Temporary/CTestCostData.txt
viewComm_1 1 0.160911
viewComm_2 1 0.132697
viewComm_4 1 0.220739
type_test 1 0.119557
sort_test 1 0.0989485
scanTest 1 0.0974365
view_test 1 0.0997208
initParticles 1 0.10338
buildSCS 1 0.0947061
scs_padding 1 0.119062
lambdaTest 1 0.113555
write_ptcl_small 1 0.0877834
write_ptcl_small_4 1 0.164311
write_ptcl_4 1 0.185688
write_ptcl_empty 1 0.149241
write_ptcl_noptcls 1 0.145234
write_ptcl_medium 1 0.117534
write_ptcl_large 1 0.469619
test_structures_small 1 0.17026
test_structures_medium 1 0.269958
test_structures_large 1 1.50534
test_structures_small_4 1 1.56285
test_structures_4 1 1.71577
test_structures_empty 1 0.798312
test_structures_noptcls 1 0.837134
destroy_test 1 0.402936
barycentric_3 1 0.166155
test_adj_2d 1 0.659199
test_adj_3d 1 1.7997
search2d 1 0.204503
print_partition_cube_2 1 0.263603
ptn_loading_cube 1 0.414934
print_partition_cube_4 1 0.456839
ptn_loading_cube_4 1 0.416852
print_partition_pisces_4 1 0.442497
ptn_loading_pisces 1 0.427233
print_partition_2d_box_4 1 0.374372
ptn_loading_2d_box_4 1 0.429444
full_mesh_pisces 1 0.687399
input_construct_cube 1 0.649281
comm_array_pisces 1 0.978467
comm_array_2d_box 1 0.615361
file_rw_cube_4 1 0.591496
file_rw_xgc_24k_1 1 0.136416
file_rw_xgc_24k_4 1 0.3849
file_rw_xgc_120k_1 1 0.300448
file_rw_xgc_120k_4 1 0.611995
lb_r1 1 0.145082
lb_r4 1 0.533143
pseudoPushAndSearch_t1 1 0.266662
pseudoPushAndSearch_t2_r2 1 0.81152
pseudoPushAndSearch_cube_t1 1 0.291881
pseudoXGCm_scatter 1 0.176611
pseudoXGCm_24kElms 1 0.338749
pseudoXGCm_24kElms_4 1 3.69858
pseudoXGCm_120kElms 1 0.453264
pseudoXGCm_120kElms_4 1 1.77536

zhangchonglin commented 8 months ago

@cwsmith: The above commit seems to be the cause. Do you know why Angel made that change? The timing seems reasonable. I will test XGCm on Summit to see if it's the same case.

This is the test result using kokkos 3.7.02 with old PUMIPic commit d6a53c5.

      Start  1: viewComm_1
 1/52 Test  #1: viewComm_1 .......................   Passed    0.19 sec
      Start  2: viewComm_2
 2/52 Test  #2: viewComm_2 .......................   Passed    0.17 sec
      Start  3: viewComm_4
 3/52 Test  #3: viewComm_4 .......................   Passed    0.22 sec
      Start  4: type_test
 4/52 Test  #4: type_test ........................   Passed    0.14 sec
      Start  5: view_test
 5/52 Test  #5: view_test ........................   Passed    0.10 sec
      Start  6: initParticles
 6/52 Test  #6: initParticles ....................   Passed    0.14 sec
      Start  7: buildSCS
 7/52 Test  #7: buildSCS .........................   Passed    0.11 sec
      Start  8: scs_padding
 8/52 Test  #8: scs_padding ......................   Passed    0.12 sec
      Start  9: lambdaTest
 9/52 Test  #9: lambdaTest .......................   Passed    0.13 sec
      Start 10: write_ptcl_small
10/52 Test #10: write_ptcl_small .................   Passed    0.09 sec
      Start 11: write_ptcl_small_4
11/52 Test #11: write_ptcl_small_4 ...............   Passed    0.17 sec
      Start 12: write_ptcl_4
12/52 Test #12: write_ptcl_4 .....................   Passed    0.17 sec
      Start 13: write_ptcl_empty
13/52 Test #13: write_ptcl_empty .................   Passed    0.17 sec
      Start 14: write_ptcl_noptcls
14/52 Test #14: write_ptcl_noptcls ...............   Passed    0.17 sec
      Start 15: write_ptcl_medium
15/52 Test #15: write_ptcl_medium ................   Passed    0.13 sec
      Start 16: write_ptcl_large
16/52 Test #16: write_ptcl_large .................   Passed    0.60 sec
      Start 17: test_structures_small
17/52 Test #17: test_structures_small ............   Passed    0.14 sec
      Start 18: test_structures_medium
18/52 Test #18: test_structures_medium ...........   Passed    0.23 sec
      Start 19: test_structures_large
19/52 Test #19: test_structures_large ............   Passed    1.13 sec
      Start 20: test_structures_small_4
20/52 Test #20: test_structures_small_4 ..........   Passed    0.81 sec
      Start 21: test_structures_4
21/52 Test #21: test_structures_4 ................   Passed    0.86 sec
      Start 22: test_structures_empty
22/52 Test #22: test_structures_empty ............   Passed    0.42 sec
      Start 23: test_structures_noptcls
23/52 Test #23: test_structures_noptcls ..........   Passed    0.46 sec
      Start 24: destroy_test
24/52 Test #24: destroy_test .....................   Passed    0.31 sec
      Start 25: barycentric_3
25/52 Test #25: barycentric_3 ....................   Passed    0.14 sec
      Start 26: test_adj_2d
26/52 Test #26: test_adj_2d ......................   Passed    0.61 sec
      Start 27: test_adj_3d
27/52 Test #27: test_adj_3d ......................   Passed    1.44 sec
      Start 28: search2d
28/52 Test #28: search2d .........................   Passed    0.18 sec
      Start 29: print_partition_cube_2
29/52 Test #29: print_partition_cube_2 ...........   Passed    0.27 sec
      Start 30: ptn_loading_cube
30/52 Test #30: ptn_loading_cube .................   Passed    0.25 sec
      Start 31: print_partition_cube_4
31/52 Test #31: print_partition_cube_4 ...........   Passed    0.36 sec
      Start 32: ptn_loading_cube_4
32/52 Test #32: ptn_loading_cube_4 ...............   Passed    0.33 sec
      Start 33: print_partition_pisces_4
33/52 Test #33: print_partition_pisces_4 .........   Passed    0.38 sec
      Start 34: ptn_loading_pisces
34/52 Test #34: ptn_loading_pisces ...............   Passed    0.36 sec
      Start 35: full_mesh_pisces
35/52 Test #35: full_mesh_pisces .................   Passed    0.34 sec
      Start 36: input_construct_cube
36/52 Test #36: input_construct_cube .............   Passed    0.40 sec
      Start 37: comm_array_pisces
37/52 Test #37: comm_array_pisces ................   Passed    0.57 sec
      Start 38: file_rw_cube_4
38/52 Test #38: file_rw_cube_4 ...................   Passed    0.45 sec
      Start 39: file_rw_xgc_24k_1
39/52 Test #39: file_rw_xgc_24k_1 ................   Passed    0.15 sec
      Start 40: file_rw_xgc_24k_4
40/52 Test #40: file_rw_xgc_24k_4 ................   Passed    0.32 sec
      Start 41: file_rw_xgc_120k_1
41/52 Test #41: file_rw_xgc_120k_1 ...............   Passed    0.30 sec
      Start 42: file_rw_xgc_120k_4
42/52 Test #42: file_rw_xgc_120k_4 ...............   Passed    0.46 sec
      Start 43: lb_r1
43/52 Test #43: lb_r1 ............................   Passed    0.13 sec
      Start 44: lb_r4
44/52 Test #44: lb_r4 ............................   Passed    0.43 sec
      Start 45: pseudoPushAndSearch_t1
45/52 Test #45: pseudoPushAndSearch_t1 ...........   Passed    0.27 sec
      Start 46: pseudoPushAndSearch_t2_r2
46/52 Test #46: pseudoPushAndSearch_t2_r2 ........   Passed    0.69 sec
      Start 47: pseudoPushAndSearch_cube_t1
47/52 Test #47: pseudoPushAndSearch_cube_t1 ......   Passed    0.30 sec
      Start 48: pseudoXGCm_scatter
48/52 Test #48: pseudoXGCm_scatter ...............   Passed    0.15 sec
      Start 49: pseudoXGCm_24kElms
49/52 Test #49: pseudoXGCm_24kElms ...............   Passed    0.32 sec
      Start 50: pseudoXGCm_24kElms_4
50/52 Test #50: pseudoXGCm_24kElms_4 .............   Passed    2.59 sec
      Start 51: pseudoXGCm_120kElms
51/52 Test #51: pseudoXGCm_120kElms ..............   Passed    0.35 sec
      Start 52: pseudoXGCm_120kElms_4
52/52 Test #52: pseudoXGCm_120kElms_4 ............   Passed    1.24 sec
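
For a sense of scale, a few times from the log above can be set against the kokkos 4.2.00 log at the top of this issue. The following is a minimal C++ sketch of that arithmetic; the `Timing` struct and the `slowdown` and `reportedTimings` helpers are hypothetical names introduced only for illustration, and the numbers are copied from the two ctest logs (single runs, so the ratios are rough).

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper type: one test with its time under each configuration.
struct Timing {
  std::string test;
  double oldSec;  // kokkos 3.7.02 with PUMIPic commit d6a53c5
  double newSec;  // kokkos 4.2.00 with the latest PUMIPic commit
};

// Slowdown factor: how many times longer the newer configuration took.
double slowdown(const Timing& t) {
  assert(t.oldSec > 0.0);
  return t.newSec / t.oldSec;
}

// Times copied from the two ctest logs posted in this issue.
std::vector<Timing> reportedTimings() {
  return {
    {"lb_r1", 0.13, 5.18},
    {"pseudoPushAndSearch_t1", 0.27, 0.78},
    {"pseudoXGCm_24kElms", 0.32, 15.10},
    {"pseudoXGCm_120kElms", 0.35, 5.36},
  };
}
```

With these inputs, `slowdown` gives roughly 40x for lb_r1, 3x for pseudoPushAndSearch_t1, 47x for pseudoXGCm_24kElms, and 15x for pseudoXGCm_120kElms.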

This is the test result using kokkos 4.2.00 with newest PUMIPic commit 7b55b1b plus reverting the problematic commit.

      Start  1: viewComm_1
 1/57 Test  #1: viewComm_1 .......................   Passed    0.14 sec
      Start  2: viewComm_2
 2/57 Test  #2: viewComm_2 .......................   Passed    0.15 sec
      Start  3: viewComm_4
 3/57 Test  #3: viewComm_4 .......................   Passed    0.22 sec
      Start  4: type_test
 4/57 Test  #4: type_test ........................   Passed    0.12 sec
      Start  5: sort_test
 5/57 Test  #5: sort_test ........................   Passed    0.11 sec
      Start  6: scanTest
 6/57 Test  #6: scanTest .........................   Passed    0.14 sec
      Start  7: view_test
 7/57 Test  #7: view_test ........................   Passed    0.09 sec
      Start  8: initParticles
 8/57 Test  #8: initParticles ....................   Passed    0.10 sec
      Start  9: buildSCS
 9/57 Test  #9: buildSCS .........................   Passed    0.12 sec
      Start 10: scs_padding
10/57 Test #10: scs_padding ......................   Passed    0.12 sec
      Start 11: lambdaTest
11/57 Test #11: lambdaTest .......................   Passed    0.10 sec
      Start 12: write_ptcl_small
12/57 Test #12: write_ptcl_small .................   Passed    0.14 sec
      Start 13: write_ptcl_small_4
13/57 Test #13: write_ptcl_small_4 ...............   Passed    0.20 sec
      Start 14: write_ptcl_4
14/57 Test #14: write_ptcl_4 .....................   Passed    0.20 sec
      Start 15: write_ptcl_empty
15/57 Test #15: write_ptcl_empty .................   Passed    0.18 sec
      Start 16: write_ptcl_noptcls
16/57 Test #16: write_ptcl_noptcls ...............   Passed    0.20 sec
      Start 17: write_ptcl_medium
17/57 Test #17: write_ptcl_medium ................   Passed    0.15 sec
      Start 18: write_ptcl_large
18/57 Test #18: write_ptcl_large .................   Passed    0.61 sec
      Start 19: test_structures_small
19/57 Test #19: test_structures_small ............   Passed    0.11 sec
      Start 20: test_structures_medium
20/57 Test #20: test_structures_medium ...........   Passed    0.21 sec
      Start 21: test_structures_large
21/57 Test #21: test_structures_large ............   Passed    1.12 sec
      Start 22: test_structures_small_4
22/57 Test #22: test_structures_small_4 ..........   Passed    0.83 sec
      Start 23: test_structures_4
23/57 Test #23: test_structures_4 ................   Passed    0.90 sec
      Start 24: test_structures_empty
24/57 Test #24: test_structures_empty ............   Passed    0.46 sec
      Start 25: test_structures_noptcls
25/57 Test #25: test_structures_noptcls ..........   Passed    0.44 sec
      Start 26: destroy_test
26/57 Test #26: destroy_test .....................   Passed    0.30 sec
      Start 27: barycentric_3
27/57 Test #27: barycentric_3 ....................   Passed    0.14 sec
      Start 28: test_adj_2d
28/57 Test #28: test_adj_2d ......................   Passed    0.53 sec
      Start 29: test_adj_3d
29/57 Test #29: test_adj_3d ......................   Passed    1.44 sec
      Start 30: search2d
30/57 Test #30: search2d .........................   Passed    0.18 sec
      Start 31: print_partition_cube_2
31/57 Test #31: print_partition_cube_2 ...........   Passed    0.24 sec
      Start 32: ptn_loading_cube
32/57 Test #32: ptn_loading_cube .................   Passed    0.22 sec
      Start 33: print_partition_cube_4
33/57 Test #33: print_partition_cube_4 ...........   Passed    0.39 sec
      Start 34: ptn_loading_cube_4
34/57 Test #34: ptn_loading_cube_4 ...............   Passed    0.42 sec
      Start 35: print_partition_pisces_4
35/57 Test #35: print_partition_pisces_4 .........   Passed    0.42 sec
      Start 36: ptn_loading_pisces
36/57 Test #36: ptn_loading_pisces ...............   Passed    0.39 sec
      Start 37: print_partition_2d_box_4
37/57 Test #37: print_partition_2d_box_4 .........   Passed    0.37 sec
      Start 38: ptn_loading_2d_box_4
38/57 Test #38: ptn_loading_2d_box_4 .............   Passed    0.34 sec
      Start 39: full_mesh_pisces
39/57 Test #39: full_mesh_pisces .................   Passed    0.37 sec
      Start 40: input_construct_cube
40/57 Test #40: input_construct_cube .............   Passed    0.46 sec
      Start 41: comm_array_pisces
41/57 Test #41: comm_array_pisces ................   Passed    0.72 sec
      Start 42: comm_array_2d_box
42/57 Test #42: comm_array_2d_box ................   Passed    0.53 sec
      Start 43: file_rw_cube_4
43/57 Test #43: file_rw_cube_4 ...................   Passed    0.49 sec
      Start 44: file_rw_xgc_24k_1
44/57 Test #44: file_rw_xgc_24k_1 ................   Passed    0.17 sec
      Start 45: file_rw_xgc_24k_4
45/57 Test #45: file_rw_xgc_24k_4 ................   Passed    0.37 sec
      Start 46: file_rw_xgc_120k_1
46/57 Test #46: file_rw_xgc_120k_1 ...............   Passed    0.30 sec
      Start 47: file_rw_xgc_120k_4
47/57 Test #47: file_rw_xgc_120k_4 ...............   Passed    0.51 sec
      Start 48: lb_r1
48/57 Test #48: lb_r1 ............................   Passed    0.14 sec
      Start 49: lb_r4
49/57 Test #49: lb_r4 ............................   Passed    0.48 sec
      Start 50: pseudoPushAndSearch_t1
50/57 Test #50: pseudoPushAndSearch_t1 ...........   Passed    0.28 sec
      Start 51: pseudoPushAndSearch_t2_r2
51/57 Test #51: pseudoPushAndSearch_t2_r2 ........   Passed    0.70 sec
      Start 52: pseudoPushAndSearch_cube_t1
52/57 Test #52: pseudoPushAndSearch_cube_t1 ......   Passed    0.28 sec
      Start 53: pseudoXGCm_scatter
53/57 Test #53: pseudoXGCm_scatter ...............   Passed    0.14 sec
      Start 54: pseudoXGCm_24kElms
54/57 Test #54: pseudoXGCm_24kElms ...............   Passed    0.30 sec
      Start 55: pseudoXGCm_24kElms_4
55/57 Test #55: pseudoXGCm_24kElms_4 .............   Passed    2.93 sec
      Start 56: pseudoXGCm_120kElms
56/57 Test #56: pseudoXGCm_120kElms ..............   Passed    0.42 sec
      Start 57: pseudoXGCm_120kElms_4
57/57 Test #57: pseudoXGCm_120kElms_4 ............   Passed    1.39 sec

cwsmith commented 8 months ago

Great. Thanks for testing.

IIRC, there were test failures, or a memory leak, that the change was addressing.

Angelyr commented 8 months ago

@cwsmith This commit was meant to fix a performance issue that Dyhan noticed. The problem is that if sigma is greater than num_elems, no sorting happens.
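
To illustrate how a sigma larger than num_elems can turn into "no sorting", here is a CPU-only sketch in standard C++. It is not the code in SCS_sort.h: it assumes, purely for illustration, that elements are sorted by particle count inside windows of sigma elements and that the number of windows comes from an integer division. The name `windowedSort` and its `clampSigma` flag are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Returns element ids reordered so that, inside each full window of `sigma`
// consecutive elements, ids are in descending order of particle count.
// Assumption for illustration: the window count is numElems / sigma, so a
// trailing partial window is left unsorted.
std::vector<int> windowedSort(const std::vector<int>& ptclsPerElem,
                              int sigma, bool clampSigma) {
  assert(sigma >= 1);
  const int numElems = static_cast<int>(ptclsPerElem.size());
  if (clampSigma) {
    // The one-line change discussed in this issue.
    sigma = std::min(sigma, std::max(numElems, 1));
  }
  std::vector<int> ids(numElems);
  std::iota(ids.begin(), ids.end(), 0);
  const int numWindows = numElems / sigma;  // 0 whenever sigma > numElems
  for (int w = 0; w < numWindows; ++w) {
    const auto first = ids.begin() + w * sigma;
    std::stable_sort(first, first + sigma, [&](int a, int b) {
      return ptclsPerElem[a] > ptclsPerElem[b];
    });
  }
  return ids;
}
```

Under these assumptions, an unclamped sigma larger than the element count leaves the ids untouched, while the clamped version sorts all elements as a single window. That would also explain why the clamp adds sorting work to runs that previously skipped it.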

jacobmerson commented 8 months ago

On commit c17b75a9a5fde6ba815bfe68b9fac2adc64054d5 you probably should have used std::numeric_limits<lid_t>::max(), because the current code will break if you change the type of lid_t.

Since sigma = INT_MAX, won't the following line always evaluate to num_elems, given that num_elems will always be less than INT_MAX?

https://github.com/SCOREC/pumi-pic/blob/c17b75a9a5fde6ba815bfe68b9fac2adc64054d5/particle_structs/src/scs/SCS_sort.h#L18
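
A minimal sketch of the type-safe variant suggested above, assuming `lid_t` is a signed integer type (it is aliased to `int` here only so the snippet compiles on its own). The names `kMaxSigma` and `clampSigma` are hypothetical and do not appear in PUMIPic.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>

using lid_t = int;  // assumption: stand-in for PUMIPic's local id type

// Largest representable sigma; follows lid_t if the alias ever changes.
constexpr lid_t kMaxSigma = std::numeric_limits<lid_t>::max();

// Same expression as the one-line change, with the literal 1 typed as lid_t
// so that std::min and std::max deduce a single argument type.
lid_t clampSigma(lid_t sigma, lid_t num_elems) {
  assert(num_elems >= 0);
  return std::min(sigma, std::max(num_elems, lid_t{1}));
}
```

With sigma set to `kMaxSigma`, the result is num_elems for any non-empty mesh and 1 when num_elems is 0, which matches the reading of the line in the question above.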

Angelyr commented 8 months ago

@jacobmerson the line sigma=INT_MAX only affects a few of our tests and is not in our source code. I will start looking into this issue today.

Angelyr commented 8 months ago

@cwsmith I have been testing the code and I have found some issues and some solutions. I want to hear your thoughts.

You can read this file for reference: https://github.com/SCOREC/pumi-pic/blob/ac/thrust-sort/particle_structs/test/sortTest.cpp

Findings:

  1. The issue is that the kokkos sort-by-key algorithm is significantly slower than the thrust sort
  2. The regular kokkos sort is similar in speed to the thrust sort
  3. When sigma is low, the kokkos sort-by-key is significantly faster than thrust sort-by-key
  4. As sigma increases the kokkos sort-by-key gets significantly slower and the thrust sort-by-key gets faster

While reading the documentation I found this line, which made the kokkos sort-by-key significantly faster (15x); it now takes 1 s at 1M elements:

    int vectorLen = PolicyType::vector_length_max();

Could you explain why this helps, and is there a way to improve it further? Here are the docs for reference: https://kokkos.org/kokkos-core-wiki/API/algorithms/Sort.html

However, it is still slower than the thrust sort-by-key, which takes 0.0005 s at 1M elements.
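
For readers unfamiliar with the term, a sort-by-key sorts one array (the keys) and applies the same reordering to a second array (the values). The CPU-only sketch below shows those semantics with the standard library. It is neither the Kokkos nor the Thrust implementation and says nothing about their performance; `sortByKey` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sorts `keys` in ascending order and permutes `values` the same way.
// Entries with equal keys keep their original relative order.
void sortByKey(std::vector<int>& keys, std::vector<int>& values) {
  assert(keys.size() == values.size());
  // Build the permutation that sorts the keys.
  std::vector<std::size_t> perm(keys.size());
  std::iota(perm.begin(), perm.end(), std::size_t{0});
  std::stable_sort(perm.begin(), perm.end(),
                   [&](std::size_t a, std::size_t b) { return keys[a] < keys[b]; });
  // Apply the permutation to both arrays.
  std::vector<int> sortedKeys(keys.size());
  std::vector<int> sortedValues(values.size());
  for (std::size_t i = 0; i < perm.size(); ++i) {
    sortedKeys[i] = keys[perm[i]];
    sortedValues[i] = values[perm[i]];
  }
  keys.swap(sortedKeys);
  values.swap(sortedValues);
}
```

In the particle-structure setting the keys would plausibly be per-element particle counts and the values the element ids, but that mapping is an assumption here rather than something stated in the thread.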

zhangchonglin commented 8 months ago

@Angelyr: thank you for investigating this. This is a good discovery in that:

Should we also make the Kokkos developers aware of the issue, so they can address it on their side as well (apart from your fix in PUMIPic)?

Angelyr commented 7 months ago

@zhangchonglin I have a change that should resolve the issue on this branch. Feel free to test if you have time:

ac/thrust-sort

zhangchonglin commented 7 months ago

Thanks Angel! Will give it a try later!

zhangchonglin commented 7 months ago

A simple test shows that with your new branch, the time cost is on par with the old code. Only the pseudoXGCm_24kElms_4 test is about 10-15% slower. I need to test using XGCm with more particles to get reliable results.

      Start 50: pseudoPushAndSearch_t1
50/57 Test #50: pseudoPushAndSearch_t1 ...........   Passed    0.27 sec
      Start 51: pseudoPushAndSearch_t2_r2
51/57 Test #51: pseudoPushAndSearch_t2_r2 ........   Passed    0.89 sec
      Start 52: pseudoPushAndSearch_cube_t1
52/57 Test #52: pseudoPushAndSearch_cube_t1 ......   Passed    0.32 sec
      Start 53: pseudoXGCm_scatter
53/57 Test #53: pseudoXGCm_scatter ...............   Passed    0.14 sec
      Start 54: pseudoXGCm_24kElms
54/57 Test #54: pseudoXGCm_24kElms ...............   Passed    0.34 sec
      Start 55: pseudoXGCm_24kElms_4
55/57 Test #55: pseudoXGCm_24kElms_4 .............   Passed    3.92 sec
      Start 56: pseudoXGCm_120kElms
56/57 Test #56: pseudoXGCm_120kElms ..............   Passed    0.40 sec
      Start 57: pseudoXGCm_120kElms_4
57/57 Test #57: pseudoXGCm_120kElms_4 ............   Passed    1.68 sec

zhangchonglin commented 7 months ago

@Angelyr: with your fix, XGCm time cost is also consistent with kokkos 3.7.02 and earlier PUMIPic dated around June 2023. Thanks for fixing the issue.