UO-OACISS / apex

Autonomic Performance Environment for eXascale (APEX)
Other
38 stars 24 forks source link

Idle rate is not accurate when running with multiple ranks using HPX. #175

Open JiakunYan opened 1 year ago

JiakunYan commented 1 year ago

I am running Octotiger + HPX + APEX with 32 ranks, and I found the idle rate is unreasonably high. Here is the output:

Start Date/Time: 02/05/2023 20:17:12
Elapsed time: 16.9241 seconds
Total processes detected: 32
HW Threads detected on rank 0: 128
Worker Threads observed on rank 0: 128
Available CPU time on rank 0: 2166.28 seconds
Available CPU time on all ranks: 69321 seconds

Counter                                              :  #samp | minimum |    mean  |  maximum |  stddev 
---------------------------------------------------------------------------------------------------------
                               1 Minute Load average :     15       1.00      15.44      25.97      10.56   
                                      Bytes Received : 451251       0.00       0.00       0.00       0.00   
                                          Bytes Sent : 452737     162.00    5021.02   3.47e+05    8650.05   
                                         CPU Guest % :     16       0.00       0.00       0.00       0.00   
                                      CPU I/O Wait % :     16       0.00       0.00       0.02       0.00   
                                           CPU IRQ % :     16       0.29       1.11       1.49       0.34   
                                          CPU Idle % :     16       0.58       8.30      13.68       3.17   
                                          CPU Nice % :     16       0.00       0.00       0.00       0.00   
                                         CPU Steal % :     16       0.00       0.00       0.00       0.00   
                                        CPU System % :     16       0.01       1.38       2.01       0.44   
                                          CPU User % :     16      84.13      89.11      99.09       3.46   
                                      CPU soft IRQ % :     16       0.03       0.10       0.13       0.03   
                                         DRAM Energy :     16       2.00       3.12       4.00       0.48   
                                    Package-0 Energy :     16     159.00     199.56     222.00      24.42   
                                      status:Threads :     17       9.00     132.29     140.00      30.82   
                                    status:VmData kB :     17   1.20e+06   8.74e+06   1.05e+07   2.64e+06   
                                     status:VmExe kB :     17     556.00     556.00     556.00       0.00   
                                     status:VmHWM kB :     17   9.19e+05   4.24e+06   5.10e+06   1.28e+06   
                                     status:VmLck kB :     17       0.00       0.00       0.00       0.00   
                                     status:VmLib kB :     17   4.84e+04   4.84e+04   4.84e+04       0.00   
                                     status:VmPTE kB :     17    2124.00   1.66e+04   1.98e+04    5014.09   
                                    status:VmPeak kB :     17   1.36e+06   8.90e+06   1.06e+07   2.64e+06   
                                     status:VmPin kB :     17   8.37e+05   1.04e+06   1.11e+06   9.16e+04   
                                     status:VmRSS kB :     17   9.19e+05   4.20e+06   5.09e+06   1.29e+06   
                                    status:VmSize kB :     17   1.36e+06   8.90e+06   1.06e+07   2.64e+06   
                                     status:VmStk kB :     17     168.00     168.00     168.00       0.00   
                                    status:VmSwap kB :     17       0.00       0.00       0.00       0.00   
                   status:nonvoluntary_ctxt_switches :     17     633.00     675.35     678.00      10.59   
                      status:voluntary_ctxt_switches :     17    2189.00    4135.35    4257.00     486.59   
---------------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls| #yields| #thread|     mean |    total |  tot/thr | %total | %wall 
--------------------------------------------------------------------------------------------------------------------------------
                                           APEX MAIN :      1        0        1      16.92      16.92      16.92   100.0   100.0
              parallel_policy_executor::sync_execute :   7196   167308      127       0.04     301.26       2.37    13.9    14.0
        node_server::nonrefined_step::compute_fluxes :      1    86601      127     167.28     167.28       1.32     7.7     7.8
                           solve_gravity_action_type :      1    39983      127      49.15      49.15       0.39     2.3     2.3
          all_hydro_bounds::collect_hydro_boundaries :      1   381729      127      23.48      23.48       0.18     1.1     1.1
         timestep_driver_descend::set_local_timestep :      1    25602      127      18.41      18.41       0.14     0.8     0.9
                        compare_analytic_action_type :    994      422      127       0.02      15.88       0.13     0.7     0.7
         all_hydro_bounds::send_hydro_amr_boundaries :  19604    26756      127       0.00      12.70       0.10     0.6     0.6
node_server::collect_hydro_boundaries::set_hydro_bo… : 144890   140715      127       0.00       8.29       0.07     0.4     0.4
                   node_server::create_child::lambda :    864      866      127       0.01       7.07       0.06     0.3     0.3
collect_hydro_boundaries::complete_hydro_amr_bounda… :      1    27106      127       5.06       5.06       0.04     0.2     0.2
    all_hydro_bounds::exchange_interlevel_hydro_data :      1    34771      127       4.37       4.37       0.03     0.2     0.2
                               form_tree_action_type :   4212    79700      127       0.00       4.09       0.03     0.2     0.2
                        node_server::form_tree::sync :  18958    19559      127       0.00       3.51       0.03     0.2     0.2
                          regrid_scatter_action_type :   4212    21811      127       0.00       1.97       0.02     0.1     0.1
node_server::compute_fmm::gather_from::child_gravit… :  43992    43702      127       0.00       1.76       0.01     0.1     0.1
                                    step_action_type :   1988     4959      127       0.00       1.41       0.01     0.1     0.1
                                     background_work :   2686     4767      127       0.00       1.35       0.01     0.1     0.1
                           regrid_gather_action_type :   3218      874      127       0.00       1.33       0.01     0.1     0.1
                                               async :   1906     1067      127       0.00       1.02       0.01     0.0     0.0
                    check_for_refinement_action_type :      1     3895      127       0.98       0.98       0.01     0.0     0.0
                    force_nodes_to_exist_action_type :   4437     2535      127       0.00       0.98       0.01     0.0     0.0
node_server::exchange_flux_corrections::set_flux_re… :  15120    15051      127       0.00       0.78       0.01     0.0     0.0
        node_server::exchange_flux_corrections::sync :  15624    16078      127       0.00       0.72       0.01     0.0     0.0
                node_server::nonrefined_step::update :   5208        0      127       0.00       0.67       0.01     0.0     0.0
N3hpx10components6server23create_component_actionI1… :    163        0       90       0.00       0.65       0.01     0.0     0.0
                        symbol_namespace_bind_action :     62       14       61       0.00       0.15       0.00     0.0     0.0
           primary_namespace_decrement_credit_action :   1541     1622      127       0.00       0.10       0.00     0.0     0.0
                                   initialize_action :      1        0        1       0.09       0.09       0.09     0.0     0.6
              component_namespace_bind_prefix_action :    806      183      127       0.00       0.09       0.00     0.0     0.0
                component_namespace_bind_name_action :    682      175      127       0.00       0.08       0.00     0.0     0.0
                            local_step::execute_step :      1    11856      127       0.04       0.04       0.00     0.0     0.0
                       call_startup_functions_action :      2        0        1       0.01       0.02       0.02     0.0     0.1
                                          run_helper :      1       86       36       0.02       0.02       0.00     0.0     0.0
                    symbol_namespace_on_event_action :     31      107       63       0.00       0.01       0.00     0.0     0.0
                                 shutdown_all_action :      2        4        5       0.00       0.01       0.00     0.0     0.0
                              load_components_action :      1        0        1       0.00       0.00       0.00     0.0     0.0
                              cleanup_buffers_action :      1        0        1       0.00       0.00       0.00     0.0     0.0
                                            hpx_main :      1        2        2       0.00       0.00       0.00     0.0     0.0
                         broadcast_initialize_action :      1        1        1       0.00       0.00       0.00     0.0     0.0
                        copy_to_locality_action_type :      2        0        1       0.00       0.00       0.00     0.0     0.0
           primary_namespace_increment_credit_action :      4        0        3       0.00       0.00       0.00     0.0     0.0
                                node_client::get_ptr :      2        2        2       0.00       0.00       0.00     0.0     0.0
                         dijkstra_termination_action :      2        4        5       0.00       0.00       0.00     0.0     0.0
                   primary_namespace_colocate_action :      1        0        1       0.00       0.00       0.00     0.0     0.0
--------------------------------------------------------------------------------------------------------------------------------

                                           APEX Idle :                       6.87e+04    99.1

--------------------------------------------------------------------------------------------------------------------------------
                                        Total timers : 298,423

The idle rate is 99.1%.

After looking at the code of APEX, I found the following lines:

double all_total_main = total_main * apex::instance()->get_num_ranks();
double idle_rate = all_total_main - total_accumulated;
...
screen_output << string_format(FORMAT_PERCENT,
          ((idle_rate/all_total_main)*100)) << endl;

I don't understand why APEX chooses to multiply the total_main with rank num. I didn't find any evidence that APEX aggregates timers across ranks, does it?

Another quick question: is there a way to get the report from all HPX ranks, instead of just rank 0?

khuck commented 1 year ago

The APEX Idle timer is not a rate or percentage, it is an accumulated time. I recently reformatted the output and did not align those columns in the right place to line up with the other timers, sorry for the confusion.

A few other observations:

Total processes detected: 32
HW Threads detected on rank 0: 128
Worker Threads observed on rank 0: 128
Available CPU time on rank 0: 2166.28 seconds
Available CPU time on all ranks: 69321 seconds

What this part of the report is telling you is that you are using 128 threads per process, and 32 total processes. That's fine, but you're likely using hyper threading which may or may not be more efficient than running without hyper threading.

The timer report is aggregated and averaged across all ranks. The total main time represents all time from when rank 0 started until it ended, and APEX assumes all processes had the same lifetime.

If you want the data from all ranks and you are using apex_exec, you can add the --apex:csv flag. If you are not using apex_exec, set the environment variable export APEX_CSV_OUTPUT=1 before running. Then you will get a file called apex_profiles.csv which contains all the performance data from all ranks. You can post-process that data with the apex-summary.py python script.

JiakunYan commented 1 year ago

The APEX Idle timer is not a rate or percentage, it is an accumulated time.

I understood. I was referring to the percentage after the idle rate number 99.1%. I am quite sure this percentage is not correct, since if I run it with only one rank, I can get a percentage of 20%~30%, and the application scales well to 32 ranks.

but you're likely using hyper threading which may or may not be more efficient than running without hyper threading.

Actually, the compute nodes I used has 128 cores, so I was not using hyper-threading.

The timer report is aggregated and averaged across all ranks.

Could you point me to the code that does this? I failed to find any communication across ranks.

If you want the data from all ranks and you are using apex_exec, you can add the --apex:csv flag. If you are not using apex_exec, set the environment variable export APEX_CSV_OUTPUT=1 before running. Then you will get a file called apex_profiles.csv which contains all the performance data from all ranks. You can post-process that data with the apex-summary.py python script.

Thanks! This is very helpful!

khuck commented 1 year ago

see https://github.com/UO-OACISS/apex/blob/develop/src/apex/profile_reducer.cpp for the profile reduction routines...

I should mention that the data reduction will only work if HPX is using the MPI parcel port. I haven't added support for other HPX parcel ports (TCP, libfabric, etc)

JiakunYan commented 1 year ago

I should mention that the data reduction will only work if HPX is using the MPI parcel port. I haven't added support for other HPX parcel ports (TCP, libfabric, etc)

Make sense! Since I was not using the MPI parcelport...

Thank you so much!

JiakunYan commented 1 year ago

Just curious: in the reduce_profiles_for_screen function, is there any reason APEX doesn't choose to initialize MPI when it finds MPI is not initialized? I have both MPI and LCI parcelport compiled into HPX, so it should be able to work if APEX calls MPI_Init.