OP-DSL / OP2-Common

OP2: open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs
https://op-dsl.github.io

OpenMP performance issue #151

Closed basegpu closed 5 years ago

basegpu commented 5 years ago

We recently encountered a strange performance pattern in the OpenMP backend, which we have never seen before: running our performance benchmark (solving the shallow water equations, circular dam break), we observed very long times spent in ONE kernel. The bigger the mesh and the more threads we use, the more pronounced the problem gets. What puzzles me is that we see it only on a few clusters, but not on local machines!

Have you ever seen/experienced anything similar, or does anybody have an idea what the reason for this behaviour might be? Any comments/ideas/feedback are much appreciated.

I have listed some information about the problem below (examples from running on 4 threads).

The common picture:

  count   plan time     MPI time(std)        time(std)           GB/s      GB/s   kernel name 
 -------------------------------------------------------------------------------------------
      1;    0.0000;    0.0000(  0.0000);    0.0020(  0.0000);   20.3814;         ;   InitCellHydFeatures 
   1629;    0.0000;    0.0000(  0.0000);   16.6781(  0.0000);   15.6346;  16.2859;   TimeStepCalculation 
   1629;    0.0000;    0.0000(  0.0000);    0.1602(  0.0000);   3.0701;  13.0268;   HydWallBFluxes 
   1629;    0.0000;    0.0000(  0.0000);   51.4494(  0.0000);   6.7690;   7.0689;   HydInternalFluxes 
   1629;    0.0000;    0.0000(  0.0000);   25.4469(  0.0000);  16.1506;  17.1526;   Update 
Total plan time:   1.2157

The anomalous picture:

  count   plan time     MPI time(std)        time(std)           GB/s      GB/s   kernel name 
 -------------------------------------------------------------------------------------------
      1;    0.0000;    0.0000(  0.0000);    0.0046(  0.0000);    8.6442;         ;   InitCellHydFeatures 
   1629;    0.0000;    0.0000(  0.0000);   15.6782(  0.0000);  16.6317;  17.3246;   TimeStepCalculation 
   1629;    0.0000;    0.0000(  0.0000);    0.1812(  0.0000);   2.7140;  11.5160;   HydWallBFluxes 
   1629;    0.0000;    0.0000(  0.0000);   49.9202(  0.0000);   6.9764;   7.2854;   HydInternalFluxes 
   1629;    0.0000;    0.0000(  0.0000);  107.2164(  0.0000);   3.8332;   4.0710;   Update 
Total plan time:   1.3787

The misbehaving kernel Update:

op_par_loop(
        kernel_GLOBAL_Update,
        this->model->getKernel(KernelID::UPDATE)->getNameC(),
        op_sets[SetID::CELLSALL],
        op_arg_gbl    (       &dt,                                                                               1, real_str, OP_READ ),
        op_arg_gbl    (       &mFlag,                                                                            1, "int",    OP_READ ),
        op_arg_dat    (       op_dats[DataID::CELLSALL_GEOMAREA],     -1, OP_ID,                                 1, real_str, OP_READ ),
        op_arg_dat    (       op_dats[DataID::EDGESALL_GEOMLENG], -N_E_C, op_maps[MapID::CELLSALL_EDGESALL],     1, real_str, OP_READ ),
        op_arg_dat    (       op_dats[DataID::EDGESALL_FLUXESHY], -N_E_C, op_maps[MapID::CELLSALL_EDGESALL], N_F_H, real_str, OP_READ ),
        op_opt_arg_dat(mFlag, op_dats[DataID::EDGESALL_FLUXESMO], -N_E_C, op_maps[MapID::CELLSALL_EDGESALL], N_F_M, real_str, OP_READ ),
        op_arg_dat    (       op_dats[DataID::CELLSALL_EDGESIGN],     -1, OP_ID,                             N_E_C, "int",    OP_READ ),
        op_arg_dat    (       op_dats[DataID::CELLSALL_HYDSTATE],     -1, OP_ID,                             N_V_H, real_str, OP_RW   ),
        op_arg_dat    (       op_dats[DataID::CELLSALL_BOTTOMEL],     -1, OP_ID,                             N_V_M, real_str, OP_RW   ),
        op_arg_dat    (       op_dats[DataID::CELLSALL_ISWETTED],     -1, OP_ID,                                 1, "int",    OP_WRITE),
        op_arg_gbl    (       &Running,                                                                          1, "int",    OP_WRITE)
    );
kernel routine with indirection: kernel_GLOBAL_Update
 new execution plan #3 for kernel Update
 number of blocks       = 3884 
 number of block colors = 1 
 maximum block size     = 256 
 average thread colors  = 1.00 
 shared memory required =  33.94 KB
 average data reuse     = 1.09 
 data transfer (used)   = 240.61 MB 
 data transfer (total)  = 255.53 MB 
 SoA/AoS transfer ratio = 1.02 
reguly commented 5 years ago

I have not really seen a single kernel slowing down to such an extent. Are you using OpenMP thread binding to cores? If not, that may cause performance issues with OpenMP (though this still sounds too severe). If you are running with the Intel runtime, you can set the environment variable KMP_AFFINITY=compact; with e.g. GNU, OMP_PROC_BIND=True should work.
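In practice these settings might look like the following before launching the benchmark; OMP_PLACES and OMP_DISPLAY_ENV are additional OpenMP 4.0+ variables beyond the two mentioned above, and the thread count matches the 4-thread runs in the logs.

```shell
# Pin 4 OpenMP threads to cores (GNU libgomp, or any OpenMP 4.0+ runtime)
export OMP_NUM_THREADS=4
export OMP_PROC_BIND=true
export OMP_PLACES=cores

# Intel runtime alternative:
# export KMP_AFFINITY=compact

# Ask the runtime to print its effective settings at startup, which is a
# quick way to confirm the binding actually applied on each cluster:
export OMP_DISPLAY_ENV=true

echo "OMP_PROC_BIND=$OMP_PROC_BIND OMP_PLACES=$OMP_PLACES"
```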

The machine you are running on has 4 logical cores? Or is this an MPI+OpenMP setup?

On 2018. Nov 29., at 21:20, Samuel Peter notifications@github.com wrote:


basegpu commented 5 years ago

The machine you are running on has 4 logical cores? Or is this an MPI+OpenMP setup?

Yes, a pure OpenMP setup with 4 logical cores.

OMP_PROC_BIND=True should work

Indeed, this improved things a lot! Obviously the binding settings were not optimal. We are now testing other optimizations at this level. Many thanks, I was not aware of these OpenMP variables!