OP-DSL / OP2-Common

OP2: open-source framework for the execution of unstructured grid applications on clusters of GPUs or multi-core CPUs
https://op-dsl.github.io
Other
98 stars 46 forks source link

invalid code generated for kernels with optional arguments only #214

Closed mbuergle closed 3 years ago

mbuergle commented 3 years ago

We found that code is not correctly generated if a kernel only requires optional arguments.

Below a part of the falsely generated kernel with four optional arguments. The generated code contains four times the checks if (arg0.opt) instead of also if (arg1.opt), etc.:


  int set_size = op_mpi_halo_exchanges(set, nargs, args);

  if (set_size > 0) {

    for ( int n=0; n<set_size; n++ ){
      if (n<set->core_size && n>0 && n % OP_mpi_test_frequency == 0)
        op_mpi_test_all(nargs,args);
      if (n==set->core_size) {
        op_mpi_wait_all(nargs, args);
      }
      int map0idx;
      if (arg0.opt) {
        map0idx = arg0.map_data[n * arg0.map->dim + 0];
      }
      if (arg0.opt) {
        map0idx = arg0.map_data[n * arg0.map->dim + 0];
      }
      if (arg0.opt) {
        map0idx = arg0.map_data[n * arg0.map->dim + 0];
      }
      if (arg0.opt) {
        map0idx = arg0.map_data[n * arg0.map->dim + 0];
      }

      kernel_MORPHOLOGY_WallFluxes(
        &((double*)arg0.data)[1 * map0idx],
        &((double*)arg1.data)[2 * map0idx],
        &((double*)arg2.data)[3 * map0idx],
        &((double*)arg3.data)[4 * map0idx]);
    }
  }
reguly commented 3 years ago

Thanks - I pushed a fix to fix/optional, #215 - please check if that fixes your issue. Note that arg0.map_data is still valid even if arg0 is optional and not present, if the map it uses is present.

Cheers, Istvan

On 2021. Aug 30., at 17:01, mbuergle @.***> wrote:

We found that code is not correctly generated if a kernel only requires optional arguments.

Part of the falsely generated kernel with 4 optional arguments. The generated code contains four times the checks if (arg0.opt) instead of also if (arg1.opt), etc.:

int set_size = op_mpi_halo_exchanges(set, nargs, args);

if (set_size > 0) {

for ( int n=0; n<set_size; n++ ){
  if (n<set->core_size && n>0 && n % OP_mpi_test_frequency == 0)
    op_mpi_test_all(nargs,args);
  if (n==set->core_size) {
    op_mpi_wait_all(nargs, args);
  }
  int map0idx;
  if (arg0.opt) {
    map0idx = arg0.map_data[n * arg0.map->dim + 0];
  }
  if (arg0.opt) {
    map0idx = arg0.map_data[n * arg0.map->dim + 0];
  }
  if (arg0.opt) {
    map0idx = arg0.map_data[n * arg0.map->dim + 0];
  }
  if (arg0.opt) {
    map0idx = arg0.map_data[n * arg0.map->dim + 0];
  }

  kernel_MORPHOLOGY_WallFluxes(
    &((double*)arg0.data)[1 * map0idx],
    &((double*)arg1.data)[2 * map0idx],
    &((double*)arg2.data)[3 * map0idx],
    &((double*)arg3.data)[4 * map0idx]);
}

} — You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/OP-DSL/OP2-Common/issues/214, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAJWVVPS5XLCJ5S632F7NY3T7OMMLANCNFSM5DCB7YKA. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

mbuergle commented 3 years ago

Thank you for the fix. It solve the original problem, but now leads to problems for another kernel. In the generated code below, checks are introduced for arguments that are not declared in the scope of the kernel (e.g. arg6, etc.).

void op_par_loop_kernel_MORPHOLOGY_CalcBedGradient(char const *name, op_set set,
  op_arg arg0,
  op_arg arg1,
  op_arg arg4,
  op_arg arg5,
  op_arg arg8,
  op_arg arg11,
  op_arg arg12,
  op_arg arg15){

  int nargs = 16;
  op_arg args[16];

  args[0] = arg0;
  arg1.idx = 0;
  args[1] = arg1;
  for ( int v=1; v<3; v++ ){
    args[1 + v] = op_opt_arg_dat(arg1.opt, arg1.dat, v, arg1.map, 2, "double", OP_READ);
  }

  args[4] = arg4;
  arg5.idx = 0;
  args[5] = arg5;
  for ( int v=1; v<3; v++ ){
    args[5 + v] = op_opt_arg_dat(arg5.opt, arg5.dat, v, arg5.map, 1, "double", OP_READ);
  }

  arg8.idx = 0;
  args[8] = arg8;
  for ( int v=1; v<3; v++ ){
    args[8 + v] = op_opt_arg_dat(arg8.opt, arg8.dat, v, arg8.map, 2, "double", OP_READ);
  }

  args[11] = arg11;
  arg12.idx = 0;
  args[12] = arg12;
  for ( int v=1; v<3; v++ ){
    args[12 + v] = op_arg_dat(arg12.dat, v, arg12.map, 1, "double", OP_READ);
  }

  args[15] = arg15;

  // initialise timers
  double cpu_t1, cpu_t2, wall_t1, wall_t2;
  op_timing_realloc(19);
  op_timers_core(&cpu_t1, &wall_t1);

  if (OP_diags>2) {
    printf(" kernel routine with indirection: kernel_MORPHOLOGY_CalcBedGradient\n");
  }

  int set_size = op_mpi_halo_exchanges(set, nargs, args);

  if (set_size > 0) {

    for ( int n=0; n<set_size; n++ ){
      if (n<set->core_size && n>0 && n % OP_mpi_test_frequency == 0)
        op_mpi_test_all(nargs,args);
      if (n==set->core_size) {
        op_mpi_wait_all(nargs, args);
      }
      int map1idx;
      int map2idx;
      int map3idx;
      int map5idx;
      int map6idx;
      int map7idx;
      map1idx = arg1.map_data[n * arg1.map->dim + 0];
      map2idx = arg1.map_data[n * arg1.map->dim + 1];
      map3idx = arg1.map_data[n * arg1.map->dim + 2];
      if (arg5.opt) {
        map5idx = arg5.map_data[n * arg5.map->dim + 0];
      }
      if (arg6.opt) {
        map6idx = arg5.map_data[n * arg5.map->dim + 1];
      }
      if (arg7.opt) {
        map7idx = arg5.map_data[n * arg5.map->dim + 2];
      }
      if (arg8.opt) {
        map5idx = arg5.map_data[n * arg5.map->dim + 0];
      }
      if (arg9.opt) {
        map6idx = arg5.map_data[n * arg5.map->dim + 1];
      }
      if (arg10.opt) {
        map7idx = arg5.map_data[n * arg5.map->dim + 2];
      }
reguly commented 3 years ago

Apologies, I forgot about vectorised arguments... Should be fixed now.

mbuergle commented 3 years ago

Thank you for the quick fix. It solved the latest problem. However, I still fail to link the mixed backend executable. Below the resulting output. We will check if the problem lies on our side.

externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_halo_exchanges':
op_dummy_singlenode.c:(.text+0x0): multiple definition of `op_mpi_halo_exchanges'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x31b0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_halo_exchanges_grouped':
op_dummy_singlenode.c:(.text+0x10): multiple definition of `op_mpi_halo_exchanges_grouped'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3230): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_set_dirtybit':
op_dummy_singlenode.c:(.text+0x20): multiple definition of `op_mpi_set_dirtybit'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3320): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_wait_all':
op_dummy_singlenode.c:(.text+0x30): multiple definition of `op_mpi_wait_all'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3380): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_wait_all_grouped':
op_dummy_singlenode.c:(.text+0x40): multiple definition of `op_mpi_wait_all_grouped'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3390): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_halo_exchanges_cuda':
op_dummy_singlenode.c:(.text+0x70): multiple definition of `op_mpi_halo_exchanges_cuda'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x33a0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_set_dirtybit_cuda':
op_dummy_singlenode.c:(.text+0x80): multiple definition of `op_mpi_set_dirtybit_cuda'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3420): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_wait_all_cuda':
op_dummy_singlenode.c:(.text+0x90): multiple definition of `op_mpi_wait_all_cuda'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3480): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_reset_halos':
op_dummy_singlenode.c:(.text+0xa0): multiple definition of `op_mpi_reset_halos'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3490): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_barrier':
op_dummy_singlenode.c:(.text+0xb0): multiple definition of `op_mpi_barrier'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x34a0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_perf_time':
op_dummy_singlenode.c:(.text+0xc0): multiple definition of `op_mpi_perf_time'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x34b0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_reduce_float':
op_dummy_singlenode.c:(.text+0xe0): multiple definition of `op_mpi_reduce_float'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x34c0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_reduce_double':
op_dummy_singlenode.c:(.text+0xf0): multiple definition of `op_mpi_reduce_double'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x34d0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_reduce_int':
op_dummy_singlenode.c:(.text+0x100): multiple definition of `op_mpi_reduce_int'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x34e0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_mpi_reduce_bool':
op_dummy_singlenode.c:(.text+0x110): multiple definition of `op_mpi_reduce_bool'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x34f0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_partition':
op_dummy_singlenode.c:(.text+0x120): multiple definition of `op_partition'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3500): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_partition_ptr(char const*, char const*, op_set_core*, int*, double*)':
op_dummy_singlenode.c:(.text+0x130): multiple definition of `op_partition_ptr(char const*, char const*, op_set_core*, int*, double*)'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3510): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_renumber':
op_dummy_singlenode.c:(.text+0x140): multiple definition of `op_renumber'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0x770): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_renumber_ptr':
op_dummy_singlenode.c:(.text+0x150): multiple definition of `op_renumber_ptr'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0x780): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_compute_moment':
op_dummy_singlenode.c:(.text+0x160): multiple definition of `op_compute_moment'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3530): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_compute_moment_across_times':
op_dummy_singlenode.c:(.text+0x170): multiple definition of `op_compute_moment_across_times'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3540): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_partition_reverse()':
op_dummy_singlenode.c:(.text+0x220): multiple definition of `op_partition_reverse()'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x3520): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_is_root':
op_dummy_singlenode.c:(.text+0x250): multiple definition of `op_is_root'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x35f0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `getHybridGPU()':
op_dummy_singlenode.c:(.text+0x260): multiple definition of `getHybridGPU()'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0x790): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_theta_init(op_export_core*, int*, double*, double*, double*)':
op_dummy_singlenode.c:(.text+0x270): multiple definition of `op_theta_init(op_export_core*, int*, double*, double*, double*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xcd0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_inc_theta(op_export_core*, int*, double*, double*)':
op_dummy_singlenode.c:(.text+0x280): multiple definition of `op_inc_theta(op_export_core*, int*, double*, double*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xce0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_import_init_size(int, int*, op_dat_core*)':
op_dummy_singlenode.c:(.text+0x290): multiple definition of `op_import_init_size(int, int*, op_dat_core*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xca0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_import_init(op_export_core*, op_dat_core*, op_dat_core*)':
op_dummy_singlenode.c:(.text+0x2a0): multiple definition of `op_import_init(op_export_core*, op_dat_core*, op_dat_core*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xcb0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_export_init(int, int*, op_map_core*, op_set_core*, op_dat_core*, op_dat_core*)':
op_dummy_singlenode.c:(.text+0x2b0): multiple definition of `op_export_init(int, int*, op_map_core*, op_set_core*, op_dat_core*, op_dat_core*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xcc0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_export_data(op_export_core*, op_dat_core*)':
op_dummy_singlenode.c:(.text+0x2c0): multiple definition of `op_export_data(op_export_core*, op_dat_core*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xcf0): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `op_import_data(op_import_core*, op_dat_core*)':
op_dummy_singlenode.c:(.text+0x2d0): multiple definition of `op_import_data(op_import_core*, op_dat_core*)'
externalLibs/libop2_cuda.a(op_cuda_decl.c.o):tmpxft_00004df6_00000000-5_op_cuda_decl.cudafe1.cpp:(.text+0xd00): first defined here
externalLibs/libop2_core.a(op_dummy_singlenode.c.o): In function `deviceSync':
op_dummy_singlenode.c:(.text+0x2e0): multiple definition of `deviceSync'
externalLibs/libop2_cuda.a(op_cuda_rt_support.c.o):tmpxft_00004df4_00000000-5_op_cuda_rt_support.cudafe1.cpp:(.text+0x2ff0): first defined here
collect2: error: ld returned 1 exit status
reguly commented 3 years ago

I think it's that libop2_core and libop2_cuda should not be both linked, just libop2_cuda

reguly commented 3 years ago

This is now merged in

mbuergle commented 3 years ago

Thank you for the hint with libop2_core and libop2_cuda.