E3SM-Project / EKAT

Tools and libraries for writing Kokkos-enabled HPC C++ in the E3SM ecosystem

Potential problem with ExeSpaceUtils view_reduction and parallel_reduce #254

Open · jgfouca opened 2 years ago

jgfouca commented 2 years ago

Describe the bug This was discovered when porting shoc_energy_integrals to small kernels. I was getting large differences in the outputs of the view_reduction calls when num_threads > 1. I suspect the problem is in the handling of the garbage values in the last pack, because the problem went away when nlev % pack_size == 0.
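For illustration, here is a minimal sketch in plain Kokkos of how uninitialized lanes in a partially filled last pack can leak into a reduction unless they are masked out. This is not the actual EKAT view_reduction implementation; pack_size, the packed view layout, and packed_sum are all hypothetical.

```c++
#include <Kokkos_Core.hpp>

// Hypothetical fixed pack width; EKAT uses ekat::Pack for this.
constexpr int pack_size = 8;

// Sum nlev scalars stored across npacks packs of width pack_size.
// When nlev % pack_size != 0, the tail lanes of the last pack hold
// uninitialized garbage, so they must be excluded from the sum.
double packed_sum (Kokkos::View<double**> packs, const int nlev) {
  const int npacks = packs.extent_int(0);
  double sum = 0;
  Kokkos::parallel_reduce(
    npacks,
    KOKKOS_LAMBDA (const int ipack, double& local) {
      // Valid lanes: pack_size everywhere except possibly the last pack.
      const int valid = (ipack == npacks-1 && nlev % pack_size != 0)
                        ? nlev % pack_size : pack_size;
      for (int lane = 0; lane < valid; ++lane) {
        local += packs(ipack, lane);
      }
    }, sum);
  return sum;
}
```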

To Reproduce Steps to reproduce the behavior:

  1. Switch shoc_energy_integrals to the implementation it had before the small-kernel PR, i.e. the one that uses view_reduction.
  2. Build SCREAM with -DSCREAM_SMALL_KERNELS=On -DCMAKE_BUILD_TYPE=Debug
  3. run OMP_NUM_THREADS=16 ./shoc_tests shoc_main_bfb
  4. This should fail because the results are not BFB (bit-for-bit) with Fortran. You can add print statements to confirm that the se_int, ke_int, wv_int, and wl_int values do not match Fortran, which leads to different results later in SHOC for the output views.

Expected behavior view_reduction should have produced BFB results with Fortran.

jgfouca commented 2 years ago

Upon a second look, nlev % pack_size != 0 is not necessary to demonstrate the error, but it does make the errors more frequent. When I switched to ExeSpaceUtils::parallel_reduce on scalarized views, I had similar problems until I made sure each thread had a local variable passed to the reducer. The local-variable approach also fixed view_reduction.
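For reference, here is a minimal sketch of the per-thread local variable pattern described above, written in plain Kokkos rather than the ExeSpaceUtils API; column_sums, data, and sums are hypothetical names. The inner parallel_reduce gives each team thread its own local accumulator, and the combined result is written out once per team.

```c++
#include <Kokkos_Core.hpp>

using TeamPolicy = Kokkos::TeamPolicy<>;
using Member     = TeamPolicy::member_type;

// Reduce over the nlev dimension of data (ncols x nlev), one team
// per column, storing each column's sum in sums (ncols).
void column_sums (Kokkos::View<double**> data, Kokkos::View<double*> sums) {
  const int nlev = data.extent_int(1);
  Kokkos::parallel_for(
    TeamPolicy(data.extent_int(0), Kokkos::AUTO),
    KOKKOS_LAMBDA (const Member& team) {
      const int icol = team.league_rank();
      double col_sum = 0; // one accumulator per team, local to the lambda
      Kokkos::parallel_reduce(
        Kokkos::TeamThreadRange(team, nlev),
        [&] (const int k, double& local) {
          // 'local' is the per-thread partial sum; Kokkos combines
          // the threads' partials into col_sum without races.
          local += data(icol, k);
        }, col_sum);
      // Write the team's result exactly once.
      Kokkos::single(Kokkos::PerTeam(team), [&] () {
        sums(icol) = col_sum;
      });
    });
}
```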

jgfouca commented 2 years ago

I should also note that the error only occurs when team_size > 1, which is what you get when MIMIC_GPU is on (team size 7); MIMIC_GPU is on by default for Debug builds.

jgfouca commented 2 years ago

I believe the problems with ExeSpaceUtils::parallel_reduce were not fixed. @bartgol, correct me if I'm wrong.

bartgol commented 2 years ago

Yes, you're right. I was working on completing it last Friday but did not finish by the end of the week. I should be done today.

bartgol commented 1 year ago

I think this was completed in #258. Closing.