Exawind / nalu-wind

Solver for wind farm simulations targeting exascale computational platforms
https://exawind.github.io/nalu-wind/
Other
125 stars 85 forks source link

GPU issue with SCS_SHIFTED_GRAD_OP and SCS_GRAD_OP #1302

Closed marchdf closed 3 weeks ago

marchdf commented 1 month ago

I am having an issue with the following unit tests ./unittestX --gtest_filter="WallDistKernelHex8Mesh.NGP_wall_dist*"

They are throwing:

[==========] Running 3 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 3 tests from WallDistKernelHex8Mesh
[ RUN      ] WallDistKernelHex8Mesh.NGP_wall_dist
/mnt/vdb/home/mhenryde/exawind/exawind-manager/environments/nalu-wind-gpu-test/nalu-wind/unit_tests/kernels/UnitTestKernelUtils.h:219: Failure
The difference between hostCalcValue(i, j) and exactValue[i][j] is 0.421875, which exceeds tol, where
hostCalcValue(i, j) evaluates to 0,
exactValue[i][j] evaluates to 0.421875, and
tol evaluates to 1.0000000000000001e-15.
/mnt/vdb/home/mhenryde/exawind/exawind-manager/environments/nalu-wind-gpu-test/nalu-wind/unit_tests/kernels/UnitTestKernelUtils.h:219: Failure
The difference between hostCalcValue(i, j) and exactValue[i][j] is 0.046875, which exceeds tol, where
hostCalcValue(i, j) evaluates to 0,
exactValue[i][j] evaluates to -0.046875, and
tol evaluates to 1.0000000000000001e-15.
/mnt/vdb/home/mhenryde/exawind/exawind-manager/environments/nalu-wind-gpu-test/nalu-wind/unit_tests/kernels/UnitTestKernelUtils.h:219: Failure

In digging further it appears that v_dndx(ip, ic, j) here is zero (v_scs_areav(ip, j) is non-zero). Why is that?

v_dndx comes from const auto& v_dndx = shiftPoisson_ ? meViews.dndx_shifted : meViews.dndx;, which should be defined by

  auto gradOp = shiftPoisson_ ? SCS_SHIFTED_GRAD_OP : SCS_GRAD_OP;
  dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES);

But why would those be zero? This test seems fine on CPU.

I did fix a (minor) thing in the way the expect_all_near was being called. But that's not the issue. See PR #1301.

Tagging @psakievich, @rcknaus and @alanw0 as people who might know what's going on.

marchdf commented 1 month ago

Of course this looks fine with a Release build... The above was a Debug build.

alanw0 commented 1 month ago

Marc, my initial thought was that there is a missing 'field.sync_to_host()' somewhere, which would explain why the 'hostCalcValue' might be zero when it is expected to be non-zero. But I certainly don't know why it would be different between debug and release... At the point where it is checking those values, you could add EXPECT_FALSE(field.need_sync_to_host()); (whatever the actual 'field' variable is). Recall the usage pattern for these fields should always be to sync it to the current/desired memory space before accessing, and if writing to it, call field.modify_on_device()/modify_on_host(). The 'smart field' idea that Phil was deploying would automate that pattern, but I don't think he got that deployed everywhere yet.

marchdf commented 1 month ago

Hi Alan, thanks for getting back to me so quickly. I will add the check you suggest.

marchdf commented 1 month ago

@alanw0 I tried implementing the check but the places where this breaks don't have a need_sync_to_host() member function.

The problem really is that v_dndx = 0 in this expression lhsfac += v_dndx(ip, ic, j) * v_scs_areav(ip, j); (here). If I just hard code that to some value (or just leave v_scs_areav) then the hostCalcValue is non zero, indicating that everything is being copied back to host ok. v_dndx is non-zero for a CPU build or a Release build.

I am not familiar with these master element views and I am worried that somehow dndx asked by SCS_SHIFTED_GRAD_OP or SCS_GRAD_OP when doing dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES); is somehow not getting to GPU.

marchdf commented 1 month ago

One thing that worries me is that any of these tests for these master element views are ifdefed out on GPU: https://github.com/Exawind/nalu-wind/blob/master/unit_tests/UnitTestKokkosME.C#L296. It looks like that was done by you @alanw0 in https://github.com/Exawind/nalu-wind/commit/111af200c313a0db7187a560ae52f9787c1836e3. So maybe they are buggy? But I can't see why scs_areav would work but not dndx.

alanw0 commented 1 month ago

One thing that worries me is that any of these tests for these master element views are ifdefed out on GPU:

Memories are fuzzy, but as I recall we originally had a bunch of code that wouldn't work on GPU, and we gradually started converting it. At some point we did this ifdef'ing to guard the not-yet-converted code, and then we down-selected to only convert the code/algorithms that were specifically needed for the challenge problems, as time ran short.

alanw0 commented 1 month ago

I am not familiar with these master element views and I am worried that somehow dndx asked by SCS_SHIFTED_GRAD_OP or SCS_GRAD_OP when doing dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES); is somehow not getting to GPU.

I'm going to tag James Overfelt, who was the master-element expert. @overfelt

overfelt commented 1 month ago

I think these unit tests at one time compared the results of the old CPU only (Fortran based) master element functions to the newer CPU/GPU callable implementations. With the last refactor to get rid of all of the Fortran code these tests could be deleted.

marchdf commented 1 month ago

@overfelt thank you. Do you see anything in the implementation of SCS_SHIFTED_GRAD_OP and SCS_GRAD_OP that would explain the behavior I am seeing here: https://github.com/Exawind/nalu-wind/issues/1302#issuecomment-2371754112

overfelt commented 1 month ago

The code looks fine to me. It is odd that most of the computed arrays are correct and only one is all zeros. Do you happen to know the value of shiftPoisson_ in this case?

marchdf commented 1 month ago

know the value of shiftPoisson_ in this case

yes, it is true for one test, false for the other. Both fail.

marchdf commented 3 weeks ago

Fixed in #1317