Closed marchdf closed 3 weeks ago
Of course this looks fine with a Release build... The above was a Debug build.
Marc, my initial thought was that there is a missing 'field.sync_to_host()' somewhere, which would explain why the 'hostCalcValue' might be zero when it is expected to be non-zero.
But I certainly don't know why it would be different between debug and release...
At the point where it is checking those values, you could add EXPECT_FALSE(field.need_sync_to_host());
(whatever the actual 'field' variable is).
Recall the usage pattern for these fields should always be to sync it to the current/desired memory space before accessing, and if writing to it, call field.modify_on_device()/modify_on_host().
The 'smart field' idea that Phil was deploying would automate that pattern, but I don't think he got that deployed everywhere yet.
Hi Alan, thanks for getting back to me so quickly. I will add the check you suggest.
@alanw0 I tried implementing the check but the places where this breaks don't have a need_sync_to_host()
member function.
The problem really is that v_dndx = 0
in this expression lhsfac += v_dndx(ip, ic, j) * v_scs_areav(ip, j);
(here). If I just hard code that to some value (or just leave v_scs_areav
) then the hostCalcValue
is non zero, indicating that everything is being copied back to host ok. v_dndx
is non-zero for a CPU build or a Release build.
I am not familiar with these master element views and I am worried that somehow dndx
asked by SCS_SHIFTED_GRAD_OP or SCS_GRAD_OP when doing dataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES);
is somehow not getting to GPU.
One thing that worries me is that any of these tests for these master element views are ifdefed out on GPU: https://github.com/Exawind/nalu-wind/blob/master/unit_tests/UnitTestKokkosME.C#L296. It looks like that was done by you @alanw0 in https://github.com/Exawind/nalu-wind/commit/111af200c313a0db7187a560ae52f9787c1836e3. So maybe they are buggy? But I can't see why scs_areav
would work but not dndx
.
One thing that worries me is that any of these tests for these master element views are ifdefed out on GPU:
Memories are fuzzy, but as I recall we originally had a bunch of code that wouldn't work on GPU, and we gradually started converting it. At some point we did this ifdef'ing to guard the not-yet-converted code, and then we down-selected to only convert the code/algorithms that were specifically needed for the challenge problems, as time ran short.
I am not familiar with these master element views and I am worried that somehow
dndx
asked by SCS_SHIFTED_GRAD_OP or SCS_GRAD_OP when doingdataPreReqs.add_master_element_call(gradOp, CURRENT_COORDINATES);
is somehow not getting to GPU.
I'm going to tag James Overfelt, who was the master-element expert. @overfelt
I think these unit tests at one time compared the results of the old CPU only (Fortran based) master element functions to the newer CPU/GPU callable implementations. With the last refactor to get rid of all of the Fortran code these tests could be deleted.
@overfelt thank you. Do you see anything in the implementation of SCS_SHIFTED_GRAD_OP and SCS_GRAD_OP that would explain the behavior I am seeing here: https://github.com/Exawind/nalu-wind/issues/1302#issuecomment-2371754112
The code looks fine to me. It is odd that most of the computed arrays are correct and only one is all zeros. Do you happen to know the value of shiftPoisson_
in this case?
know the value of shiftPoisson_ in this case
yes, it is true for one test, false for the other. Both fail.
Fixed in #1317
I am having an issue with the following unit tests
./unittestX --gtest_filter="WallDistKernelHex8Mesh.NGP_wall_dist*"
They are throwing:
In digging further it appears that
v_dndx(ip, ic, j)
here is zero (v_scs_areav(ip, j)
is non-zero). Why is that?v_dndx
comes fromconst auto& v_dndx = shiftPoisson_ ? meViews.dndx_shifted : meViews.dndx;
, which should be defined byBut why would those be zero? This test seems fine on CPU.
I did fix a (minor) thing in the way the
expect_all_near
was being called. But that's not the issue. See PR #1301.Tagging @psakievich, @rcknaus and @alanw0 as people who might know what's going on.