ExtremeFLOW / neko

/ᐠ. 。.ᐟ\ᵐᵉᵒʷˎˊ˗
https://neko.cfd/

Feature/math_field #1319

Closed timfelle closed 6 hours ago

timfelle commented 6 days ago

Introduction of a field_math library to capture the CPU and device operations.

This is related to the old issue #706.

I did attempt a generic interface that automatically selects the CPU, field or device operation based on the input types. However, due to our implicit flattening of arrays, this did not work without introducing CPU operations for 1, 2, 3 and 4 dimensions.
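To illustrate what I mean by the flattening (a simplified sketch, not the actual code; rp is the usual kind from num_types):

! The rank-1 CPU kernel in the math module looks roughly like this:
subroutine add2(a, b, n)
  integer, intent(in) :: n
  real(kind=rp), dimension(n), intent(inout) :: a
  real(kind=rp), dimension(n), intent(in) :: b
  ...
end subroutine add2

! while a typical call site passes the raw rank-4 field data, which sequence
! association implicitly flattens to rank 1:
call add2(u%x, v%x, n)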

Additionally, I used the new library to hide some of the if (NEKO_BCKND_DEVICE .eq. 1) statements and to clean up the code for the Brinkman source term and the mean field.
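As a rough sketch of the kind of wrapper the library provides (simplified; the exact names, the field_t accessors such as %x, %x_d and size(), and the signatures may differ from what is actually in the PR):

module field_math
  use neko_config, only : NEKO_BCKND_DEVICE
  use field, only : field_t
  use math, only : add2
  use device_math, only : device_add2
  implicit none
  private
  public :: field_add2

contains

  !> a = a + b for fields, dispatching to the CPU or device kernel.
  subroutine field_add2(a, b)
    type(field_t), intent(inout) :: a
    type(field_t), intent(in) :: b

    if (NEKO_BCKND_DEVICE .eq. 1) then
       call device_add2(a%x_d, b%x_d, a%size())
    else
       call add2(a%x, b%x, a%size())
    end if
  end subroutine field_add2

end module field_math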

timfelle commented 6 days ago

> Nice, but one concern is how this would affect inlining; that was what killed #709

Well, since this is just to avoid us manually writing something like the example below, I would not expect any differences in terms of inlining. But sure, there might be some compilers that only look one function deep to determine inlining; I am not exactly sure how they determine that.

if (NEKO_BCKND_DEVICE .eq. 1) then
    call device_add2(.....)
else
    call add2(.....)
end if
njansson commented 6 days ago

> > Nice, but one concern is how this would affect inlining; that was what killed #709
>
> Well, since this is just to avoid us manually writing something like the example below, I would not expect any differences in terms of inlining. But sure, there might be some compilers that only look one function deep to determine inlining; I am not exactly sure how they determine that.
>
> if (NEKO_BCKND_DEVICE .eq. 1) then
>   call device_add2(.....)
> else
>   call add2(.....)
> end if

In the past the issue was more about inlining at the call site. E.g. we often have a loop calling multiple add2; since math is rather pure, these would be inlined and the loops (from both add2 calls) fused. It would be good to check this before we merge, at least.
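Something like the following, as a simplified sketch (col2, i.e. pointwise multiplication, is only used as an example of a second kernel):

! As written in a performance-critical routine:
call add2(a, b, n)   ! a = a + b
call col2(a, c, n)   ! a = a * c

! What an inlining compiler can turn this into: a single fused pass over a.
do i = 1, n
   a(i) = (a(i) + b(i)) * c(i)
end do

! With the backend branch hidden inside a wrapper, this cross-call fusion is
! less likely to happen.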

timfelle commented 6 days ago

> In the past the issue was more about inlining at the call site. E.g. we often have a loop calling multiple add2; since math is rather pure, these would be inlined and the loops (from both add2 calls) fused. It would be good to check this before we merge, at least.

Ahh, sure thing. Yes, I would not expect kernel fusion to work as well with this. I still think it can be worth merging, but we should probably add a note that it is only recommended for single operations and not for building large, performance-critical compound operations.

timofeymukha commented 6 days ago

Yes, this would certainly be a most welcome addition if the performance issues can be solved. I think there are quite a few single-statement math calls in the code, so if fusion is the only problem, this would still be quite useful.

njansson commented 6 days ago

For user experience this is of course great. I would propose that we have something like this, or even a generic approach like #709 that also covers non-field data. For performance, we do as in the OpenMP branch, where we avoid calling math routines in performance-critical parts (more or less the entire core) and let the math stuff be more user facing.

Edit: apparently already mentioned here: https://github.com/ExtremeFLOW/neko/issues/706#issuecomment-2081631615

This would allow us to offer OpenMP parallelisation on CPUs for those users who want it (e.g. on A64FX or Grace-Grace).
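Roughly along these lines, as a simplified sketch (not the actual code from the OpenMP branch; rp as in num_types):

! Hypothetical OpenMP variant of the rank-1 add2 kernel.
subroutine add2_omp(a, b, n)
  integer, intent(in) :: n
  real(kind=rp), dimension(n), intent(inout) :: a
  real(kind=rp), dimension(n), intent(in) :: b
  integer :: i

  !$omp parallel do schedule(static)
  do i = 1, n
     a(i) = a(i) + b(i)
  end do
  !$omp end parallel do
end subroutine add2_omp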

timfelle commented 6 days ago

> For user experience this is of course great. I would propose that we have something like this, or even a generic approach like #709 that also covers non-field data. For performance, we do as in the OpenMP branch, where we avoid calling math routines in performance-critical parts (more or less the entire core) and let the math stuff be more user facing.
>
> Edit: apparently already mentioned here: #706 (comment)
>
> This would allow us to offer OpenMP parallelisation on CPUs for those users who want it (e.g. on A64FX or Grace-Grace).

Essentially, one should be able to write a generic interface which wraps up the math, device_math and field_math modules. The only reason that is not part of this PR is the implicit cast from 4D to 1D arrays in the current math module. I am aware this is needed for performance reasons, but the generic interface fails to compile because the compiler cannot determine which specific matches.
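Roughly what I have in mind, as a sketch (the specific names below are placeholders):

! Hypothetical generic interface combining the three backends.
interface add2
   module procedure add2_cpu      ! rank-1 real(kind=rp) arrays, as in math
   module procedure add2_field    ! type(field_t) arguments
   module procedure add2_device   ! type(c_ptr) device pointers
end interface add2

! Existing call sites pass rank-4 field data to the rank-1 CPU kernel and
! rely on sequence association:
!    call add2(u%x, v%x, n)
! Generic resolution matches on type, kind and rank, so none of the specifics
! above accepts a rank-4 actual argument; rank-2/3/4 CPU specifics would be
! needed to make this compile.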

In general, I think the CPU versions should be elemental procedures, particularly for most of these operations. It would eliminate the compilation issue as far as I understand it. Do we have any idea about the state of compiler support for that?
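As a sketch of what I mean (hypothetical name; rp as in num_types):

! Elemental form of the CPU kernel: no length argument, no explicit loop, and
! the same specific applies to arguments of any (conformable) rank.
elemental subroutine add2_elem(a, b)
  real(kind=rp), intent(inout) :: a
  real(kind=rp), intent(in) :: b
  a = a + b
end subroutine add2_elem

! Works directly on the rank-4 field data or on rank-1 views alike:
!    call add2_elem(u%x, v%x)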