Closed timfelle closed 6 hours ago
Nice, but one concern is how this would affect inlining, that was what killed #709
Well since this is just to avoid us manually writting something like the example bellow, I would not expect any differences in terms of inlining. But sure, there might be some compilers that only look 1 function deep to determine inlining. I am not exactly sure how they determine that.
if (NEKO_BCKND_DEVICE .eq. 1) then
call device_add2(.....)
else
call add2(.....)
end if
Nice, but one concern is how this would affect inlining, that was what killed #709
Well since this is just to avoid us manually writting something like the example bellow, I would not expect any differences in terms of inlining. But sure, there might be some compilers that only look 1 function deep to determine inlining. I am not exactly sure how they determine that.
if (NEKO_BCKND_DEVICE .eq. 1) then call device_add2(.....) else call add2(.....) end if
In the past the issue was more inline at the call site, eg we often have a loop calling multiple add2
, since math is rather pure, these would be inlined and the loops (from both add2
) fused. Would be good to check before we merge this at least
In the past the issue was more inline at the call site, eg we often have a loop calling multiple
add2
, since math is rather pure, these would be inlined and the loops (from bothadd2
) fused. Would be good to check before we merge this at least
Ahh sure thing. Yes i would not expect kernel fusing to function as well with this. I still think it can be worth merging, but we should probably write a note that it's only recommended for single operations and not for building large complexes that are performance critical.
Yes, this would certainly be a most welcome addition if the performance issues can be solved. I think there are quite a bit single-statement math calls in the code, so if fusion is the only problem, this would still be quite useful.
For user experience this is of course great. I would propose that we have something like this or even a generic idea like #709 to also cover none field data, and for performance we do as in the OpenMP branch, where we avoid calling math routines in performance critical parts (more or less entire core), and let the math stuff be more user facing.
Edit: apparently already mentioned here: https://github.com/ExtremeFLOW/neko/issues/706#issuecomment-2081631615
This would allow us to offer OpenMP parallelisation on CPUs for those who want (e.g. A64FX or Grace-Grace) users
For user experience this is of course great. I would propose that we have something like this or even a generic idea like #709 to also cover none field data, and for performance we do as in the OpenMP branch, where we avoid calling math routines in performance critical parts (more or less entire core), and let the math stuff be more user facing.
Edit: apparently already mentioned here: #706 (comment)
This would allow us to offer OpenMP parallelisation on CPUs for those who want (e.g. A64FX or Grace-Grace) users
Essentially one should be able to do a generic interface which wraps up math
, device_math
and field_math
modules. The only reason that is not part of this PR is the implicit casts from 4D to 1D arrays in the current math
module. I am aware this is needed for performance reasons, but the generic interface will fail to compile as it cannot determine which interface matches.
In general I think the CPU versions should be elemental functions. Particularly for most of these operations. Do we have any idea about the state of compilers for that? It would eliminate the commpilation issue as far as i understand it.
Introduction of a
field_math
library to capture the cpu and device operations.This is related to the old issue #706.
I did attempt the generic interface which should automatically select the cpu, field or device operation based on input types. However, due to our implicit flattening of arrays this did not work, without introducting cpu operations for 1, 2, 3 and 4 dimensions.
Additionally I used the new library to hide some of the
if (NEKO_BCKND_DEVICE .eq. 1)
statements and cleanup the code for the brinkman sourceterm and the mean field.