In particular, while one might guess that the upwind impl is slow, it is not: calc_first_order_upwind_step is less than 4% of the total P3 cost. In contrast, the rain and ice fall velocity calculations account for very roughly 80%.
Possible action items:
Profile using an Intel tool at the line level, starting with rain sedimentation. (1) Are there a few costly lines, e.g., a slow tgamma impl, or instead (2) is the cost per line fairly uniform over the whole velocity computation?
If (2), try a few different modifications to the Mask implementation: different integer sizes for the mask slots, and different implementations (e.g., ternary op vs. if) for the masked ops.
If neither makes a big difference, profile with pack size 1 to see if that reveals anything.
Try a pack-free impl, using scalarize to produce 1D views of reals as inputs. This is a mask-intensive region of code, and the C++ compiler might not be able to handle it well.
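To make the Mask and pack-free items concrete, here is a minimal sketch of the variants one might benchmark against each other. Everything in it is a hypothetical stand-in: `Pack`, `Mask`, `greater_than`, `toy_velocity`, `kPackSize`, and `kQSmall` are made up for illustration and are not the types or formulas used in P3. Also, `flatten` copies the data to keep the example simple, whereas a real scalarize would presumably give a 1D view of the same storage without a copy.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-ins for illustration; NOT the actual P3 Pack/Mask types.
constexpr int kPackSize = 4;      // assumed pack width
constexpr double kQSmall = 1e-8;  // assumed "too small to matter" threshold

struct Pack {
  std::array<double, kPackSize> d{};
};

// The slot type is a template parameter so that different integer sizes
// for the mask slots can be compared.
template <typename SlotT>
struct Mask {
  std::array<SlotT, kPackSize> m{};
};
using NarrowSlot = std::int8_t;
using WideSlot = std::int32_t;

template <typename SlotT>
Mask<SlotT> greater_than(const Pack& p, double threshold) {
  Mask<SlotT> mask;
  for (int i = 0; i < kPackSize; ++i)
    mask.m[i] = static_cast<SlotT>(p.d[i] > threshold);
  return mask;
}

// Made-up formula standing in for a fall velocity expression.
inline double toy_velocity(double q) { return 2.0 * std::sqrt(q); }

// Variant A: masked assignment written with an if.
template <typename SlotT>
void velocity_if(const Pack& q, Pack& v) {
  const Mask<SlotT> mask = greater_than<SlotT>(q, kQSmall);
  for (int i = 0; i < kPackSize; ++i) {
    if (mask.m[i]) v.d[i] = toy_velocity(q.d[i]);
  }
}

// Variant B: masked assignment written with a ternary. The candidate value
// is computed on every lane and then selected.
template <typename SlotT>
void velocity_ternary(const Pack& q, Pack& v) {
  const Mask<SlotT> mask = greater_than<SlotT>(q, kQSmall);
  for (int i = 0; i < kPackSize; ++i) {
    const double candidate = toy_velocity(q.d[i]);
    v.d[i] = mask.m[i] ? candidate : v.d[i];
  }
}

// Copying stand-in for scalarize: packs in, flat 1D array of reals out.
inline std::vector<double> flatten(const std::vector<Pack>& packs) {
  std::vector<double> flat;
  flat.reserve(packs.size() * kPackSize);
  for (const Pack& p : packs)
    for (double x : p.d) flat.push_back(x);
  return flat;
}

// Pack-free variant: a plain scalar loop with an ordinary branch and no Mask.
inline void velocity_scalar(const std::vector<double>& q,
                            std::vector<double>& v) {
  assert(q.size() == v.size());
  for (std::size_t k = 0; k < q.size(); ++k) {
    if (q[k] > kQSmall) v[k] = toy_velocity(q[k]);
  }
}
```

All three variants should produce identical output for the same input, so a timing harness could swap between, say, `velocity_if<WideSlot>`, `velocity_ternary<NarrowSlot>`, and `velocity_scalar` and compare only the cost.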
This issue documents some basic findings about P3's performance on the CPU and suggests action items for future performance work.
I was curious what the primary cost in the C++ P3 code is. It turns out to be the rain and ice fall velocity calculations, at very roughly 80% of the total.