P3 performance analysis

This issue documents some basic findings about P3's performance on the CPU and suggests action items for future performance work.

I was curious what the primary cost in the C++ P3 code is. It turns out to be the fall velocity calculation in rain sedimentation,

https://github.com/E3SM-Project/scream/blob/46ff6b3cdabd0b8e86d1e05ce89f63f5e51ec53b/components/scream/src/physics/p3/p3_rain_sed_impl.hpp#L111
and the equivalent in ice sedimentation.

In particular, while one might guess that the upwind impl could be slow, it is not: calc_first_order_upwind_step is < 4% of the total P3 cost. In contrast, the rain and ice fall velocity calculations together account for very roughly 80% of it.

Possible action items:

Profile using an Intel tool at the line level, starting with rain sedimentation. (1) Are there a few costly lines, e.g., a slow tgamma impl, or instead (2) is the cost per line fairly uniform over the whole velocity computation?
If (2), then try a few different modifications to the Mask implementation: different integer sizes for the mask slots; different implementations (e.g., ternary op vs. if) for the masked ops.
If those modifications produce no big change, profile with pack size 1 to see if that reveals anything.
Try a pack-free impl, using scalarize to produce 1D views of reals as inputs. This is a mask-intensive region of code, and the C++ compiler might not be handling it well.
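
To make the Mask experiments concrete, here is a minimal sketch of the two knobs mentioned above: the integer type of the mask slots and ternary vs. if for a masked assignment. ToyPack, ToyMask, masked_set_ternary, masked_set_if, and variants_agree are hypothetical names made up for illustration; they are not the actual pack/Mask types or functions in the repo, and the real code may differ substantially.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the real pack and mask types (illustration only).
template <typename T, std::size_t N>
struct ToyPack { std::array<T, N> d{}; };

// SlotT is the integer type of each mask slot: one of the knobs to vary.
template <typename SlotT, std::size_t N>
struct ToyMask { std::array<SlotT, N> m{}; };

// Variant A: masked assignment written with the ternary operator.
template <typename T, typename SlotT, std::size_t N>
void masked_set_ternary(ToyPack<T, N>& p, const ToyMask<SlotT, N>& mask, T val) {
  for (std::size_t i = 0; i < N; ++i)
    p.d[i] = mask.m[i] ? val : p.d[i];
}

// Variant B: the same masked assignment written with an if.
template <typename T, typename SlotT, std::size_t N>
void masked_set_if(ToyPack<T, N>& p, const ToyMask<SlotT, N>& mask, T val) {
  for (std::size_t i = 0; i < N; ++i)
    if (mask.m[i]) p.d[i] = val;
}

// Sanity check: both variants must produce identical results for a given
// slot type before their timings or generated code are compared.
template <typename SlotT>
bool variants_agree() {
  constexpr std::size_t N = 8;
  ToyPack<double, N> a, b;
  ToyMask<SlotT, N> mask;
  for (std::size_t i = 0; i < N; ++i) {
    a.d[i] = b.d[i] = static_cast<double>(i);
    mask.m[i] = static_cast<SlotT>(i % 2 == 0);  // set even slots only
  }
  masked_set_ternary(a, mask, -1.0);
  masked_set_if(b, mask, -1.0);
  return a.d == b.d;
}
```

Instantiating variants_agree with, e.g., std::uint8_t, int, and std::int64_t covers the slot-size variations; each variant could then be timed separately or inspected in the profiler to see whether the compiler vectorizes one form and not the other.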

P3 performance analysis #1722