E3SM-Project / scream

Exascale global atmosphere model written in C++ as part of the E3SM project
https://e3sm-project.github.io/scream/

P3 performance analysis #1722

Open ambrad opened 2 years ago

ambrad commented 2 years ago

This issue documents some basic findings about P3's performance on the CPU and suggests action items for future performance work.

I was curious what the primary cost in the C++ P3 code is. It turns out to be

  1. https://github.com/E3SM-Project/scream/blob/46ff6b3cdabd0b8e86d1e05ce89f63f5e51ec53b/components/scream/src/physics/p3/p3_rain_sed_impl.hpp#L111
  2. the equivalent in ice sedimentation.

In particular, while one might guess that the upwind implementation could be slow, it is not: calc_first_order_upwind_step is < 4% of the total P3 cost. In contrast, the rain and ice fall velocity calculations account for very roughly 80%.

Possible action items:

  1. Profile using an Intel tool at the line level, starting with rain sedimentation. (1) Are there a few costly lines, e.g., a slow tgamma impl, or instead (2) is the cost per line fairly uniform over the whole velocity computation?
  2. If (2), try a few different modifications to the Mask implementation: different integer sizes for the mask slots; different implementations (e.g. ternary op vs if) for the masked ops.
  3. If that produces no big change, profile with pack size 1 to see if that reveals anything.
  4. Try a pack-free implementation, using scalarize to produce 1D views of reals as inputs. This is a mask-intensive region of code, and the C++ compiler might not be able to handle it well.
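
To make items 2 and 4 concrete, here is a minimal, self-contained sketch of the variants one might compare. Everything in it is a hypothetical stand-in: `Pack`, `Mask`, `toy_velocity`, and the `fall_velocity_*` functions are not the scream/EKAT types or the real P3 formulas, and `std::vector` stands in for the 1D views that scalarize would produce. It shows only the shape of the experiment: a mask whose slot type is a template parameter, a masked assignment written once with a ternary and once with an `if`, and a pack-free loop over plain reals.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for illustration only; not the scream/EKAT types.
constexpr int N = 8;  // pack size (assumption)

struct Pack {
  std::array<double, N> d{};  // zero-initialized
};

// The slot type is a template parameter so that different integer
// widths (e.g. unsigned char, int, long) can be timed against each other.
template <typename Slot>
struct Mask {
  std::array<Slot, N> m{};
};

template <typename Slot>
Mask<Slot> greater_than(const Pack& a, double threshold) {
  Mask<Slot> out;
  for (int i = 0; i < N; ++i) out.m[i] = static_cast<Slot>(a.d[i] > threshold);
  return out;
}

// Variant A: masked assignment written with a ternary (every lane is stored).
template <typename Slot>
void set_ternary(Pack& dst, const Mask<Slot>& mask, const Pack& src) {
  for (int i = 0; i < N; ++i) dst.d[i] = mask.m[i] ? src.d[i] : dst.d[i];
}

// Variant B: masked assignment written with a branch (only true lanes are stored).
template <typename Slot>
void set_branch(Pack& dst, const Mask<Slot>& mask, const Pack& src) {
  for (int i = 0; i < N; ++i) {
    if (mask.m[i]) dst.d[i] = src.d[i];
  }
}

// Placeholder for an expensive per-cell formula; NOT the P3 fall velocity.
// Assumes q >= 0.
inline double toy_velocity(double q) {
  return std::tgamma(1.0 + q) * std::sqrt(q);
}

// Packed version: compute on every lane, then keep results only where q > qsmall.
template <typename Slot>
Pack fall_velocity_packed(const Pack& q, double qsmall, bool use_ternary) {
  const Mask<Slot> mask = greater_than<Slot>(q, qsmall);
  Pack computed;
  for (int i = 0; i < N; ++i) computed.d[i] = toy_velocity(q.d[i]);
  Pack v;  // lanes outside the mask stay 0
  if (use_ternary) {
    set_ternary(v, mask, computed);
  } else {
    set_branch(v, mask, computed);
  }
  return v;
}

// Pack-free version: a plain loop over scalars with an ordinary conditional.
void fall_velocity_scalar(const std::vector<double>& q, double qsmall,
                          std::vector<double>& v) {
  assert(q.size() == v.size());
  for (std::size_t k = 0; k < q.size(); ++k) {
    v[k] = q[k] > qsmall ? toy_velocity(q[k]) : 0.0;
  }
}
```

All three paths should give the same values, so any timing difference between, say, `fall_velocity_packed<unsigned char>(q, qsmall, true)`, `fall_velocity_packed<long>(q, qsmall, false)`, and `fall_velocity_scalar` reflects how the compiler handles each form and not a change in the arithmetic. Note that the packed version evaluates the expensive formula on every lane, including masked-off ones, while the scalar version skips them. That difference is one reason a pack-free implementation could behave differently in a mask-intensive region.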

PeterCaldwell commented 1 year ago

@hr203 - this might be a good topic for you!