GEOS-ESM / ESMA_cmake

Custom CMake macros for the GEOS Earth System Model
Apache License 2.0

Moving to use -march=core-avx2 instead of -xCORE-AVX2 with Intel Fortran? #240

Closed mathomp4 closed 2 years ago

mathomp4 commented 2 years ago

With the growing number of AMD EPYC (Rome) nodes at NAS, we might want to change the architecture flags for Intel Fortran in GEOS. Currently, we use -xCORE-AVX2 on Intel processors, but this has the problem that it uses instructions that don't exist on the AMD chips.

On AMD we can use -march=core-avx2, and this will work on both Intel (Haswell and newer) and AMD Rome with no changes needed. (I'm not sure whether the results would be zero-diff between Intel and AMD, but the builds should run. This needs to be tested at NAS.)

However, it is non-zero-diff relative to -xCORE-AVX2, and possibly slower if we somehow crucially depend on one of the instructions that only -xCORE-AVX2 generates. I'm doing some runs now to see whether there is a performance hit.

mathomp4 commented 2 years ago

Here are some (ongoing) results.

These are 1-day runs of GEOSgcm on the Cascade Lake nodes at NCCS with no history and no checkpointing. I built each flag variant as both Release and Aggressive; the numbers are model throughput in days/day.

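| Resolution | Release, -xCORE-AVX2 | Release, -march=core-avx2 |
|---|---|---|
| C360 L072 | 135.384 | 139.923 |
| C360 L181 | 53.638 | 54.865 |
| C720 L072 | 57.344 | 58.237 |
| C720 L181 | 22.058 | 22.354 |

| Resolution | Aggressive, -xCORE-AVX2 | Aggressive, -march=core-avx2 |
|---|---|---|
| C360 L072 | 154.028 | 159.308 |
| C360 L181 | 60.027 | 62.031 |
| C720 L072 | 65.070 | 66.079 |
| C720 L181 | 24.852 | 25.392 |
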
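
As a rough reading aid for the throughput numbers above, the relative difference between the two flags can be computed directly. This is only a sketch: the values are copied from the results in this comment, while the dictionary layout and the helper name `percent_change` are illustrative choices, not part of GEOS or ESMA_cmake.

```python
# Model throughput in days/day, copied from the results above.
# Keys: (build type, resolution); values: (-xCORE-AVX2, -march=core-avx2).
throughput = {
    ("Release", "C360 L072"): (135.384, 139.923),
    ("Release", "C360 L181"): (53.638, 54.865),
    ("Release", "C720 L072"): (57.344, 58.237),
    ("Release", "C720 L181"): (22.058, 22.354),
    ("Aggressive", "C360 L072"): (154.028, 159.308),
    ("Aggressive", "C360 L181"): (60.027, 62.031),
    ("Aggressive", "C720 L072"): (65.070, 66.079),
    ("Aggressive", "C720 L181"): (24.852, 25.392),
}


def percent_change(baseline, candidate):
    """Relative change of candidate with respect to baseline, in percent."""
    return 100.0 * (candidate - baseline) / baseline


# Positive values mean the -march=core-avx2 build had higher throughput.
changes = {key: percent_change(x, m) for key, (x, m) in throughput.items()}

for (build, resolution), pct in changes.items():
    print(f"{build:<10} {resolution}  {pct:+.2f}%")
```

In these runs the -march=core-avx2 build comes out between roughly 1% and 3.5% ahead in every case. Each figure is from a single 1-day run, so differences of this size may be within run-to-run noise.
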
mathomp4 commented 2 years ago

Pending tests by @aoloso and myself, I think we might recommend to @wmputman and @sdrabenh that we update the arch flag for Intel Fortran. Everything seems pretty good performance-wise.

mathomp4 commented 2 years ago

Tests at NAS have shown that if we use -march=core-avx2 then we gain quite a bit of "ease" with GEOS.

I built GEOSgcm using -march=core-avx2 once on pfe (Intel chip) and once on a Rome node (AMD chip). I then made four experiments:

  1. Build on Intel, Run on Intel
  2. Build on AMD, Run on Intel
  3. Build on Intel, Run on AMD
  4. Build on AMD, Run on AMD

When all was done, 1 == 2 and 3 == 4. That is, no matter where you build, you can get the same answers on the same architecture.
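
One way to check pairs like 1 == 2 and 3 == 4 is to compare the output directories of two experiments byte for byte. The sketch below assumes that "zero-diff" means identical file names with identical contents and that the output files embed no timestamps or other run-specific metadata. The function names are hypothetical, and this is not necessarily how the comparison was done for these tests.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def digests(root):
    """Map each file's path relative to root onto its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def zero_diff(dir_a, dir_b):
    """True if both directories hold the same files with identical bytes."""
    return digests(dir_a) == digests(dir_b)
```

With hypothetical directory names, `zero_diff("build_intel_run_intel", "build_amd_run_intel")` would correspond to the 1 == 2 check, and the same call on the two AMD runs to 3 == 4.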

Of course, a run on AMD will never be zero-diff to a run on Intel, but at least we have a "weak" form of equivalence. (Or "strong"? Maybe @tclune and I need to come up with the strong/weak version of "running on different architectures" 😄 )