Moving to use -march=coreavx2 instead of xCORE-AVX2 with Intel Fortran?

GEOS-ESM / ESMA_cmake

Custom CMake macros for the GEOS Earth System Model

Apache License 2.0


Moving to use -march=coreavx2 instead of xCORE-AVX2 with Intel Fortran? #240

Closed mathomp4 closed 2 years ago

mathomp4 commented 2 years ago

With the proliferation of AMD EPYC (Rome) nodes at NAS, we might want to change the architecture flags for Intel Fortran in GEOS. Currently, we use -xCORE-AVX2 on Intel processors, but this has the problem that it uses instructions that don't exist on the AMD chips.

Instead, we can use -march=core-avx2, which should work on both Intel (Haswell and newer) and AMD Rome with no other changes needed. (I'm not sure whether results would be zero-diff between Intel and AMD, but the model should at least run. This needs to be tested at NAS.)

However, the switch is non-zero-diff relative to -xCORE-AVX2, and it could be slower if we are somehow crucially relying on one of the AVX2 instructions that only -xCORE-AVX2 uses. I'm doing some runs now to check for a performance hit.

mathomp4 commented 2 years ago

Here are some (ongoing) results.

These are 1-day runs of GEOSgcm on the Cascade Lake nodes at NCCS with no history and no checkpointing. I built each configuration as both Release and Aggressive; the numbers are model throughput in days/day.

Resolution	Release -xCORE-AVX2	Release -march=core-avx2
C360 L072	135.384	139.923
C360 L181	53.638	54.865
C720 L072	57.344	58.237
C720 L181	22.058	22.354

Resolution	Aggressive -xCORE-AVX2	Aggressive -march=core-avx2
C360 L072	154.028	159.308
C360 L181	60.027	62.031
C720 L072	65.070	66.079
C720 L181	24.852	25.392
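To make the comparison easier to read, here is a small Python sketch that turns the two tables above into relative throughput changes. The numbers are copied from the tables; the names `THROUGHPUT`, `percent_change`, and `summarize` are made up for this illustration and are not part of GEOS or ESMA_cmake.

```python
# Model throughput in days/day, copied from the tables above.
# Key: (build type, resolution); value: (-xCORE-AVX2, -march=core-avx2).
THROUGHPUT = {
    ("Release", "C360 L072"): (135.384, 139.923),
    ("Release", "C360 L181"): (53.638, 54.865),
    ("Release", "C720 L072"): (57.344, 58.237),
    ("Release", "C720 L181"): (22.058, 22.354),
    ("Aggressive", "C360 L072"): (154.028, 159.308),
    ("Aggressive", "C360 L181"): (60.027, 62.031),
    ("Aggressive", "C720 L072"): (65.070, 66.079),
    ("Aggressive", "C720 L181"): (24.852, 25.392),
}


def percent_change(xcore, march):
    """Throughput change of the -march build relative to the -xCORE build, in percent."""
    return 100.0 * (march - xcore) / xcore


def summarize(table):
    """Map each (build type, resolution) to its relative throughput change."""
    return {key: percent_change(*values) for key, values in table.items()}


if __name__ == "__main__":
    for (build, resolution), pct in summarize(THROUGHPUT).items():
        print(f"{build:<10} {resolution:<10} {pct:+.2f}%")
```

In every row the -march=core-avx2 build comes out slightly ahead, by roughly 1 to 3.5 percent. These are single 1-day runs, so differences of that size may well be within run-to-run noise.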

mathomp4 commented 2 years ago

Pending tests by @aoloso and me, I think we might recommend to @wmputman and @sdrabenh that we update the arch flag for Intel Fortran. Everything looks pretty good performance-wise.

mathomp4 commented 2 years ago

Tests at NAS have shown that if we use -march=core-avx2 then we gain quite a bit of "ease" with GEOS.

I built GEOSgcm using -march=core-avx2 once on pfe (Intel chip) and once on a Rome node (AMD chip). I then ran four experiments:

1. Build on Intel, Run on Intel
2. Build on AMD, Run on Intel
3. Build on Intel, Run on AMD
4. Build on AMD, Run on AMD

When all was done, 1 == 2 and 3 == 4. That is, no matter where you build, you can get the same answers on the same architecture.
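The comparison across the four experiments can be sketched in Python as follows. This is only an illustration under stated assumptions: it supposes each run has been reduced to a single digest string (for example, a hash of its output files), and the digest values below are placeholders chosen to mirror the reported outcome, not real data. The function names are hypothetical.

```python
from collections import defaultdict


def digests_by_run_arch(results):
    """Group result digests by the architecture the model ran on.

    `results` maps (build_arch, run_arch) to a digest string.
    """
    groups = defaultdict(set)
    for (_build_arch, run_arch), digest in results.items():
        groups[run_arch].add(digest)
    return dict(groups)


def build_host_independent(results):
    """True if, for each run architecture, every build host gave the same digest."""
    return all(len(d) == 1 for d in digests_by_run_arch(results).values())


def zero_diff_across_archs(results):
    """True only if every run, on every architecture, gave the same digest."""
    return len(set(results.values())) == 1


# Placeholder digests (hypothetical), arranged as reported: 1 == 2 and 3 == 4.
RESULTS = {
    ("intel", "intel"): "digest-from-intel-run",  # experiment 1
    ("amd", "intel"): "digest-from-intel-run",    # experiment 2
    ("intel", "amd"): "digest-from-amd-run",      # experiment 3
    ("amd", "amd"): "digest-from-amd-run",        # experiment 4
}

if __name__ == "__main__":
    print("build host independent:", build_host_independent(RESULTS))
    print("zero-diff across archs:", zero_diff_across_archs(RESULTS))
```

With these placeholder inputs the first check passes and the second fails, which is the pattern described above: where you build does not matter, but where you run does.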

Of course, a run on AMD will never be zero-diff to a run on Intel, but at least we have a "weak" form of equivalence. (Or "strong"? Maybe @tclune and I need to come up with the strong/weak version of "running on different architectures" 😄 )