ESMCI / ccs_config_cesm

CESM CIME Case Control System configuration files

Turn off FMA option for the intel and nvhpc compiler #121

Closed sjsprecious closed 11 months ago

sjsprecious commented 11 months ago

This PR turns off the FMA (fused multiply-add) option by default for the intel and nvhpc compilers. Based on the ensemble consistency test (ECT), turning FMA on is likely to produce a statistically different climatology.

This ensures that the ECT passes when test simulations built with the intel/2023 compiler on Derecho are compared against the baseline generated on Cheyenne with intel/19.1.1. See further discussion here: https://github.com/ESCOMP/CAM/issues/883.
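For readers unfamiliar with why FMA can change answers at all, here is a minimal sketch in Python (not taken from CESM or from either compiler). A fused multiply-add computes a*b + c with a single rounding, whereas separate multiply and add instructions round twice. The helper `fma_emulated` is a hypothetical name introduced only for this illustration; it emulates the single rounding with exact rational arithmetic instead of calling a hardware instruction, and the input values are chosen by hand to expose the difference.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulate a fused multiply-add: form a*b + c exactly, then round once."""
    # Fraction(float) is exact, so the product and sum carry no rounding error.
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    # Converting back to float performs the single, final rounding.
    return float(exact)


# Hand-picked inputs: a*a = 1 + 2**-26 + 2**-54 exactly, and the 2**-54 term
# is lost when the product is rounded to double precision on its own.
a = 1.0 + 2.0**-27
c = -(1.0 + 2.0**-26)

unfused = a * a + c            # product rounded first, then the sum rounded
fused = fma_emulated(a, a, c)  # one rounding at the very end

print("unfused:", unfused)
print("fused:  ", fused)
```

With these inputs the unfused result is exactly 0.0 while the fused result is 2**-54, so the two differ in the last bits. Differences of this size are harmless in isolation, but in a chaotic model they grow over time, which is consistent with the ECT flagging FMA-enabled builds as a different climatology.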

sjsprecious commented 11 months ago

I guess the options like "--host=cray" and "-march=core-avx2" are specific to Derecho (Cray machine and AMD CPU)?

jedwards4b commented 11 months ago

Can I push that change back to your PR? Move no-fma flag to generic intel.cmake file.

sjsprecious commented 11 months ago

Oh, sorry, I misunderstood your point. You mean we should turn off FMA whenever we use the intel compiler on any machine? That is fine with me, but I am not sure whether other people want to leave it on for their machines (e.g., people who do not use ECT for verification).

sjsprecious commented 11 months ago

Machines like constance and izumi all turn on FMA by default. Moving -no-fma to the intel.cmake file will break the regression test on those machines and it may not be desirable.

jedwards4b commented 11 months ago

Are we running ect on those systems? Seems like the ect test confirms that we should not be using fma on any system.

sjsprecious commented 11 months ago

I agree that, based on ECT, we should not use FMA on any system (intel, nvhpc, etc.). I do not have access to those systems, but I assume that they run some regression tests like aux_cam for BFB (bit-for-bit) comparison. If we add -no-fma to the generic file, their tests will fail, and they may not want that if they do not trust ECT like we do. This is my concern.

briandobbins commented 11 months ago

If it fails the ECT with FMA on, I'd be inclined to think it should be off everywhere, no? Not just the systems where we run the ECT. Failing means it's a statistically different climatology.

(In reply to Jian Sun's comment of Fri, Sep 15, 2023, 12:04 PM.)

jedwards4b commented 11 months ago

We will mark the PR as answer-changing so that people understand that baseline comparisons may fail.

sjsprecious commented 11 months ago

Thanks @jedwards4b and @briandobbins for your comments. That sounds good to me.

My last question is how we can make sure that -no-fma is added as the last option. Each system applies its own specific flags, and I do not want the -no-fma flag to be overridden by other optimization flags.

sjsprecious commented 11 months ago

A quick test on Derecho with -no-fma moved from the machine-specific file to the generic intel file passes the ECT. I will therefore go ahead with what @jedwards4b and @briandobbins suggested here.

If anyone finds an issue with these changes, they can always modify the flags for their own system.

fischer-ncar commented 11 months ago

It looks like the FMA option was also causing my ERP/PEM/ERC tests to fail. I still need to do more ERP/PEM/ERC testing.

jedwards4b commented 11 months ago

That totally makes sense since changing task count with fma enabled will change answers.

sjsprecious commented 11 months ago

@jedwards4b can you explain more about why changing task count with FMA enabled will change answers? I thought changing task count just affected the domain decomposition.

jedwards4b commented 11 months ago

I misspoke there - I was thinking about vector math and how using different pelayouts changes the length of the vectors - but that doesn't have anything to do with FMA.

sjsprecious commented 11 months ago

Thanks Jim for your clarification. I assumed Chris was doing the ERP test on Derecho and my understanding was that its failure should not be caused by the FMA option. But it seems that turning off FMA on Derecho passes the ERP test somehow?

jedwards4b commented 11 months ago

Maybe we should confirm that with another test?

sjsprecious commented 11 months ago

Agreed, and I think Chris is already working on it. I will let Chris add the details here in case I have misunderstood something.

fischer-ncar commented 11 months ago

I reran the prealpha tests over the weekend. The following tests were all failing before FMA was turned off.

PASS ERP_Ld3_Vnuopc.f09_f09_mg17.FCfireHIST.derecho_intel.cam-outfrq1d COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F1850.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F2000dev.derecho_intel.cam-outfrq9s_mg3 COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.F2010climo.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.FHIST_BGC.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.FHIST.derecho_intel.cam-outfrq9s COMPARE_base_rest
PASS ERP_Ln9_Vnuopc.f09_f09_mg17.FTJ16.derecho_intel.cam-outfrq9s COMPARE_base_rest

sjsprecious commented 11 months ago

Thanks Chris for posting these new results. So turning off FMA does help pass the ERP tests on Derecho somehow.

Are there any other types of tests besides ERP that now pass because FMA is disabled?