E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
352 stars 360 forks source link

BUILD_THREADED == TRUE breaks CAM when no threading in ATM #1382

Closed worleyph closed 7 years ago

worleyph commented 7 years ago

a) A_ or F case b) env_mach_pes.xml with NTHRDS==1 for all components c) BUILD_THREADED == TRUE in env_build.xml d) runtime error in ATM:

  dyn_init1: number of OpenMP threads =             1
  ERROR: Error: threaded runs require -DHORIZ_OPENMP

e) Relevant code (in dyn_comp.F90)

 #ifdef _OPENMP
 !   Set by driver
     nthreads = omp_get_max_threads()
     if(par%masterproc) then
        write(iulog,*) " "
        write(iulog,*) "dyn_init1: number of OpenMP threads = ", nthreads
        write(iulog,*) " "
     endif
 #ifndef HORIZ_OPENMP
     call endrun('Error: threaded runs require -DHORIZ_OPENMP')
 #endif

So, setting BUILD_THREADED to TRUE does not also add -DHORIZ_OPENMP to the CAM CPPDEFS, and the code then breaks when CAM is in fact not threaded (NTHRD == 1).

Also tried ./case.setup -c; ./case.setup, ./case.build --clean atm, ./case.build and nothing changed.

amametjanov commented 7 years ago

Could you give a use-case-scenario when it'd be good to have BUILD_THREADED==TRUE but all components' NTHRDS==1? I think it's best to control whether a component is built threaded or not through the condition NTHRDS==1 or NTHRDS>1, rather than through BUILD_THREADED variable. The overhead of threading is not justified with 1 thread.

worleyph commented 7 years ago

This situation arose as part of issue #1380 - I did not do this deliberately. That said, the error will also occur when only CAM has NTHRDS==1, not all components, when BUILD_THREADED==TRUE.

We have used BUILD_THREADED == TRUE for compiler debugging, and every once in awhile the code will not work without threading support and will when built with threading support.

worleyph commented 7 years ago

Could you just add

 #ifndef HORIZ_OPENMP
       if (nthreads > 1) then
           call endrun('Error: threaded runs require -DHORIZ_OPENMP')
       endif
 #endif

? While it may not be useful to build with threading support when only using one thread, I see no reason to kill the code in this case.

amametjanov commented 7 years ago

Could you make this change and see if your runs complete without any errors? I think there might be other remaining places with #ifdef _OPENMP that might cause issues, because NTHRDS is 1. If there are no errors, then I'll add it to git master and we'll have a distinction between "single-threaded" and "pure-MPI". :)

worleyph commented 7 years ago

@amametjanov , this fix worked fine for me (and no other changes were needed). My test was

 -compset FC5AV1C-L -res ne16_ne16 -compiler intel

with 64x1 stacked PE layout (the default) on Titan. Tried both BUILD_THREADED == TRUE and BUILD_THREADED == FALSE, and atm.log (through day one) was BFB.

amametjanov commented 7 years ago

Previously, if NTHRDS>1 or BUILD_THREADED=TRUE, then $ENV{'SMP'}=TRUE. If $ENV{'SMP'}=TRUE, then CPP macros were appended -DHORIZ_OPENMP. So this looks like a CIME regression, where SMP is not set when forcing a threaded build.

Re-assigning to @jgfouca, because of the related issue #1380.

worleyph commented 7 years ago

@amametjanov , when did the CPP macros get appended? Does this require ./case.setup -c; ./case.setup if env_build.xml is changed, or does ./case.build check this and do the right thing?

amametjanov commented 7 years ago

Here: https://github.com/ACME-Climate/ACME/blob/master/components/cam/bld/configure#L1968 . I think case.setup -r is required for settings to take effect.

worleyph commented 7 years ago

Unfortunate that this is in env_build.xml then. The comments in env_build.xml imply that the only sensitivity for env_build.xml variables is whether the model has been built or not.

worleyph commented 7 years ago

@amametjanov , I think it might be best to add my fix to the code and not to change the CIME logic. You only need -DHORIZ_OPENMP when nthreads > 1 for ATM (with this fix). If you change env_mach_pes.xml, you have to ./case.setup -r (or -c or whatever) and this will change as needed. But if you change BUILD_THREADED in env_build.xml you shouldn't have to do this.

worleyph commented 7 years ago

Note that changing BUILD_THREADED from FALSE to TRUE and then ./case.build --clean lnd followed by ./case.build, -openmp is added to the compile line for the LND build. So ./case.setup -r is not required.

jgfouca commented 7 years ago

I believe my fix for #1380 also fixes this.

worleyph commented 7 years ago

Needs to be tested in ACME though. Will changing BUILD_THREADED also change the CPP flags used in the build? I don't know when the cam configure is called ... if part of the build process, then this should be okay?

amametjanov commented 7 years ago

Pat, HORIZ_OPENMP has to be set any time -openmp is present in compile flags. I think you are trying to avoid having to re-build, because case.setup --reset forces you to re-build. If a variable that affects all component builds is changed, then re-build is required anyway.

Jim, it would help if you also reference ESMCI issue numbers for us to check the fixes. The turn-around time of waiting for ESMCI updates to come back to ACME is much longer.

worleyph commented 7 years ago

Pat, HORIZ_OPENMP has to be set any time -openmp is present in compile flags.

But it doesn't - that is the reason for the fix proposed above.

I just don't want to force the user to do a ./case.setup -r when changing something in env_build.xml. This is nonintuitive, and disagrees with the existing comments.

amametjanov commented 7 years ago

It wasn't set because of the regression with the SMP variable when converting from perl to python. If that is fixed, then we are back to how CAM-SE used to build:

worleyph commented 7 years ago

So, what will happen (once things are back to normal) if BUILD_THREADED is set to TRUE after ./case.setup but before ./case.build? Will CAM abort with the error message

 dyn_init1: number of OpenMP threads =             1
 ERROR: Error: threaded runs require -DHORIZ_OPENMP

or will -DHORIZ_OPENMP be added to the CPPFLAGS during ./case.build?

amametjanov commented 7 years ago

With current git master, case.build overwrites CAM's configure options built after case.setup. So the macro should be added during case.build.

worleyph commented 7 years ago

Thanks.

worleyph commented 7 years ago

@amametjanov , FYI, I will need to use the above fix until the relevant ESMCI merge takes place. The scenario is using threading in OCN but not in ICE, but needing ICE framework to be built with threading support. Setting BUILD_THREADED to TRUE is the only way that I know to make that happen. In this case, CAM will die (because it also has only one thread).

amametjanov commented 7 years ago

Another way is set that macro manually:

./xmlchange -file env_build.xml CAM_CONFIG_OPTS="-cppdef -DHORIZ_OPENMP" -append

which would then enable threading in CAM. Otherwise, CAM runs in pure-MPI instead of single-threaded mode (as intended by BUILD_THREADED=true).

BTW, a reminder that ICE in git master can also use 2+ threads.

worleyph commented 7 years ago

Thanks. In this case, CICE is not the performance bottleneck (OCN is) and is fine as MPI-only. I'm also not sure what version of the code I am using - came from BGC group with custom source modifications.

worleyph commented 7 years ago

@jgfouca and @amametjanov , this is still broken. I changed env_mach-specific.xml to use 1 thread in ATM, changed BUILD_THREADED to TRUE in env_build.xml, ./case.setup -c; ./case.setup; ./case.build --clean-all; ./case.build; ./case.submit, and still got the error:

dyn_init1: number of OpenMP threads = 1

ERROR: Error: threaded runs require -DHORIZ_OPENMP

jgfouca commented 7 years ago

@worleyph , I am unable to reproduce this error. I made a simple case and it looks like build_threaded and SMP are being handled correctly. Could you point me to a machine+case where you are seeing this?

worleyph commented 7 years ago

I'll try to reproduce as soon as I get the chance. I hope that I am mistaken.

worleyph commented 7 years ago

On Titan:

  ./create_newcase -case Test -compset A_WCYCL2000 -res ne30_oEC -mach titan -compiler intel -project cli115
  cd Test
  emacs env_mach_pes.xml
  emacs env_build.xml
  ./case.setup
  ./case.build
  ./case.submit

where in env_mach_pes.xml I changed

    <value component="ATM">2</value>

to

    <value component="ATM">1</value>

)

and in env_build.xml I changed

    <entry id="BUILD_THREADED" value="FALSE">

to

    <entry id="BUILD_THREADED" value="TRUE">

)

Job died during ATM initialization with

  dyn_init1: number of OpenMP threads =            1

  ERROR: Error: threaded runs require -DHORIZ_OPENMP
jgfouca commented 7 years ago

@worleyph, I am unable to duplicate the error. Is it possible your ACME repo is not up to date?

% case.build
...
Component csm_share build complete with 17 warnings
         - Building atm Library 
Building atm with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/atm.bldlog.170424-162201
         - Building lnd Library 
Building lnd with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/lnd.bldlog.170424-162201
         - Building ice Library 
Building ice with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/ice.bldlog.170424-162201
         - Building ocn Library 
Building ocn with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/ocn.bldlog.170424-162201
         - Building rof Library 
Building rof with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/rof.bldlog.170424-162201
         - Building glc Library 
Building glc with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/glc.bldlog.170424-162201
         - Building wav Library 
Building wav with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/wav.bldlog.170424-162201
         - Building esp Library 
Building esp with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/esp.bldlog.170424-162201
Component lnd build complete with 9 warnings
Component atm build complete with 7 warnings
Component ice build complete with 46 warnings
Component ocn build complete with 45 warnings
Building acme with output to /ccs/home/acmetest/acme_scratch/cli115/Test/bld/acme.bldlog.170424-162201
Time spent not building: 69.560485 sec
Time spent building: 1523.655984 sec

% ./xmlquery NTHRDS
    NTHRDS: ['CPL:1', 'ATM:1', 'LND:1', 'ICE:1', 'OCN:1', 'ROF:1', 'GLC:1', 'WAV:1', 'ESP:1']

% ./xmlquery BUILD_THREADED
    BUILD_THREADED: TRUE
worleyph commented 7 years ago

Unless it has been corrupted it is up to date. I'll check out a fresh copy and try again.

worleyph commented 7 years ago

@jgfouca , did you run the job? The failure is a runtime failure, not a build failure.

worleyph commented 7 years ago

@jgfouca , died in the identical way. @amametjanov should speak up here ... the issue is getting a CPP variable set in this situation, so modifying Macros? (Az, where should this go?)

jgfouca commented 7 years ago

@worleyph , apologies should have looked closer. I had assume a build failure. Trying again.

amametjanov commented 7 years ago

@worleyph, just to confirm: when BUILD_THREADED=TRUE and NTHRDS_ATM=1, CAM should be run in single-threaded mode, not pure-MPI. To enable this:

diff --git a/components/cam/cime_config/buildnml b/components/cam/cime_config/buildnml
index 4c784ec..2e28b52 100755
--- a/components/cam/cime_config/buildnml
+++ b/components/cam/cime_config/buildnml
@@ -42,6 +42,7 @@ my $RUN_REFTOD                = `./xmlquery  RUN_REFTOD               -value`;
 my $UTILROOT           = `./xmlquery  UTILROOT                 -value`;
 my $DOCN_MODE          = `./xmlquery  DOCN_MODE                -value`;
 my $COMPILER           = `./xmlquery  COMPILER                 -value`; # for chem preprocessor 
+my $BUILD_THREADED     = `./xmlquery  BUILD_THREADED           -value`;

 my @dirs = ("$CIMEROOT/utils/perl5lib");
 unshift @INC, @dirs;
@@ -75,7 +76,7 @@ if ($BUILD_COMPLETE eq 'FALSE') {
     my $spmd = '-spmd';
     if ($MPILIB eq 'mpi-serial') {$spmd = "-nospmd";}
     my $smp = '-smp';
-    if ($NTHRDS_ATM == 1) {$smp = "-nosmp";}
+    if ($NTHRDS_ATM == 1 && $BUILD_THREADED eq 'FALSE') {$smp = "-nosmp";}

     # The ocean component setting is only used by CAM to do attribute matching for
     # setting default tuning parameter values.  In SOM mode we want to use the same
worleyph commented 7 years ago

@amametjanov , BUILD_THREADED is not in components/cam/cime_config/buildnml in master. @jgfouca marked this as closed after the merge of CIME5.3. Is there another PR that contains this fix?

jgfouca commented 7 years ago

@worleyph , I ran the problem and was able to reproduce the error. As @amametjanov mentioned, this appears to be an issue with the cam component and not a problem with CIME infrastructure, so I'm going to unassign myself from the ticket.

rljacob commented 7 years ago

Its might be the interaction between CAM's buildnml and CIME.