E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM
Other
346 stars 353 forks source link

Cam: slow(?) builds on Summit #2980

Closed sarats closed 5 years ago

sarats commented 5 years ago

Opening issue to continue discussion from https://github.com/E3SM-Project/E3SM/pull/2973

It looks like 'cam' used to build much faster than it does now.

Using PGI compilers. June 7 log: cam built in 5274.946738 seconds.

Does this correlate to build times on other platforms? For reference, GMAKE_J is 32 on Summit and hasn't been changed lately.

New variables: Cmake, RRTMG (?)

Apr 3 log: cam built in 471 sec (needs confirmation) Trying to figure out pre-cmake version that's good for comparison? Any recent stuff under cam which substantially increased the build time?

@jgfouca Any pointers to make it an apples-to-apples comparison?

jgfouca commented 5 years ago

@sarats , try checking out the commit right before the first CMake merge:

7773f8686ef83d2075ae573958a850c03f8047a6

If the commit above builds fast, then confirm it was CMake by trying this commit:

b9323090292e02802fecac624313850b71ed6fe1
jgfouca commented 5 years ago

Be wary of high load on your login node since that will ruin the measurement.

worleyph commented 5 years ago

I'm repeating the experiments on Summit, also monitoring the load on the login nodes (using w). The loads are not high (between 1 and 7), and the issue continues. Looking in the bld directory, apparently the ATM source files finish compiling after around 6 minutes, and then NOTHING happens for another 1.5 hours? The timestamp for the CMakeFiles subdirectory reflects the end time, but none of th directories or the files within them have any timestamps later than those on the ATM object files.

Looking at the "Total time" statistics in the atm.bldlog, this adds up to 38 minutes (so more than the 6 minutes from looking at object files timestamps?), but still much less than 90 minutes.

Any idea where the time is being spent? The log files don't have sufficient timestamps to differentiate the phases of the compilation, at least not that I can figure out.

worleyph commented 5 years ago

ps says that I have spent over an hour in llc at the moment (still compiling atm):

121779 pts/0 01:01:31 llc

what is the llc process?

worleyph commented 5 years ago

Found llc with a grep in a completed build as ...

 CMakeFiles/CMakeOutput.log:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-18.10-w2qkfaill2olnyrfykr5a3kmwaz6pjur/linuxpower/18.10/share/llvm/bin/llc /tmp/pgfortraniEJyyVf_8qeX.ll -mcpu=native -O0 -fast-isel=0 -disable-ppc-preinc -code-model=large -o /tmp/pgfortraniEJyyQZYkIp-.s

but this log file has a timestamp from the beginning of the build, not the end, and llc does not show up in the atm.bldlog. FYI.

sarats commented 5 years ago

@worleyph I noticed a similar behavior. I tracked down the process tree for llc which seems to stem from compiling rrtmg: rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90.o

One avenue is to check standalone compile times for rrtmg to debug this further.

Using the following compile command for "rrtmg_sw/rrtmg_sw_k_g.f90" in build_dir/atm/obj is taking a long time.

/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-18.10-w2qkfaill2olnyrfykr5a3kmwaz6pjur/linuxpower/18.10/bin/pgfortran -DBIT64 -DCAM -DCLUBB_CAM -DCLUBB_REAL_TYPE=dp -DCLUBB_SGS -DCO2A -DCPRPGI -DFORTRANUNDERSCORE -DHAVE_COMM_F2C -DHAVE_F2003_PTR_BND_REMAP -DHAVE_GETTIMEOFDAY -DHAVE_MPI -DHAVE_NANOTIME -DHAVE_SLASHPROC -DHAVE_TIMES -DHAVE_VPRINTF -DLINUX -DLSMLAT=1 -DLSMLON=1 -DMAXPATCH_PFT=numpft+1 -DMCT_INTERFACE -DMODAL_AER -DMODAL_AERO -DMODAL_AERO_4MODE_MOM -DNC=4 -DNDEBUG -DNO_LAPACK_ISNAN -DNO_R16 -DNO_SHR_VMATH -DNP=4 -DN_RAD_CNST=30 -DPCNST=40 -DPCOLS=16 -DPIO1 -DPLAT=1 -DPLEV=72 -DPLON=866 -DPSUBCOLS=1 -DPTRK=1 -DPTRM=1 -DPTRN=1 -DRAIN_EVAP_TO_COARSE_AERO -DSPMD -D_MPI -D_PRIM -D_WK_GRAD -i4 -gopt -time -Mstack_arrays -Mextend -byteswapio -Mflushz -Kieee -Mallocatable=03 -O2 -DSUMMITDEV_PGI -I. -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/pgi/spectrum-mpi/nodebug/nothreads/mct/include -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/pgi/spectrum-mpi/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/netcdf-4.6.1-rm3pbg7vsmxrqrskto27z2pgvogvhcvw/include -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/netcdf-fortran-4.4.4-xwpszk4dbh7aedfpkeg3nauczudncscy/include -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/pgi/spectrum-mpi/nodebug/nothreads/mct/include -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/pp_linoz_mam4_resus_mom_soag -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/modal_aero -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/aerosol -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/mozart -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/utils -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_mcica -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_lw -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_sw -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/cam -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/clubb -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/dynamics/se -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/homme/src/share -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/homme/src/preqx -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/homme/src/preqx/share -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/cpl -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/control -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/utils -I. -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/lib/include -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/lnd/obj -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/lnd/obj -Mfree -O0 -c /autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90 -o CMakeFiles/atm.dir/cam/src/physics/rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90.o -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/spectrum-mpi-10.3.0.0-20190419-femacxmvjgm5iwt444yxbk2merwcz6s6/include -lpthread -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/spectrum-mpi-10.3.0.0-20190419-femacxmvjgm5iwt444yxbk2merwcz6s6/lib/PGI

Timing stats: init 1366 millisecs 0% import 2984 millisecs 0% expand 1935 millisecs 0% vectorize 21646 millisecs 4% optimize 6984 millisecs 1% schedule 494266 millisecs 92% unroll 6469 millisecs 1% Total time 535650 millisecs

But it practically hangs there (I presume in the last step in processing the assembly file using llc - a child process from the compile) .

/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-18.10-w2qkfaill2olnyrfykr5a3kmwaz6pjur/linuxpower/18.10/share/llvm/bin/llc /tmp/pgfortranU8NekgEg2-wH.llvm -mcpu=native -O2 -disable-ppc-preinc -code-model=large -o /tmp/pgfortran_8Ne6o8VTb4Q.s

jgfouca commented 5 years ago

That's one of the files where optimization needs to be disabled for it to compile in a reasonable time. Did it try to optimize this file?

sarats commented 5 years ago

Yes, there is -O2 flag. So, are you disabling optimization in a Depends file on other platforms? What other files are known trouble-makers in the same vein?

jgfouca commented 5 years ago

@sarats , the -O2 flag is expected, but there should be a -O0 flag later in the compile line that overrides it.

The troublemakers on all platforms are (from components/CMakeLists.txt):

set(NOOPT_FILES "cam/src/physics/rrtmg/ext/rrtmg_lw/rrtmg_lw_k_g.f90;cam/src/physics/rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90")

And summit also disables optimizations on additional files (Depends.summit.cmake):

list(APPEND NOOPT_FILES
  cam/src/dynamics/eul/dyn_comp.F90
  cam/src/dynamics/fv/dyn_comp.F90
  cam/src/dynamics/se/dyn_comp.F90
  cam/src/dynamics/sld/dyn_comp.F90
  cam/src/physics/cam/microp_aero.F90)
worleyph commented 5 years ago

So, to summarize, it appears that compilation of rrtmg_sw_k_g.f90 is the bottleneck . I just looked at my compile line, and it does have -O0 (after -O2). So does rrtmg_lw_k_g.f90 . There are 4 files total with the -O0 flag, others being dyn_comp.F90 and microp_aero.F90.

Found one comment about "slow performance in llc.exe", but nothing like the 90 mintues we are seeing. From that one comment (http://llvm.1065342.n5.nabble.com/llvm-dev-slow-performance-in-llc-exe-to-do-with-large-global-floating-point-arrays-td128302.html):

 We are finding some pretty bad compiler performance in some cases.  
 For example, this github gist contains a bitcode file which is a neural 
 network compiled by ELL and it has about 30mb of floating point data, 
 and when we put that through llc it takes 262 seconds to compile
  (on an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz), but if we strip out 
 the weights the "code" component of our neural network inference takes 
 only 2 seconds to compile.

So, rrtmg_sw_k_g.F90 has LOTS on data, maybe 3.8GB when measured via ascii representation. Perhaps this is the same issue.

Only response to the above observation was

 Have you tried -time-passes to see where it’s actually spending time?  
 I don’t think there’s currently a timer that covers printing global variables
  to assembly, but you should be able to rule out something else.

This was May 9, 2019. @sarats, it would be interesting to see if the IBM compiler has the same behavior. Does it also use LLVM?

worleyph commented 5 years ago

Note that rrtmgp reads in the data (see physics/rrtmgp/external/rrtmgp/data/). Perhaps it will not have the same problem?

worleyph commented 5 years ago

... 3.8GB ... (not 3.8GM)

sarats commented 5 years ago

Yes, that's my plan as well - to see how XL behaves. I didn't get to it today. We probably should package a few files along with the troublemaker for a reproducer and report to PGI.

sarats commented 5 years ago

@worleyph I agree the issue is with initializing floating point data.

On a hunch, I tried PGI 19.4 and the build time improved substantially. ne4 ran fine. ne30 is running presently. Previously, we had problem with moving to 19.3 (?) but things seem to run okay with 19.4 now.

ne4:

 cam built in 613.689390 seconds
 clm built in 730.121472 seconds
 Time spent not building: 3.063171 sec
 Time spent building: 1038.593147 sec

ne30:

cam built in 586.199732 seconds
clm built in 750.239044 seconds
Time spent not building: 2.837199 sec
Time spent building: 1056.434291 sec
worleyph commented 5 years ago

According to Ed D'Azevedo, on Titan use of the LLVM was an option when using PGI (if I remember Ed's comment correctly) and was not used by default. Perhaps it is still an option (that can be disabled) on Summit? Might ask your contacts at the OLCF.

worleyph commented 5 years ago

Just realized that 19.4 is a newer compiler version (I was thinking 18.4 :-) .) Never mind. Newer is (sometimes) better.

sarats commented 5 years ago

Edited: I don't think the LLVM backend is optional anymore.