Closed sarats closed 5 years ago
@sarats , try checking out the commit right before the first CMake merge:
7773f8686ef83d2075ae573958a850c03f8047a6
If the commit above builds fast, then confirm it was CMake by trying this commit:
b9323090292e02802fecac624313850b71ed6fe1
Be wary of high load on your login node since that will ruin the measurement.
I'm repeating the experiments on Summit, also monitoring the load on the login nodes (using w). The loads are not high (between 1 and 7), and the issue continues. Looking in the bld directory, apparently the ATM source files finish compiling after around 6 minutes, and then NOTHING happens for another 1.5 hours? The timestamp for the CMakeFiles subdirectory reflects the end time, but none of th directories or the files within them have any timestamps later than those on the ATM object files.
Looking at the "Total time" statistics in the atm.bldlog, this adds up to 38 minutes (so more than the 6 minutes from looking at object files timestamps?), but still much less than 90 minutes.
Any idea where the time is being spent? The log files don't have sufficient timestamps to differentiate the phases of the compilation, at least not that I can figure out.
ps says that I have spent over an hour in llc at the moment (still compiling atm):
121779 pts/0 01:01:31 llc
what is the llc process?
Found llc with a grep in a completed build as ...
CMakeFiles/CMakeOutput.log:/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-18.10-w2qkfaill2olnyrfykr5a3kmwaz6pjur/linuxpower/18.10/share/llvm/bin/llc /tmp/pgfortraniEJyyVf_8qeX.ll -mcpu=native -O0 -fast-isel=0 -disable-ppc-preinc -code-model=large -o /tmp/pgfortraniEJyyQZYkIp-.s
but this log file has a timestamp from the beginning of the build, not the end, and llc does not show up in the atm.bldlog. FYI.
@worleyph I noticed a similar behavior. I tracked down the process tree for llc which seems to stem from compiling rrtmg: rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90.o
One avenue is to check standalone compile times for rrtmg to debug this further.
Using the following compile command for "rrtmg_sw/rrtmg_sw_k_g.f90" in build_dir/atm/obj is taking a long time.
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-18.10-w2qkfaill2olnyrfykr5a3kmwaz6pjur/linuxpower/18.10/bin/pgfortran -DBIT64 -DCAM -DCLUBB_CAM -DCLUBB_REAL_TYPE=dp -DCLUBB_SGS -DCO2A -DCPRPGI -DFORTRANUNDERSCORE -DHAVE_COMM_F2C -DHAVE_F2003_PTR_BND_REMAP -DHAVE_GETTIMEOFDAY -DHAVE_MPI -DHAVE_NANOTIME -DHAVE_SLASHPROC -DHAVE_TIMES -DHAVE_VPRINTF -DLINUX -DLSMLAT=1 -DLSMLON=1 -DMAXPATCH_PFT=numpft+1 -DMCT_INTERFACE -DMODAL_AER -DMODAL_AERO -DMODAL_AERO_4MODE_MOM -DNC=4 -DNDEBUG -DNO_LAPACK_ISNAN -DNO_R16 -DNO_SHR_VMATH -DNP=4 -DN_RAD_CNST=30 -DPCNST=40 -DPCOLS=16 -DPIO1 -DPLAT=1 -DPLEV=72 -DPLON=866 -DPSUBCOLS=1 -DPTRK=1 -DPTRM=1 -DPTRN=1 -DRAIN_EVAP_TO_COARSE_AERO -DSPMD -D_MPI -D_PRIM -D_WK_GRAD -i4 -gopt -time -Mstack_arrays -Mextend -byteswapio -Mflushz -Kieee -Mallocatable=03 -O2 -DSUMMITDEV_PGI -I. -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/pgi/spectrum-mpi/nodebug/nothreads/mct/include -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/pgi/spectrum-mpi/nodebug/nothreads/mct/mct/noesmf/c1a1l1i1o1r1g1w1i1e1/include -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/netcdf-4.6.1-rm3pbg7vsmxrqrskto27z2pgvogvhcvw/include -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/netcdf-fortran-4.4.4-xwpszk4dbh7aedfpkeg3nauczudncscy/include -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/pgi/spectrum-mpi/nodebug/nothreads/mct/include -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/pp_linoz_mam4_resus_mom_soag -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/modal_aero -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/aerosol -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/mozart -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/chemistry/utils -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_mcica -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_lw -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_sw -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/cam -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/clubb -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/dynamics/se -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/homme/src/share -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/homme/src/preqx -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/homme/src/preqx/share -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/cpl -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/control -I/autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/utils -I. -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/lib/include -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/lnd/obj -I/ccs/home/sarat/e3sm-scratch/cli115/pgifne4_june8/bld/lnd/obj -Mfree -O0 -c /autofs/nccs-svm1_home1/sarat/repos/E3SM/components/cam/src/physics/rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90 -o CMakeFiles/atm.dir/cam/src/physics/rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90.o -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/spectrum-mpi-10.3.0.0-20190419-femacxmvjgm5iwt444yxbk2merwcz6s6/include -lpthread -I/autofs/nccs-svm1_sw/summit/.swci/1-compute/opt/spack/20180914/linux-rhel7-ppc64le/pgi-18.10/spectrum-mpi-10.3.0.0-20190419-femacxmvjgm5iwt444yxbk2merwcz6s6/lib/PGI
Timing stats: init 1366 millisecs 0% import 2984 millisecs 0% expand 1935 millisecs 0% vectorize 21646 millisecs 4% optimize 6984 millisecs 1% schedule 494266 millisecs 92% unroll 6469 millisecs 1% Total time 535650 millisecs
But it practically hangs there (I presume in the last step in processing the assembly file using llc - a child process from the compile) .
/autofs/nccs-svm1_sw/summit/.swci/0-core/opt/spack/20180914/linux-rhel7-ppc64le/gcc-4.8.5/pgi-18.10-w2qkfaill2olnyrfykr5a3kmwaz6pjur/linuxpower/18.10/share/llvm/bin/llc /tmp/pgfortranU8NekgEg2-wH.llvm -mcpu=native -O2 -disable-ppc-preinc -code-model=large -o /tmp/pgfortran_8Ne6o8VTb4Q.s
That's one of the files where optimization needs to be disabled for it to compile in a reasonable time. Did it try to optimize this file?
Yes, there is -O2 flag. So, are you disabling optimization in a Depends file on other platforms? What other files are known trouble-makers in the same vein?
@sarats , the -O2 flag is expected, but there should be a -O0 flag later in the compile line that overrides it.
The troublemakers on all platforms are (from components/CMakeLists.txt):
set(NOOPT_FILES "cam/src/physics/rrtmg/ext/rrtmg_lw/rrtmg_lw_k_g.f90;cam/src/physics/rrtmg/ext/rrtmg_sw/rrtmg_sw_k_g.f90")
And summit also disables optimizations on additional files (Depends.summit.cmake):
list(APPEND NOOPT_FILES
cam/src/dynamics/eul/dyn_comp.F90
cam/src/dynamics/fv/dyn_comp.F90
cam/src/dynamics/se/dyn_comp.F90
cam/src/dynamics/sld/dyn_comp.F90
cam/src/physics/cam/microp_aero.F90)
So, to summarize, it appears that compilation of rrtmg_sw_k_g.f90 is the bottleneck . I just looked at my compile line, and it does have -O0 (after -O2). So does rrtmg_lw_k_g.f90 . There are 4 files total with the -O0 flag, others being dyn_comp.F90 and microp_aero.F90.
Found one comment about "slow performance in llc.exe", but nothing like the 90 mintues we are seeing. From that one comment (http://llvm.1065342.n5.nabble.com/llvm-dev-slow-performance-in-llc-exe-to-do-with-large-global-floating-point-arrays-td128302.html):
We are finding some pretty bad compiler performance in some cases.
For example, this github gist contains a bitcode file which is a neural
network compiled by ELL and it has about 30mb of floating point data,
and when we put that through llc it takes 262 seconds to compile
(on an Intel(R) Xeon(R) CPU E5-1650 0 @ 3.20GHz), but if we strip out
the weights the "code" component of our neural network inference takes
only 2 seconds to compile.
So, rrtmg_sw_k_g.F90 has LOTS on data, maybe 3.8GB when measured via ascii representation. Perhaps this is the same issue.
Only response to the above observation was
Have you tried -time-passes to see where it’s actually spending time?
I don’t think there’s currently a timer that covers printing global variables
to assembly, but you should be able to rule out something else.
This was May 9, 2019. @sarats, it would be interesting to see if the IBM compiler has the same behavior. Does it also use LLVM?
Note that rrtmgp reads in the data (see physics/rrtmgp/external/rrtmgp/data/). Perhaps it will not have the same problem?
... 3.8GB ... (not 3.8GM)
Yes, that's my plan as well - to see how XL behaves. I didn't get to it today. We probably should package a few files along with the troublemaker for a reproducer and report to PGI.
@worleyph I agree the issue is with initializing floating point data.
On a hunch, I tried PGI 19.4 and the build time improved substantially. ne4 ran fine. ne30 is running presently. Previously, we had problem with moving to 19.3 (?) but things seem to run okay with 19.4 now.
ne4:
cam built in 613.689390 seconds
clm built in 730.121472 seconds
Time spent not building: 3.063171 sec
Time spent building: 1038.593147 sec
ne30:
cam built in 586.199732 seconds
clm built in 750.239044 seconds
Time spent not building: 2.837199 sec
Time spent building: 1056.434291 sec
According to Ed D'Azevedo, on Titan use of the LLVM was an option when using PGI (if I remember Ed's comment correctly) and was not used by default. Perhaps it is still an option (that can be disabled) on Summit? Might ask your contacts at the OLCF.
Just realized that 19.4 is a newer compiler version (I was thinking 18.4 :-) .) Never mind. Newer is (sometimes) better.
Edited: I don't think the LLVM backend is optional anymore.
Opening issue to continue discussion from https://github.com/E3SM-Project/E3SM/pull/2973
It looks like 'cam' used to build much faster than it does now.
Using PGI compilers. June 7 log: cam built in 5274.946738 seconds.
Does this correlate to build times on other platforms? For reference, GMAKE_J is 32 on Summit and hasn't been changed lately.
New variables: Cmake, RRTMG (?)
Apr 3 log: cam built in 471 sec (needs confirmation) Trying to figure out pre-cmake version that's good for comparison? Any recent stuff under cam which substantially increased the build time?
@jgfouca Any pointers to make it an apples-to-apples comparison?