ESCOMP / CISM

Community Ice Sheet Model
GNU Lesser General Public License v3.0

cesm build of cism is very slow #53

Closed · jedwards4b closed 8 months ago

jedwards4b commented 2 years ago

CISM is one of the slowest builds in a CESM BMOM case.
I clocked 409 s on cheyenne in case PFS.ne30pg3z58_t061.B1850MOM.cheyenne_intel

billsacks commented 2 years ago

Thanks for opening this issue @jedwards4b. I have noticed this as well, mainly (or perhaps only) with the intel compiler (gnu build times are fast, or at least they were a year ago, when I first noticed this issue with intel build times).

I noticed this got worse about a year ago.

whlipscomb commented 2 years ago

@jedwards4b and @billsacks, thanks for looking at this. Adding @Katetc to the thread. I'd very much like to identify and fix the problem. I usually use the gnu compiler for code development because intel is so slow.

What's the best way to approach the issue? Are there some general rules about code structures to avoid? Or good ways to identify the offending procedures or lines of code?

billsacks commented 2 years ago

I don't have any good strategies for approaching this. I would probably start by identifying the offending file(s) by looking at the build time of each file. I'm not sure if there's a way to get per-file build time information in the build log (@jedwards4b, do you know?); if not, you could set GMAKE_J=1 and then watch the build log output to see if it stalls on a file. Assuming you can identify a problematic file, you could look at the diffs between cism2_1_78 and cismwrap_2_1_79 to see if anything looks like a likely culprit. I'm not sure how easy that will be, but my hope is that we can identify an offending file without too much trouble, and then, if we're lucky, the diffs won't be too extensive and/or there will be something fairly obviously weird about the changes in that file...

@jedwards4b do you have any suggestions for a better way to look into this? Also, before spending a lot of time on this, I wonder if it would be worth trying the build with a more recent version of the intel compiler (we're using v19 on cheyenne, so it's 3 years old): the problem may go away with a newer compiler version, in which case it might not be worth spending a lot of time trying to figure this out. However, I'm also not sure how hard it would be to get the build working with a newer intel version.

jedwards4b commented 2 years ago

So if you look at the timestamps of the object files produced, I think you can get some idea of what is going on. For example, this build started at 13:13, as evidenced by the timestamp of the Filepath file:

```
-rw-r--r-- 1 jedwards ncar 342 Apr 20 13:13 Filepath
```

and ended at 13:20 with the nuopc cap file:

```
-rw-r--r-- 1 jedwards ncar 76424 Apr 20 13:20 glc_comp_nuopc.o
```

It looks like most of the time was spent compiling the glide_io file:

```
-rw-r--r-- 1 jedwards ncar   238870 Apr 20 13:14 glissade_velo.mod
-rw-r--r-- 1 jedwards ncar 73564004 Apr 20 13:18 glide_io.mod
-rw-r--r-- 1 jedwards ncar   449775 Apr 20 13:19 glide_stop.mod
```

Katetc commented 2 years ago

Thanks for pointing this out, guys. I've been looking at it this afternoon (starting to have more time for land ice work!) and I do see about 4 minutes spent building glide_io.F90. This file is auto-generated at build time, but the generation step doesn't actually seem to be the slow part; the slow part is the actual compilation of the file. Now, I know a big difference between cism2_1_78 and cismwrap_2_1_79 was the number of namelist fields: we added several new namelist and output fields between these tags. I'm not sure, but I think glide_io.F90 became much longer after this tag. And I'm noticing that this file uses an unusual cpp #define approach for referring to the netCDF output and input files:

```fortran
#define NCO outfile%nc
#define NCI infile%nc
```

And both of these macros are referenced many, many times:

```fortran
if (.not. outfile%append) then
   status = parallel_def_dim(NCO%id, 'x0', model%parallel%global_ewn-1, x0_dimid)
else
   status = parallel_inq_dimid(NCO%id, 'x0', x0_dimid)
endif
```

This pattern of referencing derived-type components through cpp-defined macros is not something I've seen very often. I could see the Intel Fortran compiler (or another Fortran compiler) having some issues with it.
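For context, cpp expands these macros textually before the Fortran compiler ever sees the code, so the block above reaches the compiler as (expansion shown for illustration):

```fortran
! After preprocessing, NCO has become outfile%nc everywhere it appeared:
if (.not. outfile%append) then
   status = parallel_def_dim(outfile%nc%id, 'x0', model%parallel%global_ewn-1, x0_dimid)
else
   status = parallel_inq_dimid(outfile%nc%id, 'x0', x0_dimid)
endif
```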

whlipscomb commented 2 years ago

@Katetc, Nice sleuthing. If you have time tomorrow, let's follow up and talk about whether we can get the same functionality without the c-def variables.

jedwards4b commented 2 years ago

I would be surprised if the cpp macros were the cause of the slowdown.

whlipscomb commented 2 years ago

@jedwards4b, is there another possible explanation?

jedwards4b commented 2 years ago

The file glide_io.F90 is autogenerated, but that step happens very quickly. It is the Fortran compile of the autogenerated file that is taking so long. I timed it at 4:47 with -O2 and 4:14 with -O0. Subroutine glide_io_create is some 7000 lines.

whlipscomb commented 2 years ago

@jedwards4b, Indeed it's a long file, but there are other big files in CISM that compile in a few seconds. I'm wondering if there are specific structures in the autogenerated file that trip up the Intel compiler (but which the gnu compiler, for whatever reason, handles more efficiently). If we can identify those structures, then we may be able to modify the autogenerate script to do things differently.

whlipscomb commented 2 years ago

Here's another possibility. At the end of module glide_io.F90 there are many accessor subroutines, of the form glide_set_field(data, inarray) and glide_get_field(data, outarray). Each subroutine uses four modules (glimmer_scales, glimmer_paramets, glimmer_physcon, glide_types) without an 'only' specification. Is it taking the compiler a long time to bring in those modules? If so, we could either figure out a way to add the appropriate 'only' clauses, or do without these subroutines entirely. The used modules, especially glide_types, have grown over time.
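For illustration, the pattern looks roughly like this (a minimal sketch; the declarations are schematic, not copied from glide_io.F90):

```fortran
subroutine glide_get_field(data, outarray)
   ! Each accessor uses four whole modules with no 'only' clause,
   ! so the compiler must process every public name in all of them.
   use glimmer_scales
   use glimmer_paramets
   use glimmer_physcon
   use glide_types
   implicit none
   type(glide_global_type), intent(in)  :: data      ! type assumed to come from glide_types
   real(8), dimension(:,:), intent(out) :: outarray  ! kind chosen just for this sketch
   ! ... copy (and possibly rescale) the requested field into outarray ...
end subroutine glide_get_field
```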

billsacks commented 2 years ago

It also seems possible that just having so many separate use statements could cause problems, whether or not they have an 'only' clause. What about consolidating them so that they appear once at the top of the module rather than being listed separately in each subroutine?
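Roughly like this (a sketch of the idea, not the actual glide_io.F90 code):

```fortran
module glide_io
   ! Each module is now used once, at module scope; host association
   ! makes its entities visible inside every contained subroutine.
   use glimmer_scales
   use glimmer_paramets
   use glimmer_physcon
   use glide_types
   implicit none
contains

   subroutine glide_get_field(data, outarray)
      ! No per-subroutine use statements needed any more.
      type(glide_global_type), intent(in)  :: data
      real(8), dimension(:,:), intent(out) :: outarray
      ! ... body unchanged ...
   end subroutine glide_get_field

end module glide_io
```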

whlipscomb commented 2 years ago

@billsacks, That's a good suggestion, and easy to implement. I'll give it a try.

Katetc commented 8 months ago

This change was implemented in CISM PR #57 and PR #58, both contained in CISM tag cism_main_2.01.013 and included in CISM wrapper tag cismwrap_2_1_97, which will be included in cesm2_3_alpha17a. Marking as addressed and closing the issue.