Open sarats opened 1 year ago
Ref: Source file in master: https://github.com/E3SM-Project/E3SM/blob/master/components/eam/src/chemistry/mozart/UCI_cld_sub_mod.f90#L573
Removing includes etc, following is able to reproduce the error.
amdflang -march=znver3 -DBIT64 -DCAM -DCNL -DCO2A -DCPRAMD -DCRM_DT=10 -DCRM_DX=2000 -DCRM_NX=64 -DCRM_NX_RAD=4 -DCRM_NY=1 -DCRM_NY_RAD=1 -DCRM_NZ=50 -DFORTRANUNDERSCORE -DHAVE_COMM_F2C -DHAVE_F2003_PTR_BND_REMAP -DHAVE_GETTIMEOFDAY -DHAVE_MPI -DHAVE_NANOTIME -DHAVE_SLASHPROC -DHAVE_TIMES -DHAVE_VPRINTF -DHOMME_ENABLE_COMPOSE -DLINUX -DLSMLAT=1 -DLSMLON=1 -DMAXPATCH_PFT=numpft+1 -DMCT_INTERFACE -DMMF_SAMXX -DMODEL_THETA_L -DNC=4 -DNDEBUG -DNO_R16 -DNP=4 -DNPG=2 -DN_RAD_CNST=30 -DPCNST=12 -DPCOLS=16 -DPLAT=1 -DPLEV=60 -DPLON=384 -DPSUBCOLS=1 -DPTRK=1 -DPTRM=1 -DPTRN=1 -DSPMD -DUSE_COSP -DYES3DVAL=0 -D_MPI -D_PNETCDF -D_PRIM -Dsam1mom -c /lustre/orion/cli115/proj-shared/testing/E3SM/components/eam/src/chemistry/mozart/UCI_cld_sub_mod.f90
@rljacob it is not a MOSART/river issue, but EAM/atmosphere issue?
Sorry. I got mozart chemistry confused with mosart river.
@sarats , the line 573 in question seems okay to me.
integer I,K,L,LL,N,NC, L1,L2,L3, LCLTOP,LCIRRUS
I am not sure why AMD compiler complains about the syntax. But no errors on the line 570 above:
integer :: NRGX, NICAX
I don't have access to Frontier to test it. Can you try if this works (just a guess)?
integer :: I,K,L,LL,N,NC, L1,L2,L3, LCLTOP,LCIRRUS
I tried that already. It looks like a compiler bug - we will submit to OLCF. cc @grnydawn.
@tangq I'm trying to put together a standalone reproducer. Can you point to where CBIN_
is defined? Is it passed as an argument or defined elsewhere?
Never mind, found https://github.com/E3SM-Project/E3SM/blob/master/components/eam/src/chemistry/mozart/UCI_fjx_cmn_mod.F90#LL246
integer, parameter :: CBIN_ = 10 ! # of quantized cloud fration bins
Great that you found it already.
Submitted to OLCF, ref OLCFHELP-12499.
Last response from OLCF:
I've been told that amdflang is currently not actively getting support as AMD is working on a new compiler to replace it. So I am doubtful we'll get a fix for this. So if you're able to use the Cray or GNU compilers, then our recommendation is to stick to those for Fortran.
FWIW, we always knew that AMD's Fortran is flaky. There are currently 62 build errors and 67 test failures on Crusher. Even recently, we had as low as 10 build errors and 15 total test fails. https://my.cdash.org/build/2340290
Anyway, we should probably pause the automated nightly tests with amdflang on Crusher. We can revisit once we identify any workaround.
@sarats I just stumbled on the same problem using GNU on Summit and I think it's related to the "-DNC=4" that is added for the SE build, which seems to replacing the "NC" variable in the UCI chemistry code...? However, this only happens with the new MMF I'm working on integrating and not the "main" MMF or the non-MMF configurations. So I'm very confused about this... All cases define "-DNC=4" and all cases build the UCI files...
I can't supply a reproducer at the moment because this is only happening in a temporary branch that I'm using to test a merge with the current master branch into my PAM development branch.
IIRC NC was used for a previous implementation of SL transport in homme. It is still in a lot of homme files, though i did not look closely.
@oksanaguba would you clarify how this is related to SL transport? I don't see the NC
symbol at all in Homme. Homme of course uses NP
extensively, independently of transport method and other details.
Edit: Are you referring to that ancient, pre-COMPOSE, version from before my time? It is interesting that eam/bld/configure has this:
if ($dyn_pkg eq 'se') {
my $csnp = $cfg_ref->get('csnp');
$cfg_cppdefs .= " -DNP=$csnp -DNC=4 -DHAVE_F2003_PTR_BND_REMAP";
and yet Homme doesn't use NC
.
yes, i meant the old FV scheme from maybe CSLAM and/or C. Erath's work. I agree that EAM build of homme probably does not need NC, but NC is in a lot of standalone files, and also in
src/theta-l/config.h.cmake.in:#define NC @NUM_CELLS@
Maybe simply removing it from eam config would work, or not.
I think I see the issue. First, for cleanliness, we definitely should not define NC in bld/configure. We also ought to make NP something like HOMME_NP.
But, second, the AMD-related issue is because the compiler is not differentiating between f90 and F90. The first by convention does not trigger the macro preprocessor, while the second does. The chemistry files in question use f90, so with most compilers, NC use is safe. With the AMD one, evidently it's not.
Here's a reproducer with gfortran to explain what I mean:
$ cat define.F90
program main
integer :: NC, NP
NC = 1
NP = 3
print *, NC, NP
end program main
$ gfortran -DNC=4 define.F90
define.F90:2:14:
integer :: NC, NP
1
Error: Invalid character in name at (1)
define.F90:3:5:
NC = 1
1
Error: Invalid character in name at (1)
$ cp define.F90 define.f90
$ gfortran -DNC=4 define.f90
$ ./a.out
1 3
Great find @ambrad! This is consistent with all the symptoms I've been seeing with GNU.
There must be flags to make these compilers distinguish between f90 and F90, right?
There is probably a flag to force it to run the Fortran internal preprocessor (which should be used instead of cpp anyway).
I'm testing e3sm_developer with -DNC=4
removed and will report back.
Update: Looks good, so I'll make a PR.
On Frontier, the AMD compiler is having trouble parsing
mozart/UCI_cld_sub_mod.f90
.Compiler version
The source lines in question.
Additional context for source. A common pattern is the presence of
CBIN_
around the site of the errors.