geoschem / geos-chem

GEOS-Chem "Science Codebase" repository. Contains GEOS-Chem science routines, run directory generation scripts, and interface code. This repository is used as a submodule within the GCClassic and GCHP wrappers, as well as in other modeling contexts (external ESMs).
http://geos-chem.org

[QUESTION] Running gchp on a newly commissioned super computer #143

Closed philotrum closed 4 years ago

philotrum commented 4 years ago

I am trying to run gchp on a newly commissioned supercomputer (Gadi, NCI Australia). The old computer (Raijin) is still running, and we are migrating to the new one. I have copied the code directory (12.2.0_gchp) and a run directory to the new computer. The run fails at the point where it looks for the libesmf.so file. It is looking in the path for the old computer, which obviously doesn't exist. Is this path compiled in, or can it be edited without recompiling?

yantosca commented 4 years ago

Thanks for writing. Have you tried rebuilding GCHP from scratch on the new computer? That might solve the issue.

lizziel commented 4 years ago

If switching to a new system you should clone GCHP from Github and create a new run directory from the new clone. You will be prompted to set some paths during that initial run directory creation on the new system. Copying the source code and run directory from a different system will cause problems that are easier to sort out by starting from scratch.
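
For reference, a minimal sketch of a fresh checkout on the new system (the tag and directory names are assumptions based on the paths seen later in this thread; follow the GCHP wiki for your version for the exact run directory creation steps):

    git clone -b 12.2.0 https://github.com/geoschem/geos-chem.git Code.12.2.0_gchp
    cd Code.12.2.0_gchp
    git clone -b 12.2.0 https://github.com/geoschem/gchp.git GCHP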

philotrum commented 4 years ago

Thanks for getting back to me. I am trying to build from scratch now, but I am getting errors on the build. I set up a new environment file and created a new run directory, and this all looks OK to me. I did a make superclean and then make build. I have tried to build twice, using different versions of openmpi and netcdf for each build, and the error is different this time around. I am trying to sort through the build output now. I would like to test the exact same build on the new computer, so I want to avoid a recompile if I can. Also, the overlap time when both computers are running is pretty limited, so there is a bit of pressure at the moment.

yantosca commented 4 years ago

If the new computer has a slightly different OS version than the old computer, then the object files (.o), mod files (.mod) and shared libraries (like libesmf.so) might be unreadable on the new computer. So it might not be possible to do a direct copy-over.

Could you provide more information about the compiler, OS, library versions, etc. that you are using? We would need that info in order to make a recommendation as to how to try to resolve the issue.
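
A quick way to gather that information on the new system would be something like the following (assuming an environment-modules setup and that nc-config/nf-config are on your PATH):

    cat /etc/os-release    # OS name and version
    module list            # currently loaded compiler/MPI/netCDF modules
    ifort --version        # compiler version
    mpirun --version       # MPI version
    nc-config --version    # NetCDF-C version
    nf-config --version    # NetCDF-Fortran version, if installed separately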

lizziel commented 4 years ago

Please also include your build log if you are still having issues. You can rename it to have extension .txt and drag and drop it directly into the comment box to share the file.

philotrum commented 4 years ago

Thanks again for your help. Here are the compile logs from both builds.

compile_success.txt compile_fail.txt

They differ at:

%%% Making install in /scratch/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/GFDL_fms %%%
gmake[7]: Entering directory '/scratch/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/GFDL_fms'
Building dependency file affinity.d
./shared/mpp/include/mpp_define_nest_domains.inc:514:20: warning: missing terminating ' character
!--- to_pe's east

as opposed to:

%%% Making install in /short/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/GFDL_fms %%%
gmake[7]: Entering directory `/short/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/GFDL_fms'
Building dependency file xbt_drop_rate_adjust.d
Building dependency file affinity.d
Building dependency file read_mosaic.d

lizziel commented 4 years ago

I see two primary issues in your 'fail' log file.

  1. f2py is not found, which prevents module GMAO_gfio from fully building. This is actually not a problem since GMAO_gfio gets far enough along and most of its content is not needed in GCHP. You have this issue in your 'success' log as well. I put in a fix to at least avoid the problem starting in 12.5.0, in case you are interested: https://github.com/geoschem/gchp/commit/862bd7acf600fa42b0d4756b284307a901b346bb.
  2. MAPL_Base does not compile to completion. This is the core problem, since pretty much all of MAPL_Base is needed in GCHP. Here is the error that is encountered:
    
    mpif90 -c -DsysLinux -DESMA32 -DHAS_NETCDF4  -DH5_HAVE_PARALLEL -DMAPL 
    -DDO_COMMAS -DTWO_SIDED_COMM -DHAVE_SHMEM  -I/scratch/m19/gck574/GC/rundirs
    /geosfp.gchp.standard_gadi/CodeDir/GCHP/ESMF/Linux/include/ -I. -I/scratch/m19/gck574
    /GC/rundirs/geosfp.gchp.standard_gadi/CodeDir/GCHP/ESMF/Linux/include/ -I/this/is/fake -I/usr/include  -I/scratch/m19/gck574/GC/rundirs/geosfp.gchp.standard_gadi/CodeDir/GCHP/ESMF
    /Linux/mod/ -I/scratch/m19/gck574/GC/rundirs/geosfp.gchp.standard_gadi/CodeDir/GCHP/ESMF
    /Linux/mod/ -I. -I/scratch/m19/gck574/GC/rundirs/geosfp.gchp.standard_gadi/CodeDir
    /GCHP/ESMF/Linux/include/ -I/this/is/fake -I/usr/include -I/scratch/m19/gck574/GC/rundirs
    /geosfp.gchp.standard_gadi/CodeDir/GCHP/Shared/Linux/include/MAPL_cfio_r4 -I/scratch
    /m19/gck574/GC/rundirs/geosfp.gchp.standard_gadi/CodeDir/GCHP/Shared/Linux/include
    /GMAO_mpeu  -O3 -ftz -align all -fno-alias     -DUSE_CUBEDSPHERE  -fPIC -fpe0  -align 
    dcommons MAPL_MaxMinMod.F90
    <stdin>:2046:22: error: C++ style comments are not allowed in ISO C90
    <stdin>:2046:22: error: (this will be reported only once per input file)
    gmake[8]: *** [/scratch/m19/gck574/GC/rundirs/geosfp.gchp.standard_gadi/CodeDir/GCHP/Shared/Config/ESMA_base.mk:382: ESMFL_Mod.o] Error 1
    gmake[8]: *** Waiting for unfinished jobs....
    <stdin>:4653:54: error: C++ style comments are not allowed in ISO C90
    <stdin>:4653:54: error: (this will be reported only once per input file)
    gmake[8]: *** [/scratch/m19/gck574/GC/rundirs/geosfp.gchp.standard_gadi/CodeDir/GCHP/Shared/Config/ESMA_base.mk:382: MAPL_IO.o] Error 1
    gmake[8]: Leaving directory '/scratch/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/MAPL_Base'
    gmake[7]: *** [GNUmakefile:66: install] Error 2
    gmake[7]: Leaving directory '/scratch/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/MAPL_Base'

I did a web search for the error message "<stdin>:2046:22: error: C++ style comments are not allowed in ISO C90" and found that @JiaweiZhuang ran into this same issue when building GCHP on AWS (https://github.com/geoschem/gchp/issues/16). Try making the suggested fix in file GCHP/Shared/Config/ESMA_base.mk. Unfortunately this fix has not yet made it into the standard code, as it has not been high priority. I can add it to 12.7.0 in case others run into it, but be aware that the file will be gone in 13.0.0 when we switch to building with CMake.

As an aside, I recommend that you read the "Compiling Tips" section of the GCHP manual. It includes tips on how to home in on GCHP compile errors in the log, and explains why the log is so hard to read. The gist is that MAPL continues to compile even if one of its modules hits an error; it just goes on to the next module, and the build usually only stops later on, when GFDL compiles and the MAPL_Base library isn't found for linking. Errors are best found by searching for 'Making install' and looking at the lines directly above each occurrence to check that the modules compiled successfully. @JiaweiZhuang also wrote a script to help identify compile problems that you could try; see https://github.com/geoschem/gchp/issues/41.
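
For example, something along these lines (assuming your build log is named compile.log, as in the attachments above) will list each 'Making install' banner with a few lines of context above it:

    grep -n -B 4 "Making install" compile.log
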
philotrum commented 4 years ago

I got this message from a support person at NCI:

Hi Graham,

We have managed to trace this issue down to a change in the behaviour of the C preprocessor between Gadi and Raijin. The undefined symbols at the end of your compile.log file on Gadi should be found in libMAPL_Base.a, which is missing on Gadi. This library is missing because it cannot create 'ESMFL_Mod.o' due to failing at the preprocessing stage. This failure is due to the preprocessor interpreting the Fortran string concatenation operator, '//', as a C++ style comment, which is disallowed when the -ansi flag is passed to the preprocessor. Strictly speaking, this is the correct behaviour for the preprocessor, and this file should also have failed to build on Raijin; however, it appears that the older C preprocessor on Raijin lets this through. I managed to get the library created by changing the following rule in ESMA_base.mk (line 381):

.P90.o:
	@sed -e "/\!.*'/s/'//g" $< | $(CPP) $(CPPANSIX) $(FPPFLAGS) > $*___.f90
	$(ESMA_TIMER) $(FC) -c $(f90FLAGS) -o $*.o $*___.f90
	@$(RM) $*___.f90

to

.P90.o:
	@sed -e "/\!.*'/s/'//g" $< | fpp $(FPPFLAGS) > $*___.f90
	$(ESMA_TIMER) $(FC) -c $(f90FLAGS) -o $*.o $*___.f90
	@$(RM) $*___.f90

However, this is a common Makefile, and I haven't tested how this affects the other libraries geoschem needs.

I haven't tried another compile yet. I will do that and see how things go. I will let you know what happens.

philotrum commented 4 years ago

compile_gadi.txt

I have managed to build gchp on the new computer, but I have a number of fatal errors in the compile.log file. I am using the bash script you suggested to scan for the errors. Thanks for pointing me to it.

philotrum commented 4 years ago

Do these errors look like they will cause me problems? I first compiled with just your suggested edits to ESMA_base.mk, and it failed to build. I then made the changes suggested by Dale at NCI, left yours in, and managed to build the geos binary.

philotrum commented 4 years ago

I have tried running, but there is no output from the run. It runs for ~26 minutes and then exits. I have spent some hours trying to figure out what is happening, but I am feeling out of my depth. Here are the output files. Could you offer any assistance?

gchp.txt GCHP-normal.e191758.txt GCHP-normal.o191758.txt

yantosca commented 4 years ago

Thanks for writing. Can you try to compile with an older version of ifort (like ifort17)? I am not sure if we have yet successfully compiled GCHP with ifort 2019 on our end.

Also note: we are going to be out of the office later this week for the Thanksgiving holiday in the US (Nov 28-29).

LiamBindle commented 4 years ago

Hi everyone, I'm coming in on this late, so I apologize in advance if I've misunderstood anything.

@philotrum, I see you're using OpenMPI 4 on Gadi. According to @sdeastham, OpenMPI 4 only works with ESMF 8. GCHP 12.2 uses ESMF 7.1.0r, so it's odd that GCHP (specifically ESMF) is compiling on Gadi with OpenMPI 4.

Does Gadi have an OpenMPI 3 environment you could try?
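
On a modules-based system the check would look something like this (module names are assumptions; adjust to whatever is actually installed on Gadi):

    module avail openmpi        # list the OpenMPI builds provided on the system
    module load openmpi/3.1.4   # assumption: 3.1.4 is among the available builds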

sdeastham commented 4 years ago

Incidentally, you can replace ESMF v7.1.0r with ESMF 8.0.0 with no modifications. Just go to your GCHP directory and execute the following two lines:

mv ESMF unsafe_ESMF_v7   # For safe keeping - you can actually just delete it
git clone -b ESMF_8_0_0 --depth 1 https://git.code.sf.net/p/esmf/esmf ESMF

Then recompile EVERYTHING (./build.sh clean_all; ./build.sh build) with OpenMPI v4. I’ve been running this with GCHP 12.6.2 + OpenMPI v4.0.1 and OpenMPI v4.0.2 for several days now without issue.

jennyfisher commented 4 years ago

Thanks all - I'll let Graham test these things, this is really helpful. Just a note on software versions - we are really limited on Gadi as they are not porting old versions. Options are:

intel-compiler/2019.3.199  intel-compiler/2019.4.243
openmpi/2.1.6  openmpi/3.0.4  openmpi/3.1.4  openmpi/4.0.1(default)
netcdf/4.6.3  netcdf/4.6.3p  netcdf/4.7.1  netcdf/4.7.1p

If there is a recommended configuration from these options, please let us know!! Thanks again!

jennyfisher commented 4 years ago

Hi all, Happy Thanksgiving! When you get back... We have been having further conversations with our supercomputer support team and they are sending us back to you based on the remaining errors in the compile.log file (although we are getting an executable - which is also crashing - more on that later).

Here is what they said:

The problem with these messages:

Building dependency file mpp_update_domains2D_nonblock.d
shared/oda_tools/write_ocean_data.F90:38:10: fatal error: netcdf.inc: No such file or directory
 include
          ^~~~~~~~~~~~
compilation terminated.

is that you are using a custom, non-standard parser for generating the .d files (/scratch/m19/gck574/GC/Code.12.2.0_gchp/GCHP/Shared/Config/bin/fdp) that doesn't understand how to find header and module files on our systems (i.e. it directly searches for the file where it thinks it should be). You likely would have been getting the same messages on Raijin, since it uses a similar method for distributing header and module files, so I'm guessing they're harmless. If you use the compilers' own built-in dependency list generators it should work properly.

These lines:

cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory

are exactly what they say: it's trying to copy a file from ESMF that doesn't exist. I don't know why it wants to copy this file, or why it's not there (an incomplete copy of the source?)... you'll probably need to ask the developers about this one. Similarly, the cat-related ones are trying to read from files that don't exist - again, I don't know why they're not there.

I asked for some clarity on the comment about using the compilers' own built-in dependency list generators, and the response was:

You're looking for the -M (icc) or -gen-dep (ifort) options to the compilers. These make the compiler generate a GNU Make-compatible list of dependencies, the same as what that custom Perl script should be creating.
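
For concreteness, a rough sketch of what that would look like (file names here are purely illustrative; check the icc/ifort documentation for the exact option behaviour):

    icc -M my_code.c            # print a Make-style dependency list for a C source file
    ifort -gen-dep my_code.F90  # have ifort emit build dependencies for a Fortran source file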

This is all well outside of anything Graham or I understand -- but perhaps it means something to you and suggests how we can proceed from here?

Also despite the compile errors we are getting an executable. Here is the traceback:

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source     
libifcoremt.so.5   000014B5F2889555  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  000014B5F00EBD80  Unknown               Unknown  Unknown
libc-2.28.so       000014B5EFBFD939  __xstat64             Unknown  Unknown
libifcoremt.so.5   000014B5F288F3ED  for_inquire           Unknown  Unknown
geos               0000000000864A6A  Unknown               Unknown  Unknown
geos               000000000085EBC8  Unknown               Unknown  Unknown
libesmf.so         000014B5F4795507  Unknown               Unknown  Unknown
libesmf.so         000014B5F4798903  Unknown               Unknown  Unknown
libesmf.so         000014B5F4B55945  Unknown               Unknown  Unknown
libesmf.so         000014B5F479659D  Unknown               Unknown  Unknown
libesmf.so         000014B5F4D6184D  esmf_compmod_mp_e        1199  ESMF_Comp.F90
libesmf.so         000014B5F4FA13C6  esmf_gridcompmod_        1889  ESMF_GridComp.F90
geos               0000000000851B5C  Unknown               Unknown  Unknown
geos               0000000000703D3C  MAIN__                     38  GEOSChem.F90
geos               000000000040F2B2  Unknown               Unknown  Unknown
libc-2.28.so       000014B5EFB34813  __libc_start_main     Unknown  Unknown                         
geos               000000000040F1BE  Unknown               Unknown  Unknown                         

Thanks!

sdeastham commented 4 years ago

Hi Jenny,

Happy Thanksgiving! I realized that you listed a compiler, an MPI installation, and a NetCDF-C installation, but not NetCDF-Fortran. The "netcdf.inc" file is part of NetCDF-Fortran, which is now a separate package (although NetCDF-C is a dependency). Do you have NetCDF-Fortran installed? If so, am I right in assuming that it's in a different location than the NetCDF-C installation? This would explain the "missing netcdf.inc" errors. Fixing this is in theory not too difficult.
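
If nc-config and nf-config are on your PATH after loading the netcdf module (an assumption on my part), the two install locations can be compared directly:

    nc-config --includedir   # NetCDF-C include directory
    nf-config --includedir   # NetCDF-Fortran include directory (netcdf.inc / netcdf.mod), if present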

Regards,

Seb

jennyfisher commented 4 years ago

Hi Seb,

I'll ask, but I don't think that's it. When we load the netcdf/4.7.1 module, it includes the "netcdf.inc" file under $NETCDF_ROOT/include/Intel/netcdf.inc

That also wouldn't explain the missing "mpi.h", which has a similar home in $OPENMPI_ROOT/include/.
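
A quick sanity check along those lines, using the module-provided environment variables mentioned above, would be:

    ls $NETCDF_ROOT/include/Intel/netcdf.inc
    ls $OPENMPI_ROOT/include/mpi.h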

Thanks, Jenny

sdeastham commented 4 years ago

Hmm. Looking at the compile log for Gadi, I see a number of strange things:

Can you check the file /scratch/m19/GCHP_files/gchp.ifort18_openmpi_nci.env? It should have lines like the following if it doesn't already:

export ESMF_COMM=openmpi
export COMPILER=intel
export GC_BIN=$NETCDF_ROOT/bin/Intel       # Assuming this is where nc-config and nf-config are?
export GC_INCLUDE=$NETCDF_ROOT/include/Intel
export GC_LIB=$NETCDF_ROOT/lib/Intel       # This may be lib64, rather than lib - worth checking
export MPI_ROOT=$OPENMPI_ROOT

Apologies if this is all stuff you've already been over!

jennyfisher commented 4 years ago

Hi Seb,

These are the changes that NCI help originally made for us to get GCHP to compile on the previous supercomputer (Raijin), which is set up in the same way as Gadi. Because they use compiler wrappers, using the openmpi option for ESMF_COMM couldn't find any of the necessary library files... so he made a new ESMF_COMM option specific to NCI. Basically, on our system we need to not specify paths directly and "trust" the compilers to get the flags right. I think you were on the email thread about this back in March and said the changes looked OK... I can re-send those updates if you'd like?

Cheers, Jenny

philotrum commented 4 years ago

Hi Seb,

Here is the environment file we are using. Any help is appreciated, even if we may have been over it before!

gchp.ifort18_openpmi_nci.env.txt

Cheers,

Graham

sdeastham commented 4 years ago

Hi Jenny,

Got it (and thanks for the reminder!). This unfortunately does complicate things - by which I mostly mean that it knocks the bottom out of a couple of my theories. I understand your earlier comment better now. If I'm following it correctly, the mpif.h and netcdf.inc errors occur because, for certain operations, the Makefile isn't relying on the MPI wrappers and is assuming that the MPI and NetCDF include files are still specified the "old fashioned" way. As mentioned, this makes it very weird that GCHP still compiles, but it also means that I'm not particularly surprised that the executable fails.

If there was an easy way for me to try things directly I’d be happy to run a couple of experiments of my own on NCI, but a quick browse suggests that that is not straightforward! I realize also that you are stuck with 12.2.0, so my usual go-to of “try a more recent version” isn’t going to be much help. Can you send the compile log which shows the issue with ESMF_LapackBlas.inc? I don’t see that issue in compile_gadi.txt. I have a gut feeling that this might be related to some confusion where system GNU compilers are incorrectly doing some of the compilation work, but it’s nothing more than a hunch.

Regards,

Seb


jennyfisher commented 4 years ago

Hi Seb,

Thanks! The Lapack error is in that compile file -- search for "cp: cannot stat '../include/ESMF_LapackBlas.inc': No such file or directory"

We could try using the newest GCHP version and applying the patch from before to our Makefiles in this version. Is there any reason to think that wouldn't work?

Cheers, Jenny

benmenadue commented 4 years ago

Hi,

All of the "fatal error" messages that I saw in their compile log is a result of a custom Perl script trying to parse the source files and generate a dependency file. This doesn't work on our systems as the compiler- and MPI-dependent such files (e.g. mpif.h and netcdf.inc, or any .mod file) won't be easily found on the filesystem without going through the compiler.

This is because the libraries we provide on Raijin and Gadi are installed in such a way as to support multiple different compilers and MPI libraries from a single installation prefix. The compilers and linkers know how to parse the loaded environment modules and work out which flavour and version of the libraries are being used, and thus which headers and modules are needed and where they live.

This works fine for standard UNIX build tools (e.g. autoconf), which follow the normal philosophy of "check if it works" first. If they need to test whether a header or module is available, they just invoke the compiler with a test program that includes those files. Unfortunately, if you're writing your own build tools, you need to take this into account yourself.
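
As a rough illustration of that "check if it works" approach (the conftest file name and test content are assumptions on my part), a build tool would simply ask the compiler to compile a tiny test program and see whether it succeeds:

    echo "program conftest"       >  conftest.f90
    echo "include 'netcdf.inc'"   >> conftest.f90
    echo "end program conftest"   >> conftest.f90
    ifort -c conftest.f90 && echo "netcdf.inc was found by the compiler"
    rm -f conftest.f90 conftest.o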

Fortunately, this doesn't appear to actually be causing an issue -- it's still able to compile the program; it's just that Make doesn't know about the dependencies on system headers (but it doesn't need to know this anyway, as they're already there).

In terms of the runtime error, it's something different, but unfortunately there's not a lot of detail -- the program is explicitly calling MPI_Abort on rank 33:

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 33 in communicator MPI_COMM_WORLD
with errorcode 25051332.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

but there's nothing in the gchp.log that would suggest why...

Cheers, Ben

jennyfisher commented 4 years ago

Thanks, Ben, for jumping in. This side of things is well beyond our expertise, so it's helpful to have you two talking to one another!!

In terms of where it fails, it seems to be in the ESMF code, as you can see from the traceback pasted a few comments back (or, Ben, you can find it in the GCHP-normal.eXXXX file in Graham's run directory).

benmenadue commented 4 years ago

Yep, I saw that, I just wasn't sure how much of those files I was allowed to share :-).

jennyfisher commented 4 years ago

Share anything you like if it helps us get things running on Gadi! :)

sdeastham commented 4 years ago

Hi Jenny, Ben, Graham,

Ah! I understand now. Regarding your earlier question - it certainly might be worth trying GCHP 12.6.2. It's meant to be easier to work with, in large part because of the transition to ESMF v8 (I'd recommend just immediately replacing the ESMF version in GCHP with ESMF v8.0.0 to avoid a couple of known issues). I'm reluctant to push that too hard since 1) you compile successfully and 2) the next version of GCHP will leapfrog yet more of these issues, because the NASA MAPL code within it dumps GNUmake in favor of a more logical CMake-based build system.

That all having been said, I think I need to take a step back and see if I can completely clarify the issue. From what I gather:

  1. GCHP 12.2 builds OK, following some changes; a slew of errors are reported during the build, but this does not prevent compilation and it is not clear whether these are "real" issues.
  2. The executable that is produced DOES run. But then one of the following 3 things happens:
     a) An error in UpdateBracketTime is reported (https://github.com/geoschem/geos-chem/issues/143#issuecomment-558461754)
     b) An MPI abort message is sent (https://github.com/geoschem/geos-chem/issues/143#issuecomment-559621015)
     c) A traceback is printed which points to ESMF (https://github.com/geoschem/geos-chem/issues/143#issuecomment-559341770)

Are all three of these still true, or only some of them? If a) is still happening, then we may actually have been on a bit of a wild goose chase, as an UpdateBracketTime error is more usually the result of an issue with input files.

Regards (and thanks for your patience!),

Seb

jennyfisher commented 4 years ago

Thanks Seb! To answer your questions:

  1. Yes

2a. Sort of? I can't tell if it is an error or not, but it is the last thing printed. Here is the end of the log file:

>> Reading  ALBEDO from ./MetDir/%y4/%m2/GEOSFP.%y4%m2%d2.A1.025x03125.nc
 DEBUG: Scanning template ./MetDir/%y4/%m2/GEOSFP.%y4%m2%d2.A1.025x03125.nc for side L
 >> >> >> Target time   : 2016-07-01 00:30:00
 >> >> >> Reference time: 1985-01-01 00:00:00
 DEBUG: Untemplating ./MetDir/%y4/%m2/GEOSFP.%y4%m2%d2.A1.025x03125.nc
 >> >> >> Target time   : 2016-07-01 00:30:00
 >> >> >> File time     : 2016-07-01 00:00:00
 >> >> >> Frequency     : 0000-00-01 00:00:00
 >> >> >> N             : 11504 
 DEBUG: Propagating forwards on ./MetDir/%y4/%m2/GEOSFP.%y4%m2%d2.A1.025x03125.nc from reference time
 >> >> >> Reference time: 1985-01-01 00:00:00
UpdateBracketTime                             2184  
EXTDATA::Run_                                 1245  
MAPL_Cap                                       777   
===> Run ended at Fri Nov 29 09:16:18 AEDT 2019

2b. Yes - from the job error file:

Currently Loaded Modulefiles:
 1) intel-compiler/2019.3.199   2) openmpi/3.1.4   3) netcdf/4.7.1  
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 63 in communicator MPI_COMM_WORLD
with errorcode 25051332.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

2c. Yes - again from the job error file:

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   000014B5F2889555  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  000014B5F00EBD80  Unknown               Unknown  Unknown
libc-2.28.so       000014B5EFBFD939  __xstat64             Unknown  Unknown
libifcoremt.so.5   000014B5F288F3ED  for_inquire           Unknown  Unknown
geos               0000000000864A6A  Unknown               Unknown  Unknown
geos               000000000085EBC8  Unknown               Unknown  Unknown
libesmf.so         000014B5F4795507  Unknown               Unknown  Unknown
libesmf.so         000014B5F4798903  Unknown               Unknown  Unknown
libesmf.so         000014B5F4B55945  Unknown               Unknown  Unknown
libesmf.so         000014B5F479659D  Unknown               Unknown  Unknown
libesmf.so         000014B5F4D6184D  esmf_compmod_mp_e        1199  ESMF_Comp.F90
libesmf.so         000014B5F4FA13C6  esmf_gridcompmod_        1889  ESMF_GridComp.F90
geos               0000000000851B5C  Unknown               Unknown  Unknown
geos               0000000000703D3C  MAIN__                     38  GEOSChem.F90
geos               000000000040F2B2  Unknown               Unknown  Unknown
libc-2.28.so       000014B5EFB34813  __libc_start_main     Unknown  Unknown
geos               000000000040F1BE  Unknown               Unknown  Unknown

Let us know any ideas for next steps. And a quick question - when you say the next version will have even fewer issues... what is the projected release date for that? Given we haven't done any real work with GCHP yet, perhaps we are best off waiting - though I'd like to at least be at the stage of running some test runs in Q1 of 2020...

Cheers, Jenny

benmenadue commented 4 years ago

Hi @jennyfisher / @philotrum -- try building with -g -traceback added to the compiler flags (both C and Fortran, if you're using the Intel compilers for both). You might get extra info out of the traceback, and it doesn't have any performance impact (it just uses a little bit more disk space).

That said, the SIGTERM causing the backtrace is almost certainly just the result of the call to MPI_Abort (this is how MPI cleans up the other processes when something goes wrong), so I wouldn't put too much emphasis on it yet. That MPI_Abort call is the real killer...

jennyfisher commented 4 years ago

Hi Ben - I'm pretty sure -traceback is included by default (and it shows up in the compile.log file). Not -g though -- is that the C one? I don't think we have compiled C code here but I could be wrong; I'll let Seb comment there!

sdeastham commented 4 years ago

Hi Jenny,

Stupid question, but ./MetDir/2016/07/GEOSFP.20160701.A1.025x03125.nc definitely exists, right? The error I'm seeing there is typically one which would result when that file is missing. This is partially speculative (based on a quick dig through the v12.2.0 GCHP code at https://github.com/geoschem/gchp/blob/12.2.0/Shared/MAPL_Base/MAPL_ExtDataGridCompMod.F90) but I figured I should ask just in case!
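
A quick check from the run directory (paths as in the message above) would be:

    readlink -f MetDir                                     # where does the symlink actually point?
    ls -l MetDir/2016/07/GEOSFP.20160701.A1.025x03125.nc   # does the expected file exist?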

As for the newer versions which have fewer issues, 12.6.2 is out now; it has the improved ESMF but still a rather user-unfriendly MAPL. The version with the newer MAPL (which is expected to have a much better build system, courtesy of Lizzie Lundgren and Liam Bindle) is in the works and in fact has a prototype available, but is not scheduled until v13.0. If we continue to have difficulties here, it might be worth jumping the gun on that though.

Regards,

Seb

philotrum commented 4 years ago

Hi Seb,

I just had a look, and it looks like the met path may be wrong. Sometimes it is worth asking the stupid question! The path is a bit odd, with a symbolic link inside the symbolic link directory. I will look into getting this straightened out and then try running again and see what happens.

Cheers,

Graham

benmenadue commented 4 years ago

@jennyfisher -g is a fairly universal (i.e. all compilers and languages) flag that means "add debugging information". The only reason to not use it is if you're space conscious -- it doesn't change the generated machine code at all, just includes an extra section in the binary with details of names, etc, and how the various parts of the machine code relate to lines in the source code.

-traceback is an Intel-specific flag that adds the information needed to print traceback details -- essentially, lines in your tracebacks that show Unknown usually come from source files that weren't built with this flag. You generally only see it when building Fortran code, but it can also be used when building C source that you intend to link into a Fortran program. EDIT: as with -g, it has no performance impact, it just increases the size of the binaries.
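
As a minimal illustration (the file name is hypothetical), both flags just get added to the normal compile line:

    ifort -g -traceback -O2 -c my_module.F90 -o my_module.o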

philotrum commented 4 years ago

Ok. It looks like my MetDir link may have been the source of the problem. I have pointed it to the correct directory now, and have submitted a run on the queue. I will let you know how it goes.

sdeastham commented 4 years ago

That's great news! Fingers crossed. I'm afraid that the older GCHP versions had some pretty horrendously obscure error reporting. The good news is that it's better (although by no means perfect) in the newer versions, and this is something we're now working to improve further through a NASA grant.

philotrum commented 4 years ago

I had such high hopes! I started a build as I left work last night, and came in this morning to find that it hadn't built geos. Here is the compile.log file; I will start digging around to see what I can find. compile.log.txt

philotrum commented 4 years ago

I will try it with openmpi 4.0.1, as the last successful build I did used it, and at some point during this process I changed it to 3.1.4. I am comparing the successful compile.log file with the failed one from last night; that is the most obvious place to start.

lizziel commented 4 years ago

Updating the MetDir link does not require a rebuild. If you lost the previous executable then yes, recreate exactly the settings used for generating the previously successful build.

philotrum commented 4 years ago

I tried running it yesterday and it failed. I thought that this might be due to not rebuilding, so I did a superclean and build, which then failed. Here is the .e file from the run yesterday. I am guessing that superclean removes the gchp.log file, as it is not there.

GCHP-normal.e247052.txt

lizziel commented 4 years ago

The environment shown in the log indicates OpenMPI 3.1. If your executable was built with OpenMPI 4 instead, this would be a problem. How are you setting your environment when you submit a run? The sample run scripts source the run directory symlink gchp.env to avoid a mismatch between build and run environments (gchp.env is sourced when building with the makefile as well).
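
In a run script that would amount to something like the following before the MPI launch (assuming gchp.env is the run directory symlink described above):

    source gchp.env   # load the same compiler/MPI/netCDF modules used for the build
    module list       # confirm the OpenMPI version matches the one the executable was built with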

philotrum commented 4 years ago

I just successfully compiled using openmpi 4.0.1. gchp.env in my setup is a symlink to a custom environment file for Gadi. This was set up by the NCI team for the old supercomputer, Raijin, and I copied it over and modified it for the new computer, Gadi. gchp.ifort18_openpmi_nci.env.txt

sdeastham commented 4 years ago

Great! Do let us know if/how the simulation goes!

philotrum commented 4 years ago

As expected, I had the same errors as the last run. Jenny noticed that the resolution specified in the ExtData.rc file is also set to 025x03125. We are running at 2x25 resolution, so the files will all be missing, as the input is looking for the high-res data, which we don't have. This may not be the whole problem, but it has to be a problem.
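
One way to confirm how widespread the mismatch is (run from the run directory; the patterns follow the file naming seen above):

    grep -c "025x03125" ExtData.rc   # entries still pointing at 0.25x0.3125 met files
    grep -c "2x25"      ExtData.rc   # entries already pointing at 2x2.5 met files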

I am not sure how the resolution is set when setting up a run. I followed the wiki to create the run directory and don't remember anything about setting the resolution. How do I specify the model run resolution? Is the ExtData.rc file generated with the run directory, and if so, how do I set the resolution there? Can I copy an existing ExtData.rc file from another run directory that has the correct resolution and run without recompiling?

I think that we are closing in on the problem (maybe).

I am a newbie, so I could be making real rookie mistakes here. Thanks for the help!

philotrum commented 4 years ago

I have copied the ExtData.rc file from the run directory on raijin that was successful, and will try again using it. I will let you know how it goes.

jennyfisher commented 4 years ago

Hi all,

Sorry for my silence, I have been tied up in meetings. I have a hunch that fixing the file paths and names is going to fix things. I am so used to GC Classic, which tells you when a file is missing, that I skimmed over this. I see now that we were pointing to all the global high-resolution data and files, which we don't actually have here because we don't have the space at present. We had changed that on Raijin, but since we couldn't run the Raijin executable on Gadi, Graham went through the process of creating a run directory again from scratch and we missed updating this one.

I'm not sure if this is still how things are done in the newer versions, but as Graham suggests, it would be good if users could specify their input met resolution when creating the run directory (and all the related files like ExtData.rc) rather than assuming they'll be running with 0.25x0.3125 (although it could default to that if the user is unsure).

Hopefully this will fix the problems for now!

Cheers and thanks again for all the help, Jenny

sdeastham commented 4 years ago

Hi Jenny,

Notification of missing files is very much now a feature in GCHP, largely because of these kinds of difficulties - I agree that they are a giant pain.

Asking the user what resolution they want for their input could certainly work. I'm a little wary because we want to avoid the other outcome - where users are unwittingly running high-res simulations with low-res met data - but I've been spoiled by having continuous access to the high-res met. In the interim, a search-and-replace for "025x03125.nc" to "2x25.nc" in ExtData, along with changing the target of the MetDir symlink, should be sufficient. The following script should accomplish this, if supplied with the appropriate target directory:

#!/bin/bash
if [[ $# -ne 1 ]]; then
   echo "Must provide path to either GEOS_2x2.5/MERRA2 or GEOS_2x2.5/GEOSFP"
   exit 70
fi
if [[ -L MetDir ]]; then unlink MetDir; fi
# For GEOS-FP
sed -i "s/025x03125\.nc/2x25.nc/g" ExtData.rc
# For MERRA-2
#sed -i "s/05x0625\.nc/2x25.nc/g" ExtData.rc
ln -s $1 MetDir
exit 0

You could copy and paste the above code into a file in your run directory, say change_met.sh. Assuming you're using the standard directory structure, the following command should then automatically switch your ExtData.rc and MetDir to the 2x2.5 GEOS-FP input:

./change_met.sh $(readlink -f $( readlink -f MainDataDir )/../GEOS_2x2.5/GEOS_FP)

Good luck!

Regards,

Seb

lizziel commented 4 years ago

Regarding how to set the input meteorology resolution (for future reference), see the section of the "Running GCHP: Configuration" wiki page titled "Change Input Meteorology Grid Resolution and Source". I also suggest going to the top of that wiki page and browsing the table of contents to see all options. There may be other things you would like to configure that are detailed in that chapter of the manual.

More generally, it would help us if you always include the files gchp.env, compile.log, gchp.log, runConfig.sh, ExtData.rc, HEMCO_Config.rc, and any std err output files for every new run. See also this post on the wiki on what to include in support requests. Including the files usually results in us homing in on the issue faster.

philotrum commented 4 years ago

Good news! I have output! I am still getting some errors on exit, but it looks like it exited uncleanly after the run was finished. I think that this was the case on Raijin too. Here are all the files for the apparently successful run.

GCHP-normal.e253842.txt compile.log.txt HEMCO_Config.rc.txt HEMCO.log.txt cap_restart.txt input.geos.txt lastbuild.txt gchp.ifort18_openpmi_nci.env.txt gchp.log.txt

philotrum commented 4 years ago

Thanks again for all your help. There is no way I would have managed to get here without it.