[BUG/ISSUE] Standard run crashes w/no error

arianatribby commented 2 years ago

What institution are you from?

California Institute of Technology

Description of the problem

I am a new user for gchp. I am trying to run a test run for standard chemistry, 4x5 grid, for 1 hour time simulation. I've checked the docs and also tried debug mode, but the output was too large - there was an error with a "scalar" with the output. So I recompiled back to regular mode.

The issue is that the run starts and then quits without an error message. The run exits before time allocation runs out.

My run steps via terminal: 1) srun --pty -t 02:00:00 -n 6 -N 1 --mem 32 /bin/bash -l (spackages are loaded from bashrc) 2) screen 3) source activate gchp_env 4) mpirun -np 6 ./gchp > term_output.txt 5) control+a+d to detach 6) top -u usrname

I monitor the run via top -u and I can see there are adequate processes allocated. The cpu oscillates from 0-90% and the memory stays at 0%. None of them change before all the processes disappear.

The logfile.000000.out ended with "tracer manager unknown". The allPEs.log file ended with "0000: MAPL.GENERIC: INFO: Started the 'Initialize' stage of the gridded component 'GCHPchem'" . The terminal output right before the run quits:

$$ Finished Reading Linoz Data $$
SET_TIMESTEPS: setting GEOS-Chem timesteps!
-------------------------------------------
Chemistry  Timestep [sec] :   1200
Convection Timestep [sec] :    600
Dynamics   Timestep [sec] :    600
Emission   Timestep [sec] :   1200
Unit Conv  Timestep [sec] :    600
Diagnostic Timestep [sec] :      0
Radiation  Timestep [sec] :  10800

GEOS-Chem version

gchp 13.3.4

Description of code modifications

No modifications to the model code.

Log files

runConfig.sh: runConfig.sh.txt
logfile.000000.out logfile.000000.out.txt
allPEs.log allPEs.log.txt
term_output.txt (terminal output log) term_output.txt

Software versions

Compilers (Intel or GNU, and version): gcc 9.3.0
NetCDF version: 4.5.3
MPI version: openmpi 4.1.2
ESMF version: 8.0.1

lizziel commented 2 years ago

Hi @arianatribby, if the model crashed there should be error messages about MPI somewhere and a traceback. Try sending both standard output and standard error to the file by adding 2>&1 to your run command: mpirun -np 6 ./gchp > term_output.txt 2>&1

That being said, you may be running out of memory. Try --mem 110G. If you only have 32G that would not be enough.

Regarding building with debug, that capability is not functional in 13.3.4 but will be available in 14.0. If you are just getting started with GCHP I highly recommend switching to 14.0 as soon as it is released this summer. It features a newer version of MAPL (software which handles I/O) with improved error handling.

lizziel commented 2 years ago

Hi @arianatribby, were you able to fix the problem?

arianatribby commented 2 years ago

Thanks, @lizziel your help allowed me to get past an initial issue but now I am back to the buffer problem. (Note that I am not in debug mode.) Here is the end of the terminal output:

EXTDATAMAPL_GenericInitialize=  4.954E+05  1.520E+02
        GENERIC: INFO: Finished the 'Initialize' stage of the gridded component 'EXTDATA'
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  4.967E+05  1.520E+02
        GENERIC: INFO: Started the  'Run' stage of the gridded component 'EXTDATA'
 /home/atribby/GCHP/src/pFlogger/src/DynamicBuffer.F90          43
 DynamicBuffer::grow_record_size() - exceeded maximum permitted buffer size.
STOP 1
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[53947,1],4]
  Exit code:    1
--------------------------------------------------------------------------

And here is the end of the logfile.000000.out

Cubic: cubed-sphere domain decomposition:     1 X    1
   pe,   is,  ie,  js,  je,    isd, ied, jsd, jed

Cubic: cubed-sphere domain decomposition:     1 X    1
   pe,   is,  ie,  js,  je,    isd, ied, jsd, jed

================================================================================
TRACER_MANAGER_MOD
unknown

The ending of the allPEs.log

0000: MAPL.GENERIC: INFO: Finished the 'Initialize' stage of the gridded component 'EXTDATA'
0000: MAPL.GENERIC: INFO: Started the  'Run' stage of the gridded component 'EXTDATA'

I appreciate your time and help.

sdeastham commented 2 years ago

Hi @arianatribby - it looks to me like this message:

/home/atribby/GCHP/src/pFlogger/src/DynamicBuffer.F90 43 DynamicBuffer::grow_record_size() - exceeded maximum permitted buffer size. STOP 1

is the most relevant one, but I'm not sure what would cause that. I'm pinging @tclune who is the author of pFlogger and might be able to comment?

tclune commented 2 years ago

This is possibly some problem in the format passed to pflogger. If the actual call to pflogger can be isolated, I'd have a better chance at diagnosing.

Basically, pflogger needs to know how large of a buffer is needed for a given INFO, DEBUG, etc log message. Rather than hardwiring a large buffer, it will try something modest and keep doubling until the buffer is large enough for the requested write statement. An improper format statement could be causing some other failure that pFlogger is interpreting as inadequate buffer space and thus giving a somewhat misleading error message.

The fact that we are not gettin a full stack trace suggests that the GEOS layer (or GCHP?) making the call is not correctly trapping this error, and that should be fixed as well. Again, need to know where to look first though. :-(

arianatribby commented 2 years ago

Hi @tclune, I'm having a bit of trouble answering your question about where the call to pflogger is happening. I've found that CodeDir/src/GCHP_GridComp/CMakeLists.txt calls pflogger but I don't see a path or dir that I could print that might be helpful. The above path seemed relevant to me because the allPEs.log file was printing "MAPL.GENERIC: INFO: Started the 'Run' stage of the gridded component 'EXTDATA'" before the error happened.

Also, I noticed that the error happens during GFED emissions. Here is the ending of the HEMCO.log file before the error:

   - Emit GFED species C2H2     as model species C2H2
      --> Will use scale factor:    1.00000000
      --> Will use scale field : none
   - Emit GFED species C2H4     as model species C2H4
      --> Will use scale factor:    1.00000000

So it is stopping at species C2H4 since "Will use scale field: none" doesn't print. But @sdeastham I am a little confused on how to find whether there is an issue with this specific species since the GFED4 nc file is organized by overall categories of biomass burning emissions rather than those individual species. Here is the relevant part of the HEMCO_config file where the species are read in order that they are printed in the log:

111     GFED                   : on    NO/CO/ALK4/ACET/MEK/ALD2/PRPE/C2H2/C2H4/C3H8/CH2O/C2H6/SO2/NH3/BCPO/BCPI/OCPO/OCPI/POG1/POG2/MTPA/BENZ/TOLU/XYLE/NAP/EOH/MOH/SOAP
    --> GFED4                  :       true

Thanks. Please let me know if there are any specific files you want or things that you think I should try.

tclune commented 2 years ago

@arianatribby - I understand that the ultimate bracketing will need to be on our end. But we need something that we can reproduce before we can do that. Possibly the above details combined with your input file will let us try.

lizziel commented 2 years ago

Hi @arianatribby, I am wondering if there is an input file or config issue that is not properly caught and escalating to trigger the pflogger error. Could you upload the following files?

ExtData.rc
HEMCO_Config.rc
logging.yml
gchp.log
HEMCO.log

When you say you are not in debug mode, do you mean you did not compile with -DCMAKE_BUILD_TYPE=Debug, or do you mean you did not enable DEBUG in logging.yml file for ExtData component?

arianatribby commented 2 years ago

Hi @lizziel, I've attached the files. I did not compile with -DCMAKE_BUILD_TYPE=Debug. I also did not change the logging.yml.

HEMCO_Config.rc.txt ExtData.rc.txt logging.yml.txt logfile.000000.out.txt

The HEMCO log file can be found here: https://caltech.box.com/s/tdaogktmnwrd2qkc7vun1i9kg3wjw6qg

Thanks!

lizziel commented 2 years ago

Hi @arianatribby, could you also provide the main GCHP log file? The file logfile.000000.out is primarily a log for advection which is not relevant to this issue. The main GCHP log file would be whatever file you are sending standard output to. In your original post it looks like you are naming it term_output.txt. We call it gchp.log in our example run scripts, hence my request using that name.

From the files you provided I see you don't have maximum prints on. Your original post said the output was too large, but I wonder if now that you fixed that by requesting more memory you can also output more to the log files. You can set verbose and warnings in HEMCO_Config.rc to 3 to get maximum information in the HEMCO log file. Also update logging.yml to print more info to allPes.logregarding inputs. Set CAP.EXTDATA entry for root_level to DEBUG. Then try rerunning. You do not need to recompile.

Another way to hone in on the problem is to try turning the GFED extension off. If the only problem is GFED then you should then be able to run successfully, and if there are multiple problems with GFED being one of them then you should be able to get further in the run.

arianatribby commented 2 years ago

Hi @lizziel , thanks! I re-ran using max print statements and re-attached all the files you asked for. ExtData.rc.txt HEMCO_Config.rc.txt logging.yml.txt gchp.log.txt HEMCO.log: https://caltech.box.com/s/se4f2ctdci69okbt50168fc38pz0s2kc

lizziel commented 2 years ago

Thanks! Could you also include the rest of the log files beyond HEMCO.log? The ones to get are allPEs.log (this contains the output of the logger you configured in logging.yml) and whatever log you have configured for standard output and standard error. In your earlier post you had that as term_output.txt.

lizziel commented 2 years ago

Sorry, just noticed you included gchp.log. All you need to post is allPEs.log. Thanks!

arianatribby commented 2 years ago

Sorry @lizziel here it is: allPEs.log.txt

arianatribby commented 2 years ago

Also, just want to confirm that both GEOS-Chem classic 13.0.0 and GCHP 13.3.4 should have spin up, like all other versions of geos chem? (Ie 1 year of spin up?)

lizziel commented 2 years ago

@tclune, does this new information shed any light on what the issue could be? The last print before the model crashed is this: call lgr%debug(' >> REFFTIME for %a~: %a',trim(item%file), '<'//trim(item%FileReffTime)//'>')

It is trying to print the file path and reference time for the very first file in ExtData.rc. The line is: ALBD 1 N Y F0;003000 none none ALBEDO ./MetDir/%y4/%m2/GEOSFP.%y4%m2%d2.A1.025x03125.nc

However, the resulting print is botched, with the filename truncated and no value for ref time: EXTDATA: DEBUG: >> REFFTIME for ./MetDir/%y4/%m2/GE: <>

As a reminder, the error message is:

 /home/atribby/GCHP/src/pFlogger/src/DynamicBuffer.F90          43
 DynamicBuffer::grow_record_size() - exceeded maximum permitted buffer size.
STOP 1

I checked the traceback for that in our version of pFlogger and it is here which traces to code in FormatString.F90 (function handleScalar).

Any ideas? For reference, this is pFlogger 1.5.0.

tclune commented 2 years ago

@lizziel Despite that I said in my email, I have a few moments now, and the browser was still open to this issue. My first thought is that this is gfortran and it is losing the length of one of the strings while pFlogger is transporting them around as class(*) objects.

The truncation is merely a side-effect of the error about exceeding the maximum buffer size.

One workaround to attempt would be to introduce string variables before the debug() call. These would hold trim(item%file) and trim(item%FileReffTime). I suspect it is the use of expressions that is tripping the gfortran bug.

Using a more recent gfortran may also just make the issue go away. But if this is Intel, my explanation would seem less likely. In any event, please let me know which compiler/version is showing these symptoms.

lizziel commented 2 years ago

The libraries are: Compilers (Intel or GNU, and version): gcc 9.3.0 NetCDF version: 4.5.3 MPI version: openmpi 4.1.2 ESMF version: 8.0.1

However, I just tried to reproduce the problem using gfortran 9.3.0 + OpenMPI and did not run into the issue.

@arianatribby, would you be able to upgrade to GCHP 14.0.0 and try again? With that version you can run in debug mode.

lizziel commented 2 years ago

@arianatribby, I should add that 14.0 has not officially been released but will be very soon. You can use tag 14.0.0-rc.1. Documentation for the version is here (that is the docs/dev version of GCHP ReadTheDocs).

lizziel commented 2 years ago

@arianatribby, were you able to resolve this issue?

arianatribby commented 2 years ago

Hi @lizziel, thanks so much for your time on this issue but to expedite my research I decided to run GCClassic on AWS, which I did not have issues running. Hopefully I can revisit GCHP in the future!

lizziel commented 2 years ago

Okay, I'm glad you at least have a workaround. If you decide to try GCHP again, either on your cluster or on AWS, let us know if you run into any issues or have questions.

geoschem / GCHP