geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] Crashes when writing both StateMet_avg and StateMet_inst #12

Closed: JiaweiZhuang closed this issue 5 years ago

JiaweiZhuang commented 5 years ago

After the fix in https://github.com/geoschem/gchp/issues/6#issuecomment-447475255, one issue remains: with both StateMet_avg and StateMet_inst turned on, the run crashes at 01:00 when writing the first diagnostics.

Tested with GC version 12.1.1, the GCHP branch bugfix/GCHP_issues, and OpenMPI 3.

Those cases can finish and print full timing info:

Those cases crash:

Not a critical problem, as I can just turn off StateMet for either tutorial or benchmark purposes. Will proceed to make the tutorial AMI.
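For anyone trying to reproduce this: diagnostic collections are switched on and off in the run directory's HISTORY.rc, where commenting out an entry in the COLLECTIONS list disables that collection. The excerpt below is a sketch of the relevant section in the standard MAPL History format, not the exact file shipped with 12.1.1:

```
COLLECTIONS: 'SpeciesConc_avg',
             'SpeciesConc_inst',
             #'StateMet_avg',     # commented out = collection disabled
             #'StateMet_inst',
::
```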

JiaweiZhuang commented 5 years ago

BTW the run crashes at 00:10 (first time step) when all diagnostics are turned on: run_all_diag.log

JiaweiZhuang commented 5 years ago

Made a new AMI with GC-classic + GCHP 12.1.1 and OpenMPI 3: ami-06f4d4afd350f6e4c (GEOSChem_with_GCHP_12.1.1_tutorial_20181216). Will use it as the new tutorial.

The default case runs without problems, but if you turn on StateMet_avg and StateMet_inst it will crash at 01:00.

lizziel commented 5 years ago

I suspect this has to do with the deallocation/nullification of State_Met in GCHP. One of us will take a look.



yantosca commented 5 years ago

I am running on the AWS cloud (r5.2xlarge instance) with the new AMI (2018/12/16) that Jiawei made. The only diagnostic collection that I am archiving is StateMet_avg. When I archive 3 fields: Met_AIRDEN, Met_AIRVOL, the run dies right at 01:00:00 without writing anything to HISTORY.

AGCM Date: 2016/07/01  Time: 00:50:00
                                             Memuse(MB) at MAPL_Cap:TimeLoop=  4.657E+03  4.417E+03  2.236E+03  2.614E+03  0.000E+00
                                                                      Mem/Swap Used (MB) at MAPL_Cap:TimeLoop=  1.867E+04  0.000E+00
 offline_tracer_advection
 GEOS-Chem phase           -1 :
 DoConv   :  T
 DoDryDep :  F
 DoEmis   :  F
 DoTend   :  F
 DoTurb   :  T
 DoChem   :  F
 DoWetDep :  T

     ### Species Unit Conversion: v/v dry -> kg/kg dry ###
  --- Do convection now
  --- Convection done!
  --- Do turbulence now
     ### Species Unit Conversion: kg/kg dry -> v/v dry ###
     ### VDIFFDR: VDIFFDR begins
     ### VDIFFDR: after emis. and depdrp
     ### VDIFFDR: before vdiff
     ### VDIFF: vdiff begins
     ### VDIFF: diffusion begins
     ### VDIFF: compute free atmos. diffusion
     ### VDIFF: pbldif begins
     ### VDIFF: after pbldif
     ### VDIFF: starting diffusion
     ### VDIFF: vdiff begins
     ### VDIFF: diffusion begins
     ### VDIFF: compute free atmos. diffusion
     ### VDIFF: pbldif begins
     ### VDIFF: after pbldif
     ### VDIFF: starting diffusion
     ### VDIFF: vdiff begins
     ### VDIFF: diffusion begins
     ### VDIFF: compute free atmos. diffusion
     ### VDIFF: pbldif begins
     ### VDIFF: after pbldif
     ### VDIFF: starting diffusion
     ### VDIFF: vdiff begins
     ### VDIFF: diffusion begins
     ### VDIFF: compute free atmos. diffusion
     ### VDIFF: pbldif begins
     ### VDIFF: after pbldif
     ### VDIFF: starting diffusion
     ### VDIFFDR: after vdiff
     ### VDIFFDR: VDIFFDR finished
     ### DO_PBL_MIX_2: after VDIFFDR
     ### DO_PBL_MIX_2: after AIRQNT
     ### Species Unit Conversion: v/v dry -> kg/kg dry ###
     ### Species Unit Conversion: kg/kg dry -> kg/m2 ###
     ### Species Unit Conversion: kg/m2 -> kg/kg dry ###
  --- Turbulence done!
     ### Species Unit Conversion: kg/kg dry -> v/v dry ###
     ### Species Unit Conversion: v/v dry -> kg/kg dry ###
  --- Do wetdep now
     ### DO_WETDEP: before LS wetdep
     ### Species Unit Conversion: kg/kg dry -> kg/m2 ###
     ### Species Unit Conversion: kg/m2 -> kg/kg dry ###
     ### DO_WETDEP: after LS wetdep
  --- Wetdep done!
     ### Species Unit Conversion: kg/kg dry -> v/v dry ###
 AGCM Date: 2016/07/01  Time: 01:00:00

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x146f034ed2da in ???
#1  0x146f034ec503 in ???
#2  0x146f02929f1f in ???
... etc ...
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-93-224 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

But when I archive only Met_AD and Met_AIRVOL (the 1st 2 fields), then the run finishes OK and prints out all timing info. The log file still says though:

mpirun has exited due to process rank 3 with PID 0 on
node ip-172-31-93-224 exiting improperly.

But that message also appeared in runs that finished successfully on AWS. (This also triggered a core-dump on Odyssey, which might be a SLURM issue as described in #11.)

So it seems that adding one more diagnostic export somehow triggers a crash without writing anything to disk. Maybe this is a memory issue with MPI? Don't know. Maybe we haven't maxed out the environment settings (though I think gchp.env accounts for that), or there isn't enough memory on the AWS instance. But r5.2xlarge has 64 GB of RAM, so that should be enough.

Ideas? ...

lizziel commented 5 years ago

It seems odd that it would be a memory issue, since Jiawei was able to output all diagnostics except StateMet. Are you able to output any combination of two StateMet fields but not three? Is it the same for both StateMet_avg and StateMet_inst? Do you get the same results for a 2-hr duration run with 1-hr diagnostics?

-- Lizzie Lundgren Scientific Programmer GEOS-Chem Support Team geos-chem-support@as.harvard.edumailto:geos-chem-support@as.harvard.edu http://wiki.geos-chem.org/GEOS-Chem_Support_Team

Please direct all GEOS-Chem support issues to the entire GEOS-Chem Support Team at geos-chem-support@as.harvard.edumailto:geos-chem-support@as.harvard.edu. This will allow us to serve you better.


JiaweiZhuang commented 5 years ago

Interestingly, running GCHP inside a container fixes this problem: run_gchp_singularity.log. It writes all 4 collections without problems and prints full timing info at the end. That's really weird, because the container uses exactly the same libraries as the AMI...

I have an entirely new chapter on using containers: https://cloud-gc.readthedocs.io/en/latest/chapter03_advanced-tutorial/container.html. Will send an email with more details.

lizziel commented 5 years ago

Interesting. Could you try a few more times, including with all diagnostics on, to make sure this isn’t an occasional error?


yantosca commented 5 years ago

Also, I was just now able to run a 2-hr simulation saving out only the StateMet_inst collection, with all fields archived.

JiaweiZhuang commented 5 years ago

to make sure this isn’t an occasional error?

The error on AWS AMI happens consistently in repeated runs...

yantosca commented 5 years ago

Very weird: I also was just now able to run a 2-hr simulation saving out all fields of StateMet_avg.

Going to try compiling and running again from scratch.

lizziel commented 5 years ago

The reports make it sound like this issue is not consistent on the AMI. Bob, could you test for a little while and then post a report so that it is easier to track? I think Jiawei's original issue was that the error only happened when both StateMet_inst and StateMet_avg were output.


yantosca commented 5 years ago

I ran a grid of GCHP runs directly on the AMI (no containers), turning on various combinations of SpeciesConc_{avg,inst} and StateMet_{avg,inst}. As you can see, we get a variety of results.

Consistent results:
- Saving out SpeciesConc_avg, SpeciesConc_inst, StateMet_avg, and StateMet_inst dies at 00:10
- Saving out StateMet_avg and StateMet_inst dies at 01:00

But runs with at least one of the SpeciesConc collections plus both StateMet collections seem to finish just fine, as do runs with one StateMet collection but not the other.

NOTE: SC=SpeciesConc, SM=StateMet

AMI      : GEOSChem_with_GCHP_12.1.1_tutorial_20181216 (ami-06f4d4afd350f6e4c)
Instance : r5.2xlarge

All runs were done in the AMI, no containers used.

SC_avg SC_inst SM_avg SM_inst  Result               NOTES
=====================================================================================
  ON     ON      ON     ON     DIED      @ 00:10    All fields of SC and SM requested
  OFF    ON      ON     ON     Finished* @ 02:00    All fields of SC and SM requested      
  ON     OFF     ON     ON     Finished* @ 02:00    All fields of SC and SM requested
  OFF    OFF     ON     ON     DIED      @ 01:00    All fields of SM requested
  OFF    OFF     OFF    ON     Finished* @ 02:00    All fields of SM requested
  OFF    OFF     ON     OFF    Finished* @ 02:00    All fields of SM requested                 
  ON     ON      OFF    OFF    Finished* @ 02:00    All fields of SC requested

Reruns:
  ON     ON      ON     ON     DIED      @ 00:00    Died at MAPL_ExtDataInterpField l. 3240 
  OFF    OFF     ON     ON     DIED      @ 01:00    Segmentation fault

Finished* = The run finished normally and saved out all output files, 
and printed out timing info down to ExtData...but also printed this 
message to the stderr output.

 Backtrace for this error:
 #0  0x150792b682da in ???
 #1  0x150792b67503 in ???
 ... etc ...
 #19  0xffffffffffffffff in ???
 --------------------------------------------------------------------------
 mpirun has exited due to process rank 3 with PID 0 on
 node ip-172-31-93-224 exiting improperly. There are three reasons this could occur:

 1. this process did not call "init" before exiting, but others in
 the job did. This can cause a job to hang indefinitely while it waits
 for all processes to call "init". By rule, if one process calls "init",
 then ALL processes must call "init" prior to termination.

 2. this process called "init", but exited without calling "finalize".
 By rule, all processes that call "init" MUST call "finalize" prior to
 exiting or it will be considered an "abnormal termination"

 3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
 orte_create_session_dirs is set to false. In this case, the run-time cannot
 detect that the abort call was an abnormal termination. Hence, the only
 error message you will receive is this one.

 This may have caused other processes in the application to be
 terminated by signals sent by mpirun (as reported here).

 You can avoid this message by specifying -quiet on the mpirun command line.
 --------------------------------------------------------------------------

I guess we shouldn't worry about this too much if the runs can consistently work within a container. That may be the best solution going forward.

JiaweiZhuang commented 5 years ago

Thanks for the thorough tests. We should probably mark it as a long-term issue... Other diagnostics collections might have problems as well: turning on all collections clearly crashes the run (https://github.com/geoschem/gchp/issues/12#issuecomment-447531664); not sure which one is causing the problem.

yantosca commented 5 years ago

I believe that the root cause of this issue is #15.

lizziel commented 5 years ago

I am closing this issue since it is resolved by the MAPL update in 12.5.