Closed: JiaweiZhuang closed this issue 5 years ago.
BTW the run crashes at 00:10 (first time step) when all diagnostics are turned on: run_all_diag.log
Made a new AMI with GC-classic + GCHP 12.1.1 and OpenMPI3: ami-06f4d4afd350f6e4c (GEOSChem_with_GCHP_12.1.1_tutorial_20181216). Will use it as the new tutorial.
The default case runs without problems, but if you turn on StateMet_avg and StateMet_inst it will crash at 01:00.
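For reference, collections like these are toggled in the run directory's HISTORY.rc. A sketch of the relevant fragment (field list abridged, and exact entries vary by version, so treat this as illustrative only):

```
COLLECTIONS: 'SpeciesConc_avg',
             'SpeciesConc_inst',
             'StateMet_avg',
             'StateMet_inst',
::

StateMet_avg.template:  '%y4%m2%d2_%h2%n2z.nc4',
StateMet_avg.format:    'CFIO',
StateMet_avg.frequency: 010000,
StateMet_avg.duration:  010000,
StateMet_avg.mode:      'time-averaged',
StateMet_avg.fields:    'Met_AD'    , 'GCHPchem',
                        'Met_AIRDEN', 'GCHPchem',
                        'Met_AIRVOL', 'GCHPchem',
::
```

Commenting a collection out of the COLLECTIONS list disables it; the crash reported here appears when both StateMet entries are left active.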
I suspect this has to do with the deallocation/nullification of State_Met in GCHP. One of us will take a look.
I am running on the AWS cloud (r5.2xlarge instance) with the new AMI (2018/12/16) that Jiawei made. The only diagnostic collection that I am archiving is StateMet_avg. When I archive 3 fields: Met_AIRDEN, Met_AIRVOL, then the run dies right at 01:00:00 without writing anything to HISTORY.
AGCM Date: 2016/07/01 Time: 00:50:00
Memuse(MB) at MAPL_Cap:TimeLoop= 4.657E+03 4.417E+03 2.236E+03 2.614E+03 0.000E+00
Mem/Swap Used (MB) at MAPL_Cap:TimeLoop= 1.867E+04 0.000E+00
offline_tracer_advection
GEOS-Chem phase -1 :
DoConv : T
DoDryDep : F
DoEmis : F
DoTend : F
DoTurb : T
DoChem : F
DoWetDep : T
### Species Unit Conversion: v/v dry -> kg/kg dry ###
--- Do convection now
--- Convection done!
--- Do turbulence now
### Species Unit Conversion: kg/kg dry -> v/v dry ###
### VDIFFDR: VDIFFDR begins
### VDIFFDR: after emis. and depdrp
### VDIFFDR: before vdiff
### VDIFF: vdiff begins
### VDIFF: diffusion begins
### VDIFF: compute free atmos. diffusion
### VDIFF: pbldif begins
### VDIFF: after pbldif
### VDIFF: starting diffusion
### VDIFF: vdiff begins
### VDIFF: diffusion begins
### VDIFF: compute free atmos. diffusion
### VDIFF: pbldif begins
### VDIFF: after pbldif
### VDIFF: starting diffusion
### VDIFF: vdiff begins
### VDIFF: diffusion begins
### VDIFF: compute free atmos. diffusion
### VDIFF: pbldif begins
### VDIFF: after pbldif
### VDIFF: starting diffusion
### VDIFF: vdiff begins
### VDIFF: diffusion begins
### VDIFF: compute free atmos. diffusion
### VDIFF: pbldif begins
### VDIFF: after pbldif
### VDIFF: starting diffusion
### VDIFFDR: after vdiff
### VDIFFDR: VDIFFDR finished
### DO_PBL_MIX_2: after VDIFFDR
### DO_PBL_MIX_2: after AIRQNT
### Species Unit Conversion: v/v dry -> kg/kg dry ###
### Species Unit Conversion: kg/kg dry -> kg/m2 ###
### Species Unit Conversion: kg/m2 -> kg/kg dry ###
--- Turbulence done!
### Species Unit Conversion: kg/kg dry -> v/v dry ###
### Species Unit Conversion: v/v dry -> kg/kg dry ###
--- Do wetdep now
### DO_WETDEP: before LS wetdep
### Species Unit Conversion: kg/kg dry -> kg/m2 ###
### Species Unit Conversion: kg/m2 -> kg/kg dry ###
### DO_WETDEP: after LS wetdep
--- Wetdep done!
### Species Unit Conversion: kg/kg dry -> v/v dry ###
AGCM Date: 2016/07/01 Time: 01:00:00
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x146f034ed2da in ???
#1 0x146f034ec503 in ???
#2 0x146f02929f1f in ???
... etc ...
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node ip-172-31-93-224 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
But when I archive only Met_AD and Met_AIRVOL (the 1st 2 fields), then the run finishes OK and prints out all timing info. The log file still says though:
mpirun has exited due to process rank 3 with PID 0 on
node ip-172-31-93-224 exiting improperly.
But that was happening in runs that finished successfully on AWS. (This also triggered a core-dump on Odyssey, which might be a SLURM issue as described in #11.)
So it seems that adding one more diagnostic export somehow triggers a crash without writing anything to disk. Maybe this is a memory issue with MPI? Don't know. Maybe we haven't maxed out environment settings (but I think that's accounted for in gchp.env). Or there isn't enough memory in the AWS instance. But I think r5.2xlarge has 128GB so that should be enough.
Ideas? ...
It seems odd that it would be a memory issue since Jiawei was able to output all diagnostics except State_met. Are you able to output any combination of two State_met diagnostics but not three? Is it the same for both state_met_avg and state_met_inst? Do you get the same results for a 2-hr duration run with 1-hr diagnostics?
-- Lizzie Lundgren, Scientific Programmer, GEOS-Chem Support Team, geos-chem-support@as.harvard.edu, http://wiki.geos-chem.org/GEOS-Chem_Support_Team
Please direct all GEOS-Chem support issues to the entire GEOS-Chem Support Team at geos-chem-support@as.harvard.edu. This will allow us to serve you better.
Interestingly, running GCHP inside a container fixes this problem: run_gchp_singularity.log. It writes all 4 collections without problems and prints full timing info at the end. That's really weird because the container uses exactly the same libraries as the AMI...
I have an entirely new chapter on using containers. Will send an email with more details.
Interesting. Could you try a few more times, including with all diagnostics on, to make sure this isn’t an occasional error?
Also, I was just now able to run a 2-hr simulation saving out only the StateMet_inst collection, with all fields archived.
to make sure this isn’t an occasional error?
The error on AWS AMI happens consistently in repeated runs...
Very weird, I also just now was able to run a 2-hr simulation saving out all fields of StateMet_avg.
Going to try again compiling and running from scratch
The reports make it sound like this issue is not consistent on the AMI. Bob, could you test for a little while and then post a report so that it is easier to track? I think Jiawei's original issue was that the error only happened when both StateMet_inst and StateMet_avg were output.
I ran a grid of GCHP runs directly on the AMI (no containers), turning on various combinations of SpeciesConc_{avg,inst} and StateMet_{avg,inst}. As you can see, we get a variety of results.
Consistent results:
- Saving out SpeciesConc_avg, SpeciesConc_inst, StateMet_avg, StateMet_inst dies at 00:10
- Saving out StateMet_avg and StateMet_inst dies at 01:00
But runs with at least one of the SpeciesConc collections plus both StateMet collections, or with one StateMet collection but not the other, seem to finish just fine.
NOTE: SC=SpeciesConc, SM=State_Met
AMI : GEOSChem_with_GCHP_12.1.1_tutorial_20181216 (ami-06f4d4afd350f6e4c)
Instance : r5.2xlarge
All runs were done in the AMI, no containers used.
SC_avg SC_inst SM_avg SM_inst Result NOTES
=====================================================================================
ON ON ON ON DIED @ 00:10 All fields of SC and SM requested
OFF ON ON ON Finished* @ 02:00 All fields of SC and SM requested
ON OFF ON ON Finished* @ 02:00 All fields of SC and SM requested
OFF OFF ON ON DIED @ 01:00 All fields of SM requested
OFF OFF OFF ON Finished* @ 02:00 All fields of SM requested
OFF OFF ON OFF Finished* @ 02:00 All fields of SM requested
ON ON OFF OFF Finished* @ 02:00 All fields of SC requested
Reruns:
ON ON ON ON DIED @ 00:00 Died at MAPL_ExtDataInterpField l. 3240
OFF OFF ON ON DIED @ 01:00 Segmentation fault
Finished* = The run finished normally, saved out all output files, and printed timing info down to ExtData, but also printed this message to the stderr output:
Backtrace for this error:
#0 0x150792b682da in ???
#1 0x150792b67503 in ???
... etc ...
#19 0xffffffffffffffff in ???
--------------------------------------------------------------------------
mpirun has exited due to process rank 3 with PID 0 on
node ip-172-31-93-224 exiting improperly. There are three reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
3. this process called "MPI_Abort" or "orte_abort" and the mca parameter
orte_create_session_dirs is set to false. In this case, the run-time cannot
detect that the abort call was an abnormal termination. Hence, the only
error message you will receive is this one.
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
You can avoid this message by specifying -quiet on the mpirun command line.
--------------------------------------------------------------------------
I guess we shouldn't worry about this too much if the runs can consistently work within a container. That may be the best solution going forward.
Thanks for the thorough tests. Probably mark it as a long-term issue... It might cause problems for other diagnostic collections as well. Turning on all collections clearly crashes the run (https://github.com/geoschem/gchp/issues/12#issuecomment-447531664); not sure which one is causing the problem.
I believe that the root cause of this issue is #15.
I am closing this issue since it is resolved by the MAPL update in 12.5.
After the fix https://github.com/geoschem/gchp/issues/6#issuecomment-447475255 there remains one issue: with both StateMet_avg and StateMet_inst turned on, the run crashes at 01:00 when writing the first diagnostics. Tested with GC version 12.1.1, GCHP branch bugfix/GCHP_issues, and OpenMPI3.

Those cases can finish and print full timing info:
- StateMet_inst: run_StateMet_inst.log
- StateMet_avg: run_StateMet_avg.log
- StateMet_avg + SpeciesConc_inst + SpeciesConc_avg: run_3diag.log

Those cases crash:
- StateMet_inst and StateMet_avg: run_StateMet_both.log
- StateMet_inst + StateMet_avg + SpeciesConc_inst + SpeciesConc_avg: run_4diag.log

Not a critical problem, as I can just turn off StateMet for either tutorial or benchmark purposes. Will proceed to make the tutorial AMI.