Closed: JiaweiZhuang closed this issue 5 years ago
To summarize current status, after switching from MPICH 3.3 to OpenMPI 2.1:
I suggest checking whether this issue goes away after upgrading to OpenMPI 3. Switching from OpenMPI 2 to OpenMPI 3 fixed this same issue on the Odyssey cluster after its operating system was upgraded (CentOS 6 to CentOS 7).
With OpenMPI 3.1.3:
Because OpenMPI 3 seems to be the most robust configuration right now, I suggest focusing on fixing its timing issue and putting MPICH 3 and OpenMPI 2 aside for now.
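When several MPI stacks are in play like this, it helps to confirm which `mpirun` a shell actually resolves before launching. A small sketch (the helper name is made up, and the sample version strings are typical `mpirun --version` formats, not output captured from these runs):

```shell
#!/bin/sh
# Classify an MPI stack from the first line of "mpirun --version".
# Match patterns and sample strings are assumptions about typical
# output formats, not logs from this issue.
classify_mpi() {
    case "$1" in
        *"Open MPI"*" 3."*) echo "openmpi3" ;;        # e.g. OpenMPI 3.1.3
        *"Open MPI"*)       echo "openmpi-other" ;;   # e.g. OpenMPI 2.x
        *HYDRA*|*MPICH*)    echo "mpich" ;;           # MPICH's Hydra launcher
        *)                  echo "unknown" ;;
    esac
}

# On a real system: classify_mpi "$(mpirun --version 2>/dev/null | head -n 1)"
classify_mpi "mpirun (Open MPI) 3.1.3"   # -> openmpi3
classify_mpi "HYDRA build details:"      # -> mpich
```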
I made another AMI `ami-01074a30392daa0f9` with OpenMPI 3.1.3:

```
cd ~/tutorial/gchp_standard
mpirun -np 6 -oversubscribe ./geos
```

`-oversubscribe` is needed by the OpenMPI 3 runtime when the number of physical cores is less than the number of MPI processes.
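As a quick pre-launch check, one can compare the host's core count against the requested rank count and add the flag only when needed. A minimal sketch (the rank count of 6 and `./geos` come from the command above; the `build_cmd` helper name is hypothetical):

```shell
#!/bin/sh
# Build an mpirun command line, adding -oversubscribe only when the
# host has fewer cores than requested MPI ranks (OpenMPI 3 otherwise
# refuses to launch). Helper name is hypothetical.
build_cmd() {
    ranks=$1
    cores=$2
    flag=""
    if [ "$cores" -lt "$ranks" ]; then
        flag="-oversubscribe "   # fewer cores than ranks
    fi
    echo "mpirun -np $ranks ${flag}./geos"
}

# On a real host: build_cmd 6 "$(nproc)"
build_cmd 6 4   # -> mpirun -np 6 -oversubscribe ./geos
build_cmd 6 8   # -> mpirun -np 6 ./geos
```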
Thanks Jiawei, I am closing this issue since switching to OpenMPI 3 fixed the file write hang. The separate issue of incomplete timing info at the end of the run remains open and will be tracked in https://github.com/geoschem/gchp/issues/6.
I think we should put a warning on the wiki regarding issues with MPICH3 and OpenMPI2. Unlike on the cloud, users on a lot of shared systems cannot install whatever MPI they want.
Yes, I completely agree. This will be done this week.
-- Lizzie Lundgren, Scientific Programmer, GEOS-Chem Support Team (geos-chem-support@as.harvard.edu, http://wiki.geos-chem.org/GEOS-Chem_Support_Team)
Trying to summarize the different behaviors in #6 #8 #9, as things are getting messy...

1. Crashes at the first time step (at 00:10)

- `SpeciesConc_inst` with only two species, `SpeciesConc_NO` and `SpeciesConc_O3`, in it.

Typical error message is:
2. Crashes when writing the first diagnostics file (at 01:00)

- `SpeciesConc_avg`, `SpeciesConc_inst`, `StateMet_avg`, and `StateMet_inst` with hundreds of variables.

Typical error message is:

This doesn't seem to be a memory problem; it still happens on `r5.4xlarge` with 128 GB RAM.

3. Crashes right before printing timing information
- `SpeciesConc_inst` with hundreds of species.
- `SpeciesConc_inst` and `SpeciesConc_avg` with hundreds of default species.
- `StateMet_avg` and `StateMet_inst` with default variables.

This means each single collection on its own won't cause the no. 2 problem. (v2018-11 restart file.) By tweaking the amount of output, I can get more timing info printed, but the run still ends with:

```
BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
```

With `SpeciesConc_inst` and `SpeciesConc_avg`, each with two species, the model can print both `Times for GIGCenv` and `Times for GIGCchem`. With `StateMet_avg` and `StateMet_inst` with default variables, the model can print `Times for GIGCenv`, `Times for GIGCchem`, `Times for DYNAMICS`, and `Times for GCHP`, but `Times for HIST` and `Times for EXTDATA` are still missing.

4. Run to the end with the full timing information printed
This means no error message occurs. The model prints all the way down to `Times for EXTDATA` and creates the new `cap_restart` file. I can only make this happen with a very tricky configuration:

- `SpeciesConc_inst` and `SpeciesConc_avg` with hundreds of default species, and emissions turned off in input.geos. (Note that turning off transport using `runConfig.sh` has no impact on the error.)

Log file for this only successful run so far: run_two_collections_emission_off.log
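For reference, a two-collection setup of the kind described above maps onto a `HISTORY.rc` fragment roughly like the following. This is a sketch only, following MAPL History conventions; the frequency, mode, and per-field component values are assumptions, not copied from these runs:

```
COLLECTIONS: 'SpeciesConc_avg',
             'SpeciesConc_inst',
::
  SpeciesConc_inst.frequency:  010000,
  SpeciesConc_inst.mode:       'instantaneous',
  SpeciesConc_inst.fields:     'SpeciesConc_NO', 'GIGCchem',
                               'SpeciesConc_O3', 'GIGCchem',
::
```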
Environment

All tests are performed with MPICH 3.3 and gcc 7.3.0 on Ubuntu 18.04 (`ami-0a5973f14aad7413a`). I also have OpenMPI 2.1 working (scripts).
I consider OpenMPI 2.1 even worse, because basically no diagnostics can be archived; with MPICH it at least functions in some cases.