Same on Chicoma in the latest testing.
I can confirm this behavior on Chicoma. In the PR test suite I see:
00:00 PASS ocean_global_ocean_EC30to60_mesh
00:00 PASS ocean_global_ocean_EC30to60_PHC_init
115:36 FAIL ocean_global_ocean_EC30to60_PHC_performance_test
This one also has trouble:
ocean/isomip_plus/planar/2km/z-star/Ocean0
* step: process_geom
* step: planar_mesh
* step: cull_mesh
* step: initial_state
* step: ssh_adjustment
It appears to hang on this line in the log file, but sometimes recovers.
Reading namelist from file namelist.ocean
Watching the log file, it takes about 10 minutes to get through reading the namelist, which should take just a few seconds. This appears to be an I/O problem. I get the same behavior by simply running the srun command by hand, so this is unrelated to any compass interface.
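For reference, reproducing that outside of compass looks roughly like this (from an interactive allocation; the work directory, task count, and executable name below are placeholders, not the exact command used):

cd <step work directory>            # e.g. the ssh_adjustment step above
srun -n 4 ./ocean_model &           # launch the model directly, bypassing compass
tail -f log.ocean.0000.out          # watching this, the namelist read alone took ~10 minutes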
It also hangs for several minutes here, again indicating an I/O problem:
** Attempting to bootstrap MPAS framework using stream: mesh
Bootstrapping framework with mesh fields from input file 'adjusting_init.nc'
On Chrysalis, this simply hangs at this point in the log file:
ocean/global_ocean/EC30to60/PHC/performance_test
* step: forward
pwd
/lcrc/group/e3sm/ac.mpetersen/scratch/runs/n/ocean_model_230322_c9201a4f_ch_gfortran_openmp_test_compass_EC/ocean/global_ocean/EC30to60/PHC/performance_test/forward
(dev_compass_1.2.0-alpha.5) chr:forward$ tail -f log.ocean.0000.out
WARNING: Variable avgTotalFreshWaterTemperatureFlux not in input file.
WARNING: Variable tidalPotentialEta not in input file.
WARNING: Variable nTidalPotentialConstituents not in input file.
On Perlmutter, it failed and then hangs on the ECwISC30to60:
pm:ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC$ cr
ocean/global_ocean/EC30to60/mesh
test execution: SUCCESS
test runtime: 00:00
ocean/global_ocean/EC30to60/PHC/init
test execution: SUCCESS
test runtime: 00:00
ocean/global_ocean/EC30to60/PHC/performance_test
* step: forward
Failed
test execution: ERROR
see: case_outputs/ocean_global_ocean_EC30to60_PHC_performance_test.log
test runtime: 00:10
ocean/global_ocean/ECwISC30to60/mesh
test execution: SUCCESS
test runtime: 00:00
ocean/global_ocean/ECwISC30to60/PHC/init
test execution: SUCCESS
test runtime: 00:00
ocean/global_ocean/ECwISC30to60/PHC/performance_test
* step: forward
The ocean/global_ocean/EC30to60/PHC/performance_test ends here in the log file:
pwd
/pscratch/sd/m/mpeterse/runs/n/ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC/ocean/global_ocean/EC30to60/PHC/performance_test/forward
1141 WARNING: Variable filteredSSHGradientMeridional not in input file.
1142 WARNING: Variable avgTotalFreshWaterTemperatureFlux not in input file.
1143 WARNING: Variable tidalPotentialEta not in input file.
1144 WARNING: Variable nTidalPotentialConstituents not in input file.
1145 WARNING: Variable RediKappaData not in input file.
and the ocean/global_ocean/ECwISC30to60/PHC/performance_test hangs here in the log file:
pwd
/pscratch/sd/m/mpeterse/runs/n/ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC/ocean/global_ocean/ECwISC30to60/PHC/performance_test/forward
tail -n 5 log.ocean.0000.out
WARNING: Variable landIceDraft not in input file.
WARNING: Variable landIceFreshwaterFlux not in input file.
WARNING: Variable landIceHeatFlux not in input file.
WARNING: Variable heatFluxToLandIce not in input file.
WARNING: Variable tidalPotentialEta not in input file.
@mark-petersen, do you think we just need to generate a more up-to-date cached mesh and initial condition? It seems worth a try. If that works, it would be a huge relief!
I can at least try that right now.
I ran the EC test cases without the cached mesh and init, and I still get the hanging on Chrysalis with gnu and OpenMP. I'm trying ECwISC but I expect to find the same. So I don't think this has anything to do with missing variables in the initial condition; those warnings are a red herring.
Yep, same for ECwISC.
I have used git bisect, together with adding a timeout to the model run call, to trace this back to https://github.com/E3SM-Project/E3SM/pull/5120.
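For anyone who wants to repeat the bisect, it can be scripted roughly as follows; the build command, run directory, task count, and time limit below are assumptions, not the exact setup used:

cat > bisect_test.sh << 'EOF'
#!/bin/bash
# Rebuild the model at the commit git bisect checked out; exit 125 tells
# git bisect run to skip commits that do not build.
make gfortran CORE=ocean -j 8 || exit 125
cd /path/to/EC30to60/PHC/performance_test/forward   # placeholder run directory
# The hard time limit turns a hang into a failure: on timeout (or a crash)
# the script exits nonzero, which git bisect run records as a bad commit.
timeout 20m srun -n 128 ./ocean_model || exit 1
EOF
chmod +x bisect_test.sh

git bisect start
git bisect bad HEAD                     # known-bad commit (hangs)
git bisect good <last-known-good-sha>   # placeholder: a commit that ran fine
git bisect run ./bisect_test.sh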
Using print statements, I have traced the problem to:
https://github.com/E3SM-Project/E3SM/blob/master/components/mpas-framework/src/framework/add_field_indices.inc#L33
and
https://github.com/E3SM-Project/E3SM/blob/master/components/mpas-framework/src/framework/mpas_dmpar.F#L745
for the variable RediKappaData.
This is very strange! It seems that an MPI_Allreduce is hanging. I don't see any changes in https://github.com/E3SM-Project/E3SM/pull/5120 that explain this.
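As a complement to print statements, attaching a debugger to one of the hung processes is a quick way to confirm where each rank is stuck (a sketch only; the node name, PID, and executable name are placeholders):

ssh <compute-node>                 # the node running the hung job
pgrep -u $USER ocean_model         # list the PIDs of the model ranks
gdb -p <pid> -batch -ex bt         # backtrace; frames in mpas_dmpar / MPI_Allreduce confirm the hang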
Even more frustrating, it happens only in optimized mode. In debug mode, everything seems fine.
@mark-petersen, any thoughts?
Just a thought: is any data/initialization/flag that is expected to be shared among threads missing? It looks like the MPI_Allreduce is issued by thread 0. To rule out a threading-related issue, you can try running this in pure MPI mode.
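A minimal sketch of forcing a pure-MPI run for that check (task count and binding flags are only an example; adjust for the machine):

export OMP_NUM_THREADS=1                          # one thread per MPI rank
srun -n 128 -c 1 --cpu-bind=cores ./ocean_model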
@sarats, great suggestion! I had only tried with multiple threads so far. I'll try with 1 thread per core and see if the problem persists.
@sarats, I tested again without OpenMP support, but the hanging behavior remains.
I also looked at the configuration I've been running, and it was already using a single thread, so a threading issue seems unlikely. Even so, thank you for the suggestion; it's good that we seem to have eliminated that particular possibility.
OK, I figured it out. It's actually the variable RediKappaData that is causing the problem. That variable is declared in the Registry but is never actually used. So I'm guessing that the compiler, in its optimizing exuberance, got rid of some underlying information about the array, and then the MPI communication hangs when it communicates the size of the array.
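One way to double-check that the variable really is unused outside the Registry is a simple search over the source tree (the path assumes the E3SM layout linked above):

# If the only hits are in Registry.xml (plus generated *.inc files in the build
# tree), the variable is declared but never referenced by hand-written code.
grep -rn "RediKappaData" components/mpas-ocean/src/ | grep -v Registry.xml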
I was able to reproduce the error just after the merge of https://github.com/E3SM-Project/E3SM/pull/5120, but the error does not occur just before it. Once I removed RediKappaData, everything works fine, without the hang. I was testing on Chrysalis with the EC30to60 performance tests, here:
/lcrc/group/e3sm/ac.mpetersen/scratch/runs/ocean_model_230404_c5f8b378_ch_gfortran_openmp_after_5120/ocean/global_ocean/EC30to60/PHC/performance_test/forward
My theory on the cause does not explain why the innocent-looking https://github.com/E3SM-Project/E3SM/pull/5120 would cause this. I can only say that the fix works, and compiler optimization is a finicky business.
I will post a bug report and bug fix to E3SM tomorrow.
optimizing exuberance
Mark: Just curious, what was the optimization level used when it hangs? Was it O3 or even lower?
It was with O3.
Appears indeed to be fixed by https://github.com/E3SM-Project/E3SM/pull/5575. I will close this issue once that PR has been merged and the E3SM-Project submodule here has been updated.
There is no error message, but the simulation never starts. See: