MPAS-Dev / compass


EC30to60 performance test hanging on Chrysalis with Gnu and OpenMPI #500

xylar closed this issue 1 year ago

xylar commented 1 year ago

There is no error message, but the simulation never starts. See:

/lcrc/group/e3sm/ac.xylar/compass_1.2/chrysalis/test_20230111/ocean_pr_intel_gnu/ocean/global_ocean/EC30to60/PHC/performance_test/forward
xylar commented 1 year ago

Same on Chicoma in the latest testing.

mark-petersen commented 1 year ago

I can confirm this behavior on Chicoma. In the PR test suite I see:

00:00 PASS ocean_global_ocean_EC30to60_mesh
00:00 PASS ocean_global_ocean_EC30to60_PHC_init
115:36 FAIL ocean_global_ocean_EC30to60_PHC_performance_test

This one also has trouble:

ocean/isomip_plus/planar/2km/z-star/Ocean0
  * step: process_geom
  * step: planar_mesh
  * step: cull_mesh
  * step: initial_state
  * step: ssh_adjustment

It appears to hang on this line in the log file, but sometimes recovers.

 Reading namelist from file namelist.ocean

Watching the log file, I see it takes about 10 minutes to get through reading the namelist, which should take just a few seconds. This appears to be an I/O problem. I get the same behavior by simply running the srun command directly, so this is unrelated to the compass interface.
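
For context, a namelist read is plain sequential Fortran I/O and should finish almost instantly. A minimal sketch (not the actual MPAS reader; the decomposition group and config_num_halos option are just illustrative stand-ins):

program read_namelist_sketch
   implicit none
   integer :: config_num_halos, unit, ios
   namelist /decomposition/ config_num_halos
   ! A plain sequential read like this normally takes milliseconds, so
   ! minutes spent here point at the filesystem, not the model itself.
   open(newunit=unit, file='namelist.ocean', status='old', action='read')
   read(unit, nml=decomposition, iostat=ios)
   close(unit)
end program read_namelist_sketch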

It also hangs for several minutes here, again indicating an I/O problem:

  ** Attempting to bootstrap MPAS framework using stream: mesh
 Bootstrapping framework with mesh fields from input file 'adjusting_init.nc'
mark-petersen commented 1 year ago

On Chrysalis, it simply hangs at this point in the log file:

ocean/global_ocean/EC30to60/PHC/performance_test
  * step: forward

pwd
/lcrc/group/e3sm/ac.mpetersen/scratch/runs/n/ocean_model_230322_c9201a4f_ch_gfortran_openmp_test_compass_EC/ocean/global_ocean/EC30to60/PHC/performance_test/forward

(dev_compass_1.2.0-alpha.5) chr:forward$ tail -f log.ocean.0000.out
WARNING: Variable avgTotalFreshWaterTemperatureFlux not in input file.
WARNING: Variable tidalPotentialEta not in input file.
WARNING: Variable nTidalPotentialConstituents not in input file.

On Perlmutter, it failed on EC30to60 and then hangs on ECwISC30to60:

pm:ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC$ cr
ocean/global_ocean/EC30to60/mesh
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/EC30to60/PHC/init
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/EC30to60/PHC/performance_test
  * step: forward
      Failed
  test execution:      ERROR
  see: case_outputs/ocean_global_ocean_EC30to60_PHC_performance_test.log
  test runtime:        00:10
ocean/global_ocean/ECwISC30to60/mesh
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/ECwISC30to60/PHC/init
  test execution:      SUCCESS
  test runtime:        00:00
ocean/global_ocean/ECwISC30to60/PHC/performance_test
  * step: forward

The ocean/global_ocean/EC30to60/PHC/performance_test ends here in the log file:

pwd
/pscratch/sd/m/mpeterse/runs/n/ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC/ocean/global_ocean/EC30to60/PHC/performance_test/forward

WARNING: Variable filteredSSHGradientMeridional not in input file.
WARNING: Variable avgTotalFreshWaterTemperatureFlux not in input file.
WARNING: Variable tidalPotentialEta not in input file.
WARNING: Variable nTidalPotentialConstituents not in input file.
WARNING: Variable RediKappaData not in input file.

and the ocean/global_ocean/ECwISC30to60/PHC/performance_test hangs here in the log file:

pwd
/pscratch/sd/m/mpeterse/runs/n/ocean_model_230322_c9201a4f_lo_gnu-cray_openmp_compassPR_EC/ocean/global_ocean/ECwISC30to60/PHC/performance_test/forward

tail -n 5 log.ocean.0000.out
WARNING: Variable landIceDraft not in input file.
WARNING: Variable landIceFreshwaterFlux not in input file.
WARNING: Variable landIceHeatFlux not in input file.
WARNING: Variable heatFluxToLandIce not in input file.
WARNING: Variable tidalPotentialEta not in input file.
xylar commented 1 year ago

@mark-petersen, do you think we just need to generate a more up-to-date cached mesh and initial condition? It seems worth a try. If that works, it would be a huge relief!

xylar commented 1 year ago

I can at least try that right now.

xylar commented 1 year ago

I ran the EC test cases without the cached mesh and init, and I still get the hang on Chrysalis with GNU and OpenMPI. I'm trying ECwISC but expect to find the same. So I don't think this has anything to do with important variables missing from the initial condition. Those warnings are a red herring.

xylar commented 1 year ago

Yep, same for ECwISC.

xylar commented 1 year ago

Using git bisect, together with a timeout added to the model run call, I have traced this back to https://github.com/E3SM-Project/E3SM/pull/5120.

xylar commented 1 year ago

Using print statements, I have traced the problem to: https://github.com/E3SM-Project/E3SM/blob/master/components/mpas-framework/src/framework/add_field_indices.inc#L33 and https://github.com/E3SM-Project/E3SM/blob/master/components/mpas-framework/src/framework/mpas_dmpar.F#L745 for the variable RediKappaData.

This is very strange! It seems that an MPI_Allreduce call is hanging. I don't see any changes in https://github.com/E3SM-Project/E3SM/pull/5120 that explain this.

Even more frustrating, it happens only in optimized mode. In debug mode, everything seems fine.
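
To make the failure mode concrete, here is a minimal sketch of the kind of call involved (paraphrased, not the actual mpas_dmpar.F code; the variable names are invented). An MPI_Allreduce is collective, so if even one rank never reaches it, or arrives with inconsistent arguments, no rank ever returns.

program allreduce_pattern_sketch
   use mpi
   implicit none
   integer :: ierr, localSize, globalSize
   call MPI_Init(ierr)
   localSize = 100   ! stand-in for this rank's extent of a decomposed dimension
   ! Collective reduction: every rank must reach this call, or it hangs.
   call MPI_Allreduce(localSize, globalSize, 1, MPI_INTEGER, MPI_MAX, &
                      MPI_COMM_WORLD, ierr)
   call MPI_Finalize(ierr)
end program allreduce_pattern_sketch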

xylar commented 1 year ago

@mark-petersen, any thoughts?

sarats commented 1 year ago

Just a thought: is any data, initialization, or flag that is expected to be shared among threads missing? It looks like the MPI_Allreduce is issued by thread 0. To rule out a threading-related issue, you could try running this in pure MPI mode.

https://github.com/E3SM-Project/E3SM/blob/4deb2611a4293fdb578db5dd1ba9fd7a6c223029/components/mpas-framework/src/framework/mpas_dmpar.F#L743
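
A quick standalone way to check what threading support the MPI library actually provides (a hypothetical helper, not part of MPAS):

program thread_level_check
   use mpi
   implicit none
   integer :: ierr, provided
   ! An OpenMP build where only the master thread calls MPI needs
   ! MPI_THREAD_FUNNELED; a pure MPI run needs only MPI_THREAD_SINGLE.
   call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
   if (provided < MPI_THREAD_FUNNELED) then
      print *, 'MPI provides thread level', provided, &
               '; funneled threading is not supported'
   end if
   call MPI_Finalize(ierr)
end program thread_level_check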

xylar commented 1 year ago

@sarats, great suggestion! I had only tried with multiple threads so far. I'll try with 1 thread per core and see if the problem persists.

xylar commented 1 year ago

@sarats, I tested again without OpenMP support, but the hanging behavior remains.

I also looked at the configuration I've been running, and it was already using a single thread, so a threading issue seems unlikely. Even so, thank you for the suggestion; it's good that we seem to have eliminated that particular possibility.

mark-petersen commented 1 year ago

OK, I figured it out. It's the variable RediKappaData that is causing the problem. That variable is declared in the Registry but never actually used. So I'm guessing that the compiler, in its optimizing exuberance, discarded some underlying information about the array, and the MPI communication then hangs when it communicates the array's size.
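
As a toy illustration of that theory (hypothetical code, not MPAS): if one rank arrives at the Allreduce with a corrupted size, the call is erroneous MPI and, with many implementations, simply never returns.

program mismatched_allreduce_demo
   use mpi
   implicit none
   integer :: ierr, rank, n
   integer, allocatable :: buf(:), res(:)
   call MPI_Init(ierr)
   call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
   n = 4                  ! healthy case: the same count on every rank
   if (rank == 0) n = 1   ! deliberately corrupt one rank's count to
                          ! mimic the hypothesized optimizer damage
   allocate(buf(n), res(n))
   buf = rank
   ! Mismatched counts are erroneous MPI; depending on the library this
   ! deadlocks or silently corrupts data. Run only as a demonstration.
   call MPI_Allreduce(buf, res, n, MPI_INTEGER, MPI_SUM, MPI_COMM_WORLD, ierr)
   call MPI_Finalize(ierr)
end program mismatched_allreduce_demo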

I was able to reproduce the error just after the merge of https://github.com/E3SM-Project/E3SM/pull/5120, but the error does not occur just before it. Once I removed RediKappaData, everything works fine, with no hang. I was testing on Chrysalis with the EC30to60 performance test, here:

/lcrc/group/e3sm/ac.mpetersen/scratch/runs/ocean_model_230404_c5f8b378_ch_gfortran_openmp_after_5120/ocean/global_ocean/EC30to60/PHC/performance_test/forward

My theory does not explain why the innocent-looking https://github.com/E3SM-Project/E3SM/pull/5120 would trigger this. I can only say that the fix works, and that compiler optimization is a finicky business.

I will post a bug report and bug fix to E3SM tomorrow.

sarats commented 1 year ago

"optimizing exuberance"

Mark: just curious, what optimization level was used when it hangs? Was it O3 or even lower?

mark-petersen commented 1 year ago

It was with O3.

xylar commented 1 year ago

This does indeed appear to be fixed by https://github.com/E3SM-Project/E3SM/pull/5575. I will close this issue once that PR has been merged and the E3SM-Project submodule here has been updated.