ESMCI / cime

Common Infrastructure for Modeling the Earth
http://esmci.github.io/cime

Enable threading and improve IO performance for normal resolutions for cesm2.2 on derecho #4559

Closed fvitt closed 8 months ago

fvitt commented 8 months ago

Enable threading on derecho. Remove the default setting of MPICH_MPIIO_HINTS on derecho, which seems to degrade I/O performance for normal-resolution configurations.

Test suite:

  PASS ERC_D_Ln9.ne16_ne16_mg17.QPC5HIST.derecho_intel.cam-outfrq3s_usecase
  PASS ERC_Ln9_P64x4.f19_f19_mg17.QPC6.derecho_intel.cam-outfrq9s
  PASS ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
  PASS SMS_D_Ln9.f09_f09_mg17.FCvbsxHIST.derecho_intel.cam-outfrq9s
  PASS SMS_D_Ln9.f19_f19_mg17.FWma2000climo.derecho_intel.cam-outfrq9s
  PASS SMS_D_Ln9_P512x4.f19_f19_mg17.FX2000.derecho_intel.cam-outfrq9s
  PASS SMS_Ld1.ne30pg3_ne30pg3_mg17.FC2010climo.derecho_intel.cam-outfrq1d
  PASS SMS_Lm13.f10_f10_mg37.F2000climo.derecho_intel.cam-outfrq1m

Test baseline:

    PASS ERC_D_Ln9.ne16_ne16_mg17.QPC5HIST.derecho_intel.cam-outfrq3s_usecase BASELINE cam_cesm2_2_rel_09:
    PASS SMS_D_Ln9.f09_f09_mg17.FCvbsxHIST.derecho_intel.cam-outfrq9s BASELINE cam_cesm2_2_rel_09:
    PASS SMS_D_Ln9.f19_f19_mg17.FWma2000climo.derecho_intel.cam-outfrq9s BASELINE cam_cesm2_2_rel_09:
    PASS SMS_Ld1.ne30pg3_ne30pg3_mg17.FC2010climo.derecho_intel.cam-outfrq1d BASELINE cam_cesm2_2_rel_09:
    PASS SMS_Lm13.f10_f10_mg37.F2000climo.derecho_intel.cam-outfrq1m BASELINE cam_cesm2_2_rel_09:

Test namelist changes: N/A

Test status: bit for bit unchanged

fvitt commented 8 months ago

@sjsprecious When using the mpibind script some of my jobs fail with these errors:

cat: '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
rm: cannot remove '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory

Could this be caused by running several test jobs simultaneously?

sjsprecious commented 8 months ago

> @sjsprecious When using the mpibind script some of my jobs fail with these errors:
>
> cat: '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
> rm: cannot remove '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
>
> Could this be caused by running several test jobs simultaneously?

It could be. However, the error message indicates that the script is only trying to remove some temporary files from the tmp folder, which should not directly cause your test to fail. I will forward this issue to Rory and see if he has better insight here.

sjsprecious commented 8 months ago

@fvitt CISL has updated the mpibind script to address your issue. Let me know if it works for your simulations. Thanks.

fvitt commented 8 months ago

@jedwards4b and @sjsprecious I am having issues with these WACCMX ERP tests hanging with mpibind.

  ERP_D_Ln9_P256x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
  ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s

This CAM ERP test passes with mpibind:

  ERP_D_Ln9_P256x4.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s

The above ERP FX2000 tests pass when I use the binding arguments to mpiexec...

jedwards4b commented 8 months ago

@fvitt I've let Rory know - can you provide any details about where it's hanging?

sjsprecious commented 8 months ago

Hi @fvitt, when you say "hanging", do you mean your job never runs on Derecho, or that it dies without a clear error message?

Can you send me the path to the output logs on Derecho so that I can take a look?

In addition, is running <128 CPU cores per node on Derecho still a problem for you?

fvitt commented 8 months ago

@jedwards4b and @sjsprecious The initial run of the test completes okay. The tests hang on the "case2run" portion of the test where the ntasks and nthreads are halved. The second run hangs and errors out after hitting the wall clock limit.

For example see:

/glade/derecho/scratch/fvitt/ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s.cesm22_test/run/case2run
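
To make the "case2run" layout concrete, here is a minimal sketch that reads the `P<ntasks>x<nthreads>` modifier from a test name and halves both values, as described above. The helper name `halve_pe_layout` is hypothetical and this is not the CIME implementation; the real ERP test may treat individual components differently.

```python
import re


def halve_pe_layout(test_name):
    """Return the halved (ntasks, nthreads) for a test name with a P<N>x<M> modifier.

    Hypothetical helper for illustration only; not part of CIME.
    """
    # The PE modifier sits in the first dot-separated field, e.g. "_P512x4."
    match = re.search(r"_P(\d+)x(\d+)(?=[_.])", test_name)
    if match is None:
        raise ValueError("no P<ntasks>x<nthreads> modifier in test name")
    ntasks, nthreads = int(match.group(1)), int(match.group(2))
    # Halve both values, never dropping below one task or one thread.
    return max(ntasks // 2, 1), max(nthreads // 2, 1)


if __name__ == "__main__":
    name = "ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s"
    print(halve_pe_layout(name))  # (256, 2)
```

Under this reading, the second run of the P512x4 test uses 256 tasks with 2 threads each.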

jedwards4b commented 8 months ago

Can you run, for example, an SMS run of this compset with the pelayout of case2?

fvitt commented 8 months ago

> Can you run, for example, an SMS run of this compset with the pelayout of case2?

Yes, this passed: SMS_Ln9_P256x2.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s

sjsprecious commented 8 months ago

Hi @fvitt, thanks for sharing the case directory. I looked at the .case.test file and the PBS resources are specified as select=16:ncpus=128:mpiprocs=32:ompthreads=4. Therefore, when you perform an ERP test and ntasks/nthreads are halved, you will again undersubscribe a full node. This reminds me of the problem you reported before about using 36 CPU cores per node (https://github.com/NCAR/mpibind/issues/4), and according to Rory's reply, mpibind cannot handle this case properly. This is my naive explanation of the ERP failure here, but @jedwards4b may have more insights. Also, this does not explain why F2000climo passes while FX2000 fails. 😕
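
The arithmetic behind the undersubscription can be sketched as follows. This is an illustration only: it assumes the second run stays on the same 16-node allocation and spreads tasks evenly across nodes, which is a simplification and not necessarily what mpiexec or mpibind actually does. The function name `cores_used_per_node` is hypothetical.

```python
CORES_PER_NODE = 128  # ncpus from the PBS select line quoted above
NODES = 16            # select=16 from the same line


def cores_used_per_node(ntasks, nthreads, nodes=NODES):
    """Return (mpiprocs per node, cores used per node) for an even spread.

    Hypothetical helper; assumes tasks divide evenly across the nodes.
    """
    if ntasks % nodes != 0:
        raise ValueError("ntasks does not divide evenly across nodes")
    mpiprocs = ntasks // nodes
    return mpiprocs, mpiprocs * nthreads


if __name__ == "__main__":
    for label, ntasks, nthreads in [("case1", 512, 4), ("case2", 256, 2)]:
        mpiprocs, used = cores_used_per_node(ntasks, nthreads)
        state = "full" if used == CORES_PER_NODE else "undersubscribed"
        print(f"{label}: mpiprocs={mpiprocs} ompthreads={nthreads} "
              f"cores={used}/{CORES_PER_NODE} ({state})")
```

Under these assumptions the first run fills each node (32 tasks x 4 threads = 128 cores), while the halved run uses only 32 of the 128 cores per node.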

fvitt commented 8 months ago

> In addition, is running <128 CPU cores per node on Derecho still a problem for you?

Yes this is still a problem.

jedwards4b commented 8 months ago

I think the < 128 one is on me; I need to modify config_batch.xml to handle this case.

jedwards4b commented 8 months ago

@fvitt please update your cime branch maint-5.8_5.8.32 and try again on < 128.

fvitt commented 8 months ago

> @fvitt please update your cime branch maint-5.8_5.8.32 and try again on < 128.

This passes now with the latest in maint-5.8_5.8.32 branch.

fvitt commented 8 months ago

Could it be that our setting of OMP_STACKSIZE is not being used by mpibind?