fvitt closed this issue 8 months ago.
@sjsprecious When using the mpibind script some of my jobs fail with these errors:
```
cat: '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
rm: cannot remove '/glade/derecho/scratch/fvitt/tmp/mpibind.log.*.tmp': No such file or directory
```
Could this be caused by running several test jobs simultaneously?
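For reference, here is a minimal sketch of the suspected race, assuming mpibind collects per-rank logs through a shared glob in a common tmp directory (the paths and file pattern are taken from the error text; the guarded variant is illustrative, not the actual fix):

```sh
# Illustrative only: two jobs expanding the same glob race each other;
# whichever finishes first removes the files out from under the other.
TMPDIR=/glade/derecho/scratch/$USER/tmp
cat "$TMPDIR"/mpibind.log.*.tmp >> mpibind.log   # other job's files may vanish mid-read
rm  "$TMPDIR"/mpibind.log.*.tmp                  # -> "No such file or directory"

# A race-free variant namespaces the temporary files by PBS job id:
cat "$TMPDIR"/mpibind.log."$PBS_JOBID".*.tmp >> mpibind.log
rm  "$TMPDIR"/mpibind.log."$PBS_JOBID".*.tmp
```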
It could be. However, the error message indicates that it is just trying to remove some temporary files from the tmp folder, which should not fail your test directly. I will forward this issue to Rory and see if he has better insight here.
@fvitt CISL has updated the mpibind script to address your issue. Let me know if it works for your simulations. Thanks.
@jedwards4b and @sjsprecious I am having issues with these WACCMX ERP tests hanging with mpibind:
ERP_D_Ln9_P256x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
This CAM ERP test passes with mpibind:
ERP_D_Ln9_P256x4.f09_f09_mg17.F2000climo.derecho_intel.cam-outfrq9s
The above ERP FX2000 tests pass when I use the binding arguments to mpiexec...
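For context, this is the style of explicit binding meant above (a sketch; the exact flags depend on the mpiexec launcher on Derecho, and the counts correspond to the P512x4 layout):

```sh
# Hypothetical invocation: 512 ranks total, 32 ranks per node,
# 4 OpenMP threads per rank, each rank bound to a block of 4 cores.
export OMP_NUM_THREADS=4
mpiexec -n 512 -ppn 32 -d 4 --cpu-bind depth ./cesm.exe
```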
@fvitt I've let Rory know. Can you provide any details about where it's hanging?
Hi @fvitt , when you say "hanging", do you mean your job never runs on Derecho or it dies without a clear error message?
Can you send me the path to the output logs on Derecho so that I can take a look?
In addition, is running <128 CPU cores per node on Derecho still a problem for you?
@jedwards4b and @sjsprecious The initial run of the test completes okay. The tests hang in the "case2run" portion of the test, where ntasks and nthreads are halved. The second run hangs and then errors out after hitting the wall-clock limit.
For example see:
/glade/derecho/scratch/fvitt/ERP_Ln9_P512x4.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s.cesm22_test/run/case2run
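For reference, the pe-layout arithmetic for that second run (a sketch of what ERP does, not output from the test itself):

```sh
# ERP_Ln9_P512x4: run 1 uses the full layout, run 2 halves both values.
#   run 1:            NTASKS=512, NTHRDS=4
#   run 2 (case2run): NTASKS=256, NTHRDS=2
# So the hanging second run is equivalent to a P256x2 layout.
```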
Can you run, for example, an SMS run of this compset with the pelayout of case2?
Yes, this passed: SMS_Ln9_P256x2.f09_f09_mg17.FX2000.derecho_intel.cam-outfrq9s
Hi @fvitt, thanks for sharing the case directory. I looked at the .case.test file and the PBS resource list is specified as select=16:ncpus=128:mpiprocs=32:ompthreads=4. Therefore, when you perform an ERP test and ntasks/nthreads are halved, you will again undersubscribe a full node. This reminds me of the problem you reported before about using 36 CPU cores per node (https://github.com/NCAR/mpibind/issues/4); according to Rory's reply, mpibind cannot handle this case properly. This is my naive explanation of the ERP failure here, but @jedwards4b may have more insights. It also does not explain why F2000climo passes while FX2000 fails. 😕
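To make the undersubscription concrete, here is the arithmetic (my reading, assuming the PBS select line stays fixed across both runs of the ERP test):

```sh
#PBS -l select=16:ncpus=128:mpiprocs=32:ompthreads=4
# run 1: 32 ranks/node x 4 threads = 128 of 128 cores/node -> fully packed
# run 2: tasks and threads halved  -> 16 ranks/node x 2 threads
#        = 32 of 128 cores/node    -> heavily undersubscribed,
#        the same regime as NCAR/mpibind#4
```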
In addition, is running <128 CPU cores per node on Derecho still a problem for you?
Yes, this is still a problem.
I think the <128 case is on me; I need to modify config_batch.xml to handle it.
@fvitt please update your cime branch maint-5.8_5.8.32 and try again on < 128.
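For the record, the change concerns how the PBS select line is generated when a node is undersubscribed; a hypothetical before/after for a 36-rank single-node job (illustrative values, not the actual config_batch.xml patch):

```sh
# Before: the generated directive assumed fully packed nodes.
#PBS -l select=1:ncpus=128:mpiprocs=128
# After: mpiprocs follows the requested task count instead.
#PBS -l select=1:ncpus=128:mpiprocs=36
```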
This passes now with the latest on the maint-5.8_5.8.32 branch.
Could it be that our setting of OMP_STACKSIZE is not being used by mpibind?
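One quick way to check that hypothesis (a sketch; the rank-id variable name depends on the launcher, so both common spellings are tried):

```sh
# Print what each rank actually sees; if mpibind re-execs ranks without
# propagating the environment, OMP_STACKSIZE will come back empty.
export OMP_STACKSIZE=64M
mpiexec -n 4 sh -c 'echo "rank ${PMI_RANK:-${PALS_RANKID:-?}}: OMP_STACKSIZE=$OMP_STACKSIZE"'
```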
Enable threading on Derecho. Remove the default setting of MPICH_MPIIO_HINTS on Derecho, which seems to degrade IO performance for normal-resolution configurations.
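For context, MPICH_MPIIO_HINTS is the Cray MPICH environment variable that applies MPI-IO (ROMIO) hints to files matching a pattern; the value below is illustrative, not the default that was removed:

```sh
# Illustrative only: apply collective-buffering hints to all files.
export MPICH_MPIIO_HINTS="*:romio_cb_write=enable:cb_nodes=4"
# The change here simply stops setting the variable, so MPI-IO falls
# back to the library defaults.
unset MPICH_MPIIO_HINTS
```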
Test suite:
Test baseline:
Test namelist changes: N/A
Test status: bit for bit unchanged