Closed ndkeen closed 1 year ago
Hi @ndkeen , What create_newcase command were you using? I'd like to look at this on Titan, for comparison.
Think that I found it in the performance_archive - please verify though.
Well, it was a coupled-hires case that used run_acme script.
README.case:2017-06-05 06:11:55: /global/cscratch1/sd/ndk/wacmy/ACME_code/2017-master-jun1/cime/scripts/create_newcase --case /global/cscratch1/sd/ndk/ACME_simulations/2017-master-jun1.hmod1650.master-jun1.nc.nr.st.ne120np4_oRRS18to6v3_ICG.cori-knl/case_scripts --mach cori-knl --compset A_WCYCL2000_H01AS --res ne120np4_oRRS18to6v3_ICG --walltime 8:00:00 --project acme --pecount M
With changes after case created, of course.
Got it. Thanks.
Cost on Titan is also "high" (40 min. is a lot, though not the 68 min. seen on Cori-KNL):
"cesm_init" - 1 - 2425.067627
but
"a_i:PIO:pio_rearrange_create_box" - 6 - 237.736053
is 1/5 the cost on Titan as it is on Cori-KNL (unless this is capturing some load imbalance from elsewhere). FYI.
Titan experiments may not be completely relevant to Cori-KNL, but moving from 86400x1 to 43200x2 in ATM and making similar changes in other components decreased A_WCYCL initialization cost from 2040 seconds to 740 seconds. Similarly, changing from 86400x1 to 43200x1 decreased the cost to 1023 seconds, so some of the high initialization cost for 86400x1 on Titan was due to using 16 MPI tasks per node, but most was an algorithmic scaling issue. I'll see if I can identify where this is coming from. Again, Titan does not see a high pio_rearrange_create_box, so some of the Cori-KNL overhead is Cori-KNL-specific.
@ndkeen , what is the latest on this on Cori-KNL? Does using threading (and fewer MPI processes) eliminate the problem? Are you still interested in diagnosing the 86400x1 ATM processes initialization performance issue?
My Titan jobs are taking awhile to get scheduled, so I will jump to Cori-KNL as well if you can provide a reproducer. I'm guessing that an F case would be sufficient? In any case, please advise.
Looking over several runs at the timer comp_init_cc_atm, I see the following.
The number of nodes in first column, and number of MPI x threads in other columns.
Certainly true that threads are having a large impact on performance for init, which I had not realized until I made this table. Although the two measurements with 675 nodes show a decrease with 2 threads over 1 thread.
nodes 64x1 64x2 64x4 32x4 32x8
85 83 102
169 130 143 173
170 227 281
318
338 333 493
675 1003 1046
1350 3094 3216 1098 1100
2700 4971
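As a rough gauge of how badly this scales, the 64x1 column of the table above can be fit to a power law. This is a sketch only: the two values and the fit are my own reading of the (sparsely aligned) table, not anything from the code base.

```python
import math

# My reading of the table above: comp_init_cc_atm seconds for the
# 64 MPI x 1 thread column at two node counts.
t675, t1350 = 1003.0, 3094.0

# Fit t ~ nodes**p between the two points; p > 1 means super-linear growth.
p = math.log(t1350 / t675) / math.log(1350 / 675)
print(round(p, 2))  # ~1.63: init cost grows much faster than linearly with nodes
```

A doubling of nodes roughly triples the init time here, consistent with the algorithmic scaling issue seen on Titan.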
To reproduce, should be as simple as:
-compset FC5AV1C-H01A -res ne120_ne120
The current default sets -c 4. To try other values, you have to change config_machines.xml.
Looking at the raw timings for 86400x1, around half of the comp_init_cc_atm time is unattributed. I'll need to run this case myself with some additional timers added.
@ndkeen , I'll repeat your earlier run, using the latest master with some additional instrumentation tweaks.
I tried a few experiments with nothing gained. I ran 3 more 675-node cases with 1 thread. Also note that I had the wrong results in the above table for 675 nodes: I had listed the total init time, not the ATM time only (I edited). I will try to plot the time as a function of MPI tasks, which will probably tell a clearer story.
1) I simply ran again with a more recent version of master: 867 s.
2) I ran that case again using sbcast -- a slurm tactic to speed up getting the executable out to the ranks, but it may also be testing something else less well understood -- no real change: 925 s.
3) Ran again with Intel MPI (impi): 998 s.
SMS.ne120_ne120.m39n675p43200t01.sbcast
SMS.ne120_ne120.m39n675p43200t01.impi2017
Pat pointed out to me that the experiments I tried with a different MPI version seem to still be using cray-mpich. I tried a few other things, then realized NERSC needs us to use a different method. The cray wrappers ftn, cc, CC are, of course, going to point to cray MPI. So to use Intel MPI (or openmpi), we need to use the compilers mpiifort, mpiicc, mpiicpc as well as some I_MPI_* env variables. So I went to the trouble of creating a new compiler intel-impi to try this experiment. It does at least build.
Pat, would you mind describing in a sentence or two what you think you are seeing regarding the slow down? Something we can send to NERSC just in case they have seen this before? Or know something to try? Is it that this extra time is being seen at the first MPI point-to-point?
Pat pointed out to me that the experiments I tried with a different MPI version, seem to still be using cray-mpich.
A little strong - just noted that in my experiments, following @ndkeen 's lead, I can't tell what version of MPI I am using based on the build and run logs. Since performance in my one experiment (using openmpi) was very similar to that when using cray-mpich, I just wanted to verify that I was comparing what I thought I was comparing.
Noel, I'll provide a short description of what I am seeing in the near future. I am traveling at the moment.
I managed to get a run going with Intel MPI. It got stuck at shutdown and failed, but not before it wrote out a timing.tar file, which allows me to see the time for comp_init_cc_atm. This was only a test problem using a ne30 F case, but the time in that timer is 2x using Intel MPI versus cray MPI. I can try this with a ne120 case, but it may not be worth it.
I determined that my OpenMPI jobs were not actually using OpenMPI. I tried @ndkeen 's modifications to use Intel MPI. This built (and was accessing the Intel MPI include files). Run died pretty quickly with:
...
28813: acme.exe: error: slurm_get_kvs_comm_set: Transport endpoint is not connected
...
00611: acme.exe: error: slurm_send_kvs_comm_set: Connection timed out
...
00611: Fatal error in MPI_Init: Other MPI error, error stack:
00611: MPIR_Init_thread(805).......: fail failed
00611: MPID_Init(1831).............: channel initialization failed
00611: MPIDI_CH3_Init(147).........: fail failed
00611: MPID_nem_ofi_post_init(1117): fail failed
00611: getConnInfoKVS(905).........: PMI_KVS_Get failed
00611: In: PMI_Abort(69253391, Fatal error in MPI_Init: Other MPI error, error stack:
00611: MPIR_Init_thread(805).......: fail failed
00611: MPID_Init(1831).............: channel initialization failed
00611: MPIDI_CH3_Init(147).........: fail failed
00611: MPID_nem_ofi_post_init(1117): fail failed
00611: getConnInfoKVS(905).........: PMI_KVS_Get failed)
I'll give up on the other MPI libraries for the moment.
@ndkeen ,
Pat, would you mind describing in a sentence or two what you think you are seeing regarding the slow down?
Not a sentence or two, but here is a summary of the latest. I have lots more details, but this should be enough for you to broach the topic with NERSC if you feel that it is worthwhile to pursue it.
I added some more instrumentation into ACME to help isolate the sources of the poor scaling in the initialization overhead. For my runs,
270 nodes (17280 MPI processes, 64 per node)
"CPL:INIT" 543.889771 seconds
450 nodes (28800 MPI processes, 64 per node)
"CPL:INIT" 874.999878 seconds
675 nodes (43200 MPI processes, 64 per node)
"CPL:INIT" 1420.839233 seconds
It appears that all of the performance loss is in the MPI overhead, apparently in the first time that two processes exchange data for a given communicator, for ATM, for LND, and for ICE in this F case. For example, in the creation of a data type to use in subsequent communication as part of the parallel I/O layer (PIO), sending data from each process to the appropriate PIO read/write process, there is a many-to-some communication pattern. For this pattern the cost for the 675 node run was, for process 0 (not a PIO process) (total over all calls, max per call)
Called Total Max
"a_i:pre-CC B-bar" 6 0.042269 0.037482
"a_i:CC B" 6 280.114380 278.405823
"a_i:post-CC B-bar" 6 173.111191 172.199188
for process 1 (a PIO process)
"a_i:pre-CC B-bar" 6 0.041991 0.037327
"a_i:CC B" 6 281.709167 279.997375
"a_i:post-CC B-bar" 6 171.516434 170.607605
while the data over all processes (max and min) was
wallmax (proc thrd ) wallmin (proc thrd )
"a_i:pre-CC B-bar" 0.130 ( 2945 0) 0.003 ( 37308 0)
"a_i:CC B" 453.217 ( 40894 0) 280.114 ( 0 0)
"a_i:post-CC B-bar" 173.111 ( 0 0) 0.021 ( 24641 0)
Note that only one call (the hypothesis is that it is the first) accounted for almost all of the cost out of the 6 times it was called. There was no load imbalance coming into the operator, and quite a bit coming out of it. There were 675 PIO processes in this operator, one per node.
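The one-I/O-process-per-node layout described above (stride 64, offset 1, 675 PIO processes) can be sketched as follows. `pio_io_ranks` is a hypothetical helper of mine for illustration, not PIO's actual API.

```python
def pio_io_ranks(ntasks, stride, offset):
    """Ranks acting as PIO I/O processes: every stride-th rank starting
    at offset (here: one I/O rank per 64-task node)."""
    return list(range(offset, ntasks, stride))

# 675 nodes x 64 MPI tasks, as in the run above.
io_ranks = pio_io_ranks(ntasks=43200, stride=64, offset=1)
print(len(io_ranks), io_ranks[:3])  # 675 I/O ranks: [1, 65, 129, ...]
```

This matches the timer data above, where process 0 is not a PIO process and process 1 is.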
This communication operator was implemented with an MPI point-to-point algorithm. If I instead used MPI_Alltoallv, the cost went to almost zero for this particular timer. However the cost then moved to the next point-to-point implementation of a non-local communication operator. Also, with each replacement the overall cost of CPL:Init generally increased. For example, when replacing all point-to-point calls with collective (MPI_Alltoallv, MPI_Alltoallw, MPI_gather) CPL:Init cost increases to
"CPL:INIT" 2426.001709 seconds
"comp_init_cc_atm" 607.984802
"comp_init_cc_lnd" 1129.624390
"comp_init_cc_ice" 561.643188
as compared to the version with the point-to-point implementation.
"CPL:INIT" 1420.839233 seconds
"comp_init_cc_atm" 879.601685
"comp_init_cc_lnd" 93.860229
"comp_init_cc_ice" 350.327026
Note that the point-to-point implementation was motivated by very poor performance with Cray-optimized versions of MPICH on the Gemini network when MPI_Alltoall was used to implement this many-to-some pattern, especially when there is load imbalance coming into the operator. The assumption is that the degraded performance, even in the MPI library collectives, was due to a very large number of unexpected messages. It also appears to be a useful optimization on the NERSC systems, though not as dramatic as on Titan.
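A toy model of why the point-to-point implementation touches far fewer connections than a dense collective (pure Python; the contiguous block-ownership mapping is an illustrative assumption of mine, not PIO's real box-rearranger mapping):

```python
def message_pairs(ntasks, stride, offset):
    """Toy many-to-some pattern: each compute task sends only to the
    I/O task owning its contiguous block of the global array."""
    n_io = ntasks // stride
    pairs = set()
    for i in range(ntasks):
        io_rank = (i * n_io // ntasks) * stride + offset
        pairs.add((i, io_rank))
    return pairs

# 64 nodes of 64 tasks, one I/O rank per node.
n, stride = 4096, 64
pairs = message_pairs(n, stride, 1)
print(len(pairs), n * (n // stride))  # 4096 actual sender/receiver pairs vs
                                      # 262144 pairs a dense all-to-all would touch
```

Under the first-touch hypothesis above, the connection-setup cost is paid per communicating pair, so the point-to-point pattern sets up far fewer connections; the dense collective presumably benefits from library-internal optimization instead.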
Ran a 1350 node job. This reproduces @ndkeen 's earlier results, but my runs also have some additional barriers, so this further verifies the scaling issues:
270 nodes (17280 MPI processes, 64 per node)
"CPL:INIT" 543.889771 seconds
"comp_init_cc_atm" 282.020813
"comp_init_cc_lnd" 59.025913
"comp_init_cc_ice" 141.156219
450 nodes (28800 MPI processes, 64 per node)
"CPL:INIT" 874.999878 seconds
"comp_init_cc_atm" 483.143494
"comp_init_cc_lnd" 78.226891
"comp_init_cc_ice" 228.905304
675 nodes (43200 MPI processes, 64 per node)
"CPL:INIT" 1420.839233 seconds
"comp_init_cc_atm" 879.601685
"comp_init_cc_lnd" 93.860229
"comp_init_cc_ice" 350.327026
Called Total Max
"a_i:pre-CC B-bar" 6 0.042269 0.037482
"a_i:CC B" 6 280.114380 278.405823
"a_i:post-CC B-bar" 6 173.111191 172.199188
1350 nodes (86400 MPI processes, 64 per node)
"CPL:INIT" 4139.669922 seconds
"comp_init_cc_atm" 2986.752197
"comp_init_cc_lnd" 294.270386
"comp_init_cc_ice" 693.185913
Called Total Max
"a_i:pre-CC B-bar" 6 0.037806 0.029725
"a_i:CC B" 6 1008.000854 1004.3925783
"a_i:post-CC B-bar" 6 650.500305 648.610535
Note that CICE initialization (not MPAS-CICE) also grows, but ATM is the dominant cost. I have not looked at CICE closely (and hope not to have to), but I have no reason to believe that it is not the same MPI initialization issue.
Since it was easy, I ran some ne120 F cases on edison (after upgrade). The init times aren't nearly as bad. One run with 21600 MPI's and the other 2 with 43200 (I think I was testing cpu_bind flag or something).
SMS.ne120_ne120.mfedihotHn0900p21600t01/timing/acme_timing_stats.170727-120549.gz:"comp_init_cc_atm" 21600 21600 2.160000e+04 2.685356e+06 124.430 ( 15575 0) 124.272 ( 12227 0)
SMS.ne120_ne120.mfedihotHn1800p43200t01/timing/acme_timing_stats.170727-172203.gz:"comp_init_cc_atm" 43200 43200 4.320000e+04 1.047854e+07 242.902 ( 16478 0) 242.467 ( 15011 0)
SMS.ne120_ne120.mfedihotHn1800p43200t01b/timing/acme_timing_stats.170727-173615.gz:"comp_init_cc_atm" 43200 43200 4.320000e+04 1.111384e+07 257.492 ( 19944 0) 257.155 ( 39246 0)
I have a reproducer that we can pass on to NERSC and Cray if you want. I'll continue to refine this, and maybe make it even smaller. I'll also be using it to examine workarounds.
This just measures the cost of the PIO communication pattern used during the data type creation in box_rearrange_create when there is one PIO process per node (stride 64, offset 1) for two successive calls, surrounded (and separated) by barriers. Time reported is maximum over all processes.
4 nodes (256 processes):
(a) 2.989506721496582E-002
(b) 1.514911651611328E-003
256 nodes (4096 processes):
(a) 58.5745210647583
(b) 0.119852066040039
675 nodes (43200 processes):
(a) 456.856745004654
(b) 0.653519868850708
1024 nodes (65536 processes):
(a) 748.560154914856
(b) 0.525825977325439
1350 nodes (86400 processes):
(a) 1633.26983404160
(b) 1.03820204734802
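The reproducer numbers above can be summarized with a quick fit. This is a sketch over the transcribed values only; nothing here comes from the reproducer code itself.

```python
import math

# First-call (a) and repeat-call (b) times from the reproducer runs above.
procs  = [4096, 43200, 65536, 86400]
first  = [58.57, 456.86, 748.56, 1633.27]   # (a): first use of the pattern
second = [0.120, 0.654, 0.526, 1.038]       # (b): immediate second call

# Fit first-call cost ~ procs**p across the extremes.
p = math.log(first[-1] / first[0]) / math.log(procs[-1] / procs[0])
print(round(p, 2))  # ~1.09: first-call cost grows roughly linearly in process count

# The second call stays near a second everywhere: the cost is a one-time,
# connection-setup-like overhead, not the steady-state communication.
print(max(second) < 2.0)  # True
```

A roughly linear-in-processes one-time cost is consistent with per-pair connection establishment the first time each pair communicates.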
Hi Pat, I was distracted by the edison upgrade. You have a reproducer? Wow. Very well done. Certainly interested. Should we put it in the repo? I know that can be a lot of work. I did send emails to two folks at NERSC, but have not heard back. With stand-alone code, I can be much more confident to ask for help. Does it surprise you that edison is quicker? Note, it's free to run on edison until July 31st.
I don't know what would be involved with putting it in the repo. Where would it go, and what would it be used for in the future (and how would it be maintained)?
At the moment my driver code references pio_kinds.F90 and pio_support.F90 in cime/src/externals/pio1/pio, and pio_spmd_utils.F90 from a build directory, generated from pio_spmd_utils.F90_in, also in cime/src/externals/pio1/pio. For this standalone reproducer, I just copied over these source files and included instructions for building using these files, based on the ACME Cori-KNL module loads and compiler options.
@ndkeen and I had a brief email exchange on creating a performance test suite (e.g. as a cime test suite) mostly for benchmarking our primary configurations, but small test cases/reproducers that hit typically performance-sensitive bits like this would also be useful.
This particular test is targeting an issue that seems to be somewhat peculiar to Cori-KNL, and only occurs during the initialization. A more typical PIO gather or scatter communication test would be different. (Actually that is what I started from - I modified it to use the initialization pattern.)
@jayeshkrishna may already have PIO1 and PIO2 performance tests that include these MPI communication patterns? I'm not competent to modify PIO, or CIME, to add build or run support for these types of tests. I can continue to prototype them, if that would be useful.
I suppose whether or not it goes into the repo (and where) is up to others, but I'm a fan of stand-alone tests as we can do more with them. I would certainly like to try it when you are ready to let me see it.
@ndkeen , please grab ATM_initialization_cost_scaling_reproducer.tar.gz from /global/u2/w/worleyph at NERSC. This expands to a directory called ATM_initialization_cost_scaling_reproducer.
There is a Notes.txt file in the directory that should get you started. Further questions are probably best handled in private e-mails until you are ready to hand this off to NERSC.
Hi @jayesh (and cc'ing @ndkeen ),
(Summarizing above material) The latest on the poor initialization scaling on Cori-KNL is that this appears to be occurring the first time two processes communicate, and is showing up when calling swapm in compute_counts in box_rearrange_create (the first time this is called in ATM initialization). If both swapm calls in compute_counts are replaced with calls to MPI_AlltoallV, then this first-call overhead disappears from there and reappears in LND initialization in the first couple of calls to a2a_box_rear_io2comp_double or swapm_box_rear_io2comp_double, i.e. whether using MPI_AlltoAllW or swapm.
I created two standalone test programs that either gather information from all "compute" tasks to "IO" tasks (comp2io) or scatter information from "IO" tasks to "compute" tasks (io2comp). This verifies the above: the first call when using swapm or MPI_AlltoAllW has a high cost and scales poorly, and MPI_AlltoallV performs much better. This does not provide a real workaround since we can't use MPI_AlltoallV for a2a_box_rear_io2comp_double - this truly is an MPI_AlltoAllW operator unless we stop using MPI data types and pack and unpack buffers manually.
Note that the performance comparison for subsequent calls is more complicated, but it does capture the "with handshaking" and "without handshaking" difference in performance for comp2io and io2comp. Since these standalone programs do not capture the actual comp2io and io2comp message patterns in ACME - e.g. not all compute tasks send to all io tasks - they are also not adequate for optimizing settings in ACME, and this still needs to be done in the model.
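The "pack and unpack buffers manually" alternative mentioned above amounts to flattening what an MPI indexed datatype describes into a contiguous buffer that plain MPI_AlltoallV could ship. A minimal sketch (all names are mine, not PIO routines):

```python
def pack(source, offsets):
    """Gather the elements an MPI indexed datatype would describe
    into a contiguous send buffer."""
    return [source[i] for i in offsets]

def unpack(buf, offsets, dest):
    """Scatter a contiguous receive buffer back to its target positions."""
    for val, i in zip(buf, offsets):
        dest[i] = val

data = list(range(10, 20))
offsets = [0, 3, 4, 9]        # the indexed-type displacement list
buf = pack(data, offsets)     # contiguous: usable with MPI_AlltoallV
out = [None] * 10
unpack(buf, offsets, out)
print(buf)  # [10, 13, 14, 19]
```

The trade-off is extra buffer copies and memory on every call, in exchange for avoiding the MPI_AlltoAllW first-call penalty.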
I started instrumenting compute_counts to output the different message patterns created in PIO so that I could summarize for @ndkeen and perhaps further generalize the standalone test programs. However, I just remembered that you have been working on PIO test programs that also use extracted communication patterns - perhaps these are what Noel and I should be using, or perhaps you can help us with this testing on Cori-KNL.
At this point, the only solution I can think of is to tell NERSC and Cray about the high start-up cost in swapm and MPI_AlltoAllW and ask them to see if they can do for these whatever magic they did for MPI_AlltoallV. However, having an accurate PIO test program may allow us to experiment with our own workarounds (since I am not holding my breath that Cray will address our request in a timely fashion).
Thanks Pat. Indeed I have been taking the stand-alone tests prepared by Pat and running them with various parameters (inputs, num MPI's, MPI env flags, etc). I'm trying to:
a) learn more about this issue in general
b) try quick/easy experiments that could be great work-arounds if they work
c) "package up" in a way that we can send to NERSC for additional help
d) obtain more profiling data using craypat
I was derailed yesterday due to the power outage and now I need to work on some other things, but will soon get back to it. Jayesh: please let me know if there's something else I should be doing/trying.
Using master from Oct 10, I ran several more F compsets. Pat already discovered that setting PIO_ROOT=0 was a benefit to the ATM init time. I was already using that in several other experiments, but not for these ATM init runs. So these are with a new repo and with:
./xmlchange --file env_run.xml PIO_ROOT=0
But it does NOT include the changes that Pat made in a branch.
675 nodes, 43200 MPI's, 2 threads:
"comp_init_cc_atm" 43200 43200 4.320000e+04 3.303537e+07 764.829 ( 3471 0) 764.613 ( 15367 0)
1350 nodes, 86400 MPI's, 2 threads:
"comp_init_cc_atm" 86400 86400 8.640000e+04 1.844306e+08 2134.710 ( 62805 0) 2134.565 ( 74474 0)
1350 nodes, 86400 MPI's, 1 thread:
"comp_init_cc_atm" 86400 86400 8.640000e+04 1.975688e+08 2286.824 ( 74283 0) 2286.572 ( 23764 0)
That's 35 minutes which is quite an improvement already.
FYI PIO_ROOT == 0 seems to perform better on Titan and on Anvil as well, both low and high resolution cases, and both for initialization and for the run loop.
That's great. Is @wlin7 using this change?
@PeterCaldwell , I have a branch with two changes to address this. The first (setting PIO_ROOT to zero by default) is noncontroversial. The second may be specific to Cori-KNL. I'll split these into 2 branches and submit a pull request for the first one later today.
Great, thanks for your work on this!
@ndkeen , I have "found" another way to address this issue:
MPICH_GNI_DYNAMIC_CONN
By default, connections are set up on demand. This allows
for optimal performance while minimizing memory
requirements. If set to enabled, dynamic connections are
enabled; if set to disabled, MPICH establishes all
connections at job startup, which may require significant
amounts of memory.
I am running experiments with
setenv MPICH_GNI_DYNAMIC_CONN disabled
and initialization appears to be much faster. Setting PIO_ROOT to zero still seems to be a good idea, but the above may be the real solution for us, especially since at high process counts we do not have a large memory footprint and can afford the extra MPI memory?
One concern is that the initialization cost is pushed to MPI_Init, which we do not measure directly. The CaseStatus timing data also includes pre- and post-processing times, which makes it difficult to pull out just the srun time. I've created another github issue ( #1857 ) to get this fixed, but don't have the fix available for these studies yet.
Another concern is whether the RUN_LOOP performance is degraded by doing this. The discussion above does not imply that it would be, but we still need to be sure.
You might try the above in some of your experiments as well, to start evaluating the impact. The nice thing about this solution is that it is system-specific, and we would not impact performance on any other system.
You should also ask your NERSC and Cray contacts about this, and why they did not suggest it to us when we first asked them about mitigation techniques. Sure seems to be relevant.
Yea, sorry Pat, in fact, this was what Nathan W (of Cray) did suggest. I thought I had noted it somewhere but can't find it. He was not optimistic it would help because, as you noted, the time might just show up in MPI_Init(). It was still on my list of things to try.
And, indeed the time to do whatever is happening before or after the acme.exe cost should be measured. This time is too high for runs on KNL and is wasting MPP's at high scales. I've not had any luck getting anyone to help with this. We just simply can't do those tasks on the compute nodes. The batch file should ONLY be the srun command.
Okay, I'm trying now. Looks to be working, but I need to figure out how to determine the impact on MPI_Init. Allocating space and doing setup all at once, versus reacting every time a message comes in from a new process, should be faster, and perhaps much faster. I am still hopeful.
@ndkeen , it appears to be somewhat advantageous to set MPICH_GNI_DYNAMIC_CONN=disabled. By looking in the job script output, e.g.
2017-10-20 17:21:34 MODEL EXECUTION BEGINS HERE
...
2017-10-20 18:10:18 MODEL EXECUTION HAS FINISHED
you can calculate total time to run the model.
For example, when MPICH_GNI_DYNAMIC_CONN=enabled, then the above (18:10:18 - 17:21:34) time is just seconds larger than (Init Time + Run Time) from the acme timing summary file.
In contrast, when MPICH_GNI_DYNAMIC_CONN=disabled then the two model cost measures can differ by many minutes (as shown below). So it would appear that the above is a reliable way to measure total model cost.
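The timestamp arithmetic described above can be sketched as follows; `srun_seconds` is my own helper, and the format string assumes the job-script timestamps quoted here.

```python
from datetime import datetime

def srun_seconds(begin, end, fmt="%Y-%m-%d %H:%M:%S"):
    """Wall time between the MODEL EXECUTION BEGINS/FINISHED markers."""
    return (datetime.strptime(end, fmt) - datetime.strptime(begin, fmt)).total_seconds()

secs = srun_seconds("2017-10-20 17:21:34", "2017-10-20 18:10:18")
print("%dm %ds" % (secs // 60, secs % 60))  # 48m 44s of total model wall time
```

With MPICH_GNI_DYNAMIC_CONN=disabled, this job-script figure exceeds (Init Time + Run Time) by the amount hidden in MPI_Init, so comparing the two is how the "missing" cost is measured.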
I ran experiments using
-res ne120_ne120 -compset FC5AV1C-H01A
on Cori-KNL using 270, 675, and 1350 nodes, all with PIO_ROOT = 0. In all cases, using MPICH_GNI_DYNAMIC_CONN=disabled was faster, at least by a little bit. The biggest difference was with 1350 nodes (86400x1 ATM decomposition), as you would expect.
Ran this twice with both MPICH_GNI_DYNAMIC_CONN settings. There was some performance variability, but 'disabled' was always better. Fastest runs for each are described below (one time step, no restart write):
MPICH_GNI_DYNAMIC_CONN = enabled (so current default)
48m 4s by timers
- 46m 25s in Init Time
- 1m 39s in Run Time
48m 44s by job script output
MPICH_GNI_DYNAMIC_CONN = disabled
16m 34s by timers
- 15m 58s in Init Time
- 1m 36s in Run Time
36m 49s by job script output
So, almost 12 minutes faster by using MPICH_GNI_DYNAMIC_CONN = disabled. It is annoying that the model cost is not all captured in the standard timing summary when using this setting, but the data is captured. I'll look into whether we can inject these data into the summary, but I would probably need help from CSEG.
In any case, please run your own experiments. This is all using MPI-only (though that should not make any difference), and we should verify that there is no impact on performance on the lower resolution models (should not be). This also makes my proposed pio_N2M replacement for pio_swapm in compute_counts irrelevant, and setting MPICH_GNI_DYNAMIC_CONN = disabled is more effective as well.
Hmm, I tried "MPICH_GNI_DYNAMIC_CONN = disabled" and it's a little slower, but more troubling is that my repos dating after Oct 16th are slower to init than before. On Oct 13th, I measured 687 seconds for comp_init_cc_atm -- cori-knl, 675 nodes, 43200 MPI's, 2 hyperthreads each. On Oct 16th, I measure 1082 s (and 1285 s with the disabling of the above env var). All of my runs use PIO_ROOT=0. I'll keep trying.
Bummer - I thought that we had finished off this issue. I'll try to rerun some of my experiments as well.
Just now ran this same case (675 nodes, 43200 MPI's, 2 hyperthreads each).
a) "MPICH_GNI_DYNAMIC_CONN = enabled"
Init Time : 927.938 seconds
or 15m 28s
2017-10-25 17:13:57 MODEL EXECUTION BEGINS HERE
2017-10-25 17:31:46 MODEL EXECUTION HAS FINISHED
or 17m 49s
b) "MPICH_GNI_DYNAMIC_CONN = disabled"
Init Time : 413.762 seconds
2017-10-25 17:56:47 MODEL EXECUTION BEGINS HERE
2017-10-25 18:10:37 MODEL EXECUTION HAS FINISHED
or 13m 50s
So, I still see an advantage, even compared to just Init Time for (a).
@ndkeen , just to be clear, my case was
-compset FC5AV1C-H01A -res ne120_ne120
using master (updated today).
I just merged https://github.com/ACME-Climate/ACME/pull/2026 into master, which sets setenv MPICH_GNI_DYNAMIC_CONN=disabled (only for cori-knl and only for cray-mpich). I see ~25% improvement in init when I use this for a 43200-way F-compset.
I suspect we are not done with this issue, but this hopefully helps (and doesn't cause any other issues).
It's also possible that we might have to go back to using PIO_ROOT=1 for memory concerns. I currently still have it set to 0, but have been running into memory issues.
Noting that this issue & PR also had an impact.
https://github.com/E3SM-Project/E3SM/issues/2496
It looks like we have solved the issue of long OCN init times, however, now I'm seeing that as I add mpi tasks to the ATM, the init time increases more than I was wanting. If there is anything obvious here, let me know.
For a run where I used 1350 nodes and 86400 MPI tasks, the total ATM init time is 2571 seconds. This is a copy/paste of the top of model.timing.00000 (where ATM and the ioprocs live). This is with a PIO stride of 64, so 1 ioproc per node, which @mt5555 has been telling me is way too many. I will try some experiments using larger strides and see what happens.
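For reference, the relationship between PIO stride and ioproc count for this 1350-node layout can be sketched as below. This uses plain integer division over strides that divide the task count evenly; PIO's exact rounding for non-divisible strides may differ.

```python
ntasks = 86400  # 1350 nodes x 64 MPI tasks, as in the run above
for stride in (64, 96, 128, 192):
    print(stride, ntasks // stride)
# stride 64 gives 1350 ioprocs (one per node); doubling the stride to 128
# halves that to 675, which is the direction suggested above.
```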